Node.js on the Road: Fruit Loops and Cheerios: Frontend Node At Walmart

Node.js on the Road is an event series aimed at sharing Node.js production user stories with the broader community. Watch for key learnings, benefits, and patterns around deploying Node.js.

Kevin Decker discusses the various techniques and technologies that allow Walmart's mobile site to render on both the client and the server with a single codebase.

Kevin Decker, Sr. Mobile Web Architect

Hello everyone. As mentioned, I'm Kevin Decker. I'm in charge of Walmart's Mobile Web technology stacks. So a little bit of background on what we're implementing there. We have a Backbone-based single page application that is powering both the site and our UK subsidiary's mobile site.

And we've been using this for about three years. It replaced a Java-based legacy application which had a number of interesting code patterns in it and which we felt was no longer maintainable, so we've seen some really good results with this. We have a relatively small team of front-end engineers, you can probably count them on your hand at any given point in time, and we've been able to implement both of these sites, sometimes on crunched timelines for things like Black Friday. That being said, there are some places where we feel like it's not quite as solid as we'd like, or at least as of the beginning of the year it wasn't as solid as we'd like. And this primarily comes from us serving an effectively static page with a couple of script tags. For both crawlers like Google and general users, this has issues. The first is SEO: initially Google hadn't made any formal comments on what sort of support they have for scraping dynamic content.

They have since made some comments to that effect, but we still don't quite trust that Google is seeing what we want them to. Until we can actually debug what they're seeing and know for a fact that it's being served properly, we still don't trust it. The other issue, which applies to both Google and all of our users, is the delay that you can have with the initial rendering of the content.

You have to download all the JavaScript, you have to execute it, and then finally you can see some content as a user. For small pages this is fine. With small pages or on reliable connections, you're not going to really see any major impact. But for a site that's developed over three years and also primarily targets the horrendous mobile networks we often encounter, it's not that good.

So here I have an example of the two different modes that we have in place. On the left is the client-side render, and on the right is the server-side render that we have since implemented. And I would like to add the caveat that these are throttled to a 3G connection, so it does load faster under ideal conditions. But if you watch, the server rendered page has something there for the user much faster than the other page. On the other page there's a spinning indicator, but the user otherwise just doesn't know what's going on.

They know something's happening, but they don't know what their end experience is going to be. So rather than waiting for everything to be available, we push what we can. Anything that's public, and satisfies our caching rules for that page, we're going to render on the server and push to Akamai, and it will be available much faster than it would otherwise. Then on top of that we use our client-side JavaScript to augment the page, which you see in the personalized "You might like" section down at the bottom.

So knowing that we wanted to move towards this server rendered direction, we kind of felt that there were three different choices that we could take. One would be to either re-implement the pages that we care about this on, throwing out the existing implementation, or forking the implementation and having to maintain the site twice, which was a non-starter for us. The pages that we cared about this on are our most important pages, being the item page, home page, and other things that have a fair amount of business requirements behind them, and just would not be pleasant to try to maintain in that manner.

So, throwing that out, we looked at PhantomJS, which we already had experience with for our CI environments. There are many people in this space, including some that have startups behind this, so it's certainly a valid approach. But for our environment, with our caching rules and the scope of our product database, it felt like we wouldn't be able to actually cache anything. They would all be live hits, and we would have to pay the performance price of spinning up a Phantom instance, or have to come up with some creative pooling to make this work.

So the final solution that we ended up coming to was close to the Holy Grail that has been thrown around so many times when speaking about front end and Node environments, and that is, you write it once and you run it anywhere. In reality it's not quite that simple, but it's very close to that, and I don't think anyone is surprised by this, but this is all powered by Node and most of it is open sourced.

The stack that we're using is made up of hapi for our HTTP routing, which was developed by my colleagues at Walmart Labs. It's bulletproofed as best we can and stress tested, so it's pretty stable at this point, including through our Black Friday release that created so much fun for everyone. Also, we're using Contextify. The sole purpose of this is that we need an isolated environment that we can run our client-side JavaScript in, because there are different requirements. Things like setTimeout have to operate in a slightly different manner in order to work effectively, which I'll get into in a second. But the reason we went with Contextify here was that it's the path moving forward for Node VMs. By my understanding, it's going to effectively be the new API under the 0.12 release.
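A minimal sketch of the isolation idea, assuming Node's built-in `vm` module as a stand-in for Contextify. The `runClientScript` helper, the `timers` array, and the replacement `setTimeout` are hypothetical names for illustration, not part of the stack described in the talk.

```javascript
const vm = require('vm');

// Hypothetical helper: run a string of client-side code in its own global scope.
function runClientScript(source, timers) {
  const sandbox = {
    // Replacement setTimeout: records the request instead of calling the host timer,
    // so the server decides when (and whether) the callback runs.
    setTimeout(fn, delay) {
      timers.push({ fn, delay });
      return timers.length;
    },
    result: null,
  };
  vm.createContext(sandbox);
  vm.runInContext(source, sandbox);
  return sandbox;
}

const timers = [];
const page = runClientScript(
  'var title = "home"; setTimeout(function () { result = title.toUpperCase(); }, 50);',
  timers
);
```

Globals declared by the script, such as `title` here, end up on the sandbox object rather than on the server's own global scope, which is the property that matters for running page code safely alongside server code.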

So assuming everything works as planned, we're able to swap out the modules and shouldn't have many concerns. So moving on to the actual API, the browser-level APIs: we didn't feel like implementing the DOM itself was a valid approach. There are so many things that require ES5-level features that have a performance impact, at least from our profiling, that it didn't feel like it would ever scale to the needs that we have. There is also just so much minutiae involved with the DOM over the many years that it's been around that trying to recreate something like that would be quite painful. And that's not saying that the approach that we went with, implementing a jQuery API, doesn't have its own minutiae, because it certainly does, but it's much easier to implement and the code is inherently ES3 based, since that's what it was spawned in.

So Cheerio provides us with the basic DOM-equivalent API, but we still have the problem of actually executing the page, which is where our Fruit Loops library comes in. And Fruit Loops, if anyone is wondering, is sugary Cheerios, so it's designed to be an extension on Cheerio to provide browser lifetime support, so things like loading your page, loading scripts, handling things like basic history and location, etc. But also, more importantly, it provides execution tracking for your code. This is modeled somewhat after the Node event loop. As your page executes, it's difficult to know when it's actually completely rendered, so the approach that we've gone with is waiting until the point that there's no further code that is waiting on execution.

So if you have a timeout, it's going to wait until that timeout completes. If you have an AJAX request, it'll wait for that as well, but once all those are completed, the page will be returned to the caller and eventually pushed out to the user, and this gives us a very simple way to handle the code in a generic manner. Basically the only caveat that we have is you can't use setInterval. That's not implemented, because obviously setInterval will kind of kill your server. And I do need to note that all of those libraries are more or less standalone and independent. They make no assumptions, other than the jQuery API assumption, about what your actual front-end environment may be.
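The "wait until nothing is pending" idea can be sketched as a simple counter. This is an illustration under stated assumptions, not Fruit Loops' actual implementation: `createPendingTracker`, `pendingCount`, and the `onIdle` callback are hypothetical names, and only timers are tracked here (a fuller version would count AJAX requests and the initial script run the same way).

```javascript
// Hypothetical tracker: counts outstanding async work and reports when none is left.
function createPendingTracker(onIdle) {
  let pending = 0;
  let finished = false;

  function done() {
    pending -= 1;
    if (pending === 0 && !finished) {
      finished = true;
      onIdle();
    }
  }

  return {
    // Wrapped setTimeout: the page is not complete until this callback has run.
    setTimeout(fn, delay) {
      pending += 1;
      return setTimeout(() => {
        try {
          fn();
        } finally {
          done();
        }
      }, delay);
    },
    // Deliberately unsupported: a repeating timer would never let the page go idle.
    setInterval() {
      throw new Error('setInterval is not supported on the server');
    },
    pendingCount: () => pending,
  };
}

const output = [];
const tracker = createPendingTracker(() => output.push('emit page'));
tracker.setTimeout(() => {
  output.push('first');
  // Work scheduled from inside a callback is counted before the outer one finishes.
  tracker.setTimeout(() => output.push('second'), 5);
}, 5);
```

Because the nested timer is registered before the outer callback is marked done, the count never touches zero early, so the page is only emitted after both callbacks have run.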

To assist those that are using our environment, we have the Hula Hoop library, which takes opinions and says, hey, you're probably using the Thorax framework, which is a Backbone extension library. That effectively makes it very easy to start up a server and run with that, if you are using the stack that we have in place, which is all open sourced. There are very few generic technology concerns that we don't open source, unless legal gets involved. But Hula Hoop's pretty simple to set up. It's just a Node module; you tell it what you have. The resource loader is based off of both directory paths as well as some metadata.

You can construct this yourself, but the Lumbar build tool that we have also assists with this process. Then you create your page handler, which does all the heavy lifting. And this has the nice ability of being able to dynamically switch between your client-side and server-side mode. Either you can have a configuration parameter so that you can A/B test a particular route on server side versus not as you roll it out, or alternatively, should you have a code error and it all blows up, it's still going to fail safe from the user's perspective.
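The switch between modes and the fail-safe behavior might look roughly like the following. This is a hedged sketch, not Hula Hoop's API: `createPageHandler`, `CLIENT_SHELL`, the `serverRoutes` option, and the example routes are all hypothetical.

```javascript
// Static page that boots the client-side app (hypothetical markup).
const CLIENT_SHELL = '<div id="app"></div><script src="/app.js"></script>';

// Hypothetical page handler: tries a server render and falls back to the client-side shell.
function createPageHandler(renderOnServer, options) {
  return function handle(route) {
    // Config switch, e.g. for rolling out or A/B testing one route at a time.
    if (!options.serverRoutes.includes(route)) {
      return { mode: 'client', html: CLIENT_SHELL };
    }
    try {
      return { mode: 'server', html: renderOnServer(route) };
    } catch (err) {
      // Fail safe: the browser can still boot the app from the static shell.
      return { mode: 'client', html: CLIENT_SHELL, error: err.message };
    }
  };
}

const handle = createPageHandler(
  (route) => {
    if (route === '/broken') throw new Error('render failed');
    return '<div id="app"><h1>' + route + '</h1></div>';
  },
  { serverRoutes: ['/item', '/broken'] }
);
```

Here `handle('/item')` is rendered on the server, `handle('/cart')` is served as the client-side shell because it is not enabled, and `handle('/broken')` falls back to the shell after the render throws.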

It may fail downstream, because it may be a services issue, but at least you can push things out, so the user has the best opportunity of succeeding. Then once everything is registered, you simply set up your hapi endpoints. It's pretty simple. We decided to not be as opinionated here. We just provide the handlers rather than specify your exact expires, because you may not have a long expires system in place or may have various other concerns, and then you also need to set up the actual route handlers, which will serve the page.

There are multiple ways to do this. We had to do it on a route-by-route basis for some legacy concerns that we have, but you could also do this as a star path and just run it all. So this is all fun and cool technology as far as I'm concerned, but it's not without its issues, just like any other technology.

Most of this is very much bleeding edge. I believe two of the projects are still in pre-release mode on December, those being Thorax and Handlebars, and then many other projects are still on 0.x releases. It is being stress tested in our production right now, but we will have to spend a bit of time to make sure that everything is polished to the extent that we need it to be to make it a formal 1.0, or the next non-prerelease version.

So outside of that, if you're willing to make that jump or you are on a project where you just want to play around with it, the other issue that we ran into was team best practices. When you're taking a relatively short lived front-end application and making it long running, there are many, many things that can pop up, particularly around memory management, that you didn't have to worry about before, and that may not have been found through your QA or other development processes.

So we had to retrain a number of bad habits that we had in our development process, but the code ended up better from it. And as I'd noted, memory was one of the big things. There are some issues around Contextify. You have to use it in a very particular manner to make sure that it's not going to cause extra stress on your garbage collector, which as I found out today is something that we should be able to resolve, but we'll have to resolve as we work towards the 1.0 release of this.

And then other general concerns would be trying to avoid old space whenever possible with these long running applications. There is just a lot more stress on the garbage collector for these cases. And this isn't related to Node at all, not really related to this particular project, but one thing that we found was Akamai's private caching isn't actually private caching. They will transparently upgrade you to public caching, which, had it made it to production, would have been very bad. Luckily we found that, and so I'm just simply trying to make sure everyone is aware of that.

So as I kind of touched on earlier, the conditional behavior is necessary for things like best practices around event handling. There are a number of things that don't matter on a server. You're never going to have a mousedown event on the server. So you generally want to avoid things like that.

You want to avoid registering handlers or any other code that is dependent on that, just due to the overhead. Also, things like loading indicators and other things that are designed to provide psychological performance to the application are actually hindrances on the server side. Since a user doesn't see that interim state where you have a loading indicator, they only see the delay that it may cause in the final response to them, so we need to optimize things like that out.
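As a rough illustration of that conditional behavior, a view setup routine can skip browser-only work when a server flag is set. Everything named here (`setupView`, `isServer`, `createStubView`, the step labels) is hypothetical and only stands in for whatever the real view layer provides.

```javascript
// Hypothetical view setup that skips browser-only work when running on the server.
function setupView(view, env) {
  const steps = [];
  if (!env.isServer) {
    // Event handlers and loading indicators only matter where a user can see and click.
    view.on('mousedown', view.onPress);
    steps.push('bind events');
    steps.push('show loading indicator');
  }
  steps.push('render');
  return steps;
}

// Minimal stand-in for a view object, just enough to observe what gets registered.
function createStubView() {
  const handlers = {};
  return {
    handlers,
    on(name, fn) {
      handlers[name] = fn;
    },
    onPress() {},
  };
}

const serverView = createStubView();
const serverSteps = setupView(serverView, { isServer: true });
const clientView = createStubView();
const clientSteps = setupView(clientView, { isServer: false });
```

On the server only the render step runs and no handlers are registered, while the client path binds events and shows the indicator before rendering.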

If you have enough CPU power or it's very minimal overhead, you may not need to do these things, but we found that it was very important for us to do in our environment. The other thing that I think was the biggest problem (and this applies to any environment where you have a complex client side application that is tying to server generated HTML) is how exactly you associate the views or whatever JavaScript space data that you need to associate with that HTML, and this is for things like event handling, and analytics, and whatever you may need.

Within thorax, we implemented a heuristic that allows us to track the behavior or to track the data that is used to render the view and then we pass this to the client. It's certainly not foolproof, but we have error handling in place that will automatically re-render the content as necessary should you hit one of the pathological cases, and all of these are very well documented in tests, so at the very least we know what's going on when. And a slide that does not fit the page. So one of the problems that we were very worried about and it turned out to not have been a problem for us yet was our CPU utilization.

This particular approach has a lot of CPU usage relative to something that's just simple string concatenation and passing it on to the user. This is a trade-off that we felt was worthwhile given the design of our app as it stood at the time, but there's additional CPU. To our surprise, even when we disabled caching completely because we wanted to see what would happen, we never really hit a load average much higher than 0.25. There's some mysticism to load average, but that's roughly 25% of CPU at any given time, which was far better than we expected.

We expected to have concerns where there would be a pool of waiting requests waiting on our queue and other pathological cases like that. And a lot of these concerns came from our pre-release stress tests, where we removed all I/O from the system and cached it in pretty aggressive ways, and then we threw a number of concurrent requests at the system, and saw some very, very scary numbers.

It turns out that this is not the right way to stress test a Node environment, since it is so inherently linked to I/O bound processes. If you try to remove those and do something that's CPU only, of course you're going to see problems. But whether or not those problems are going to actually impact you in production, you won't know if you make such a synthetic test.

So with that, and assuming that this all works out well, I have a bit of a demo. So this particular project is our thorax seed project, which provides everything that you need to get up and running in this environment. There's also a branch on this project that is the exact code that is used in this demo.

Right now we have a very simple list view that displays, I believe, public repositories on GitHub, along with some sort of number. It's been a while since I wrote this. So if you look at the content here, which I'm hoping will actually turn out, the content here has very little to it. Just some bootstrapping tags, and that's it. But by changing this and ignoring the lint issues, we now have all the content that goes into this page ready and available for anyone who needs it, be it Google or be it an end user, and also we have the inline API response.

So it's immediately available. Unfortunately, this one is highly unoptimized. You'd probably want to have some sort of filtering proxy in place in front of this to get it down to something less than the 100-plus KB it is now, but generally it's been about that level of effort once we got all the frameworks in place.
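Inlining a service response into the rendered page can be sketched as follows. This is an assumption-laden illustration rather than the demo's code: `inlineData`, the `window.__preload` global, and the sample repository data are hypothetical.

```javascript
// Escape "<" so a "</script>" inside the data cannot end the script tag early.
function toInlineJson(value) {
  return JSON.stringify(value).replace(/</g, '\\u003c');
}

// Hypothetical helper: embed a service response so the client can reuse it without refetching.
function inlineData(html, key, data) {
  const tag =
    '<script>window.__preload = window.__preload || {};' +
    'window.__preload[' + toInlineJson(key) + '] = ' + toInlineJson(data) + ';</script>';
  // A function replacement avoids "$" patterns in the data being interpreted by replace().
  return html.replace('</body>', () => tag + '</body>');
}

const out = inlineData(
  '<body><ul></ul></body>',
  'repos',
  [{ name: 'a</script>b', watchers: 9062 }]
);
```

The client-side code can then read the preloaded response on startup instead of issuing the same request again. Trimming the response to only the fields the page uses, as suggested above, would keep the inlined payload small.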

You flag the things that you need to flag. If you're doing simple lists, or anything like that, it's probably going to render fine. We have had a few places, such as our carousel, which required knowledge of the layout, which isn't available on the server at all, where we had to have custom restore logic, but even that hasn't been that painful.

And assuming that I can find the right tab, you can also see that this 9062, whatever that number is (I assure you it's not Math.random), is being pulled from that service response. So you have full interactivity, and I don't have a demo here unfortunately, but you can also interact with this list content, re-render elements, insert new ones, and it all generally works.

So with that, I have links to the projects that are involved here. I will also post these slides afterwards, and thank you very much for your time.
