Node.js on the Road: Node At Walmart

Node.js on the Road is an event series aimed at sharing Node.js production user stories with the broader community. Watch for key learnings, benefits, and patterns around deploying Node.js.

Experiences with Node in production including issues, techniques, and general advice on moving from development to production.

SPEAKER:
Lloyd Benson, Architect
Walmart

Thank you again, everybody, for coming. I work at Walmart. Who am I? I'm not really famous like Eran or Ben or those kind of guys; I tend to work a little bit more behind the scenes. I'm an Architect at Walmart, and essentially I do a lot of stuff around Node. I do some development in Node, but I also built all the release processes, got the load balancers set up, all of those kinds of things to make it ready for production. And I'm the guy who was actually doing the button pushing during our Node release on Black Friday, and we felt very comfortable with that.

So that's kind of my claim to fame.

Just a quick overview of what I'm talking about. We're talking a lot about frameworks today, apparently. We have a framework called Hapi, and I'm going to talk a little bit about how we got started. Then I'm going to spend my 20 minutes giving you the stuff I wish I had known two or three years ago before starting this endeavor: real, here's-how-stuff-works material you can actually use. How to scale simply, things you may want to monitor, what to do when you have an issue, and some final thoughts about our experiences with Node in general.

So we have this framework called Hapi, at hapijs.com. The site recently went through a revamp, so if you thought it was just an API doc before, it looks slightly better now; hopefully that's useful for you. It's a framework for building applications and services, it's completely open source, and it's been open source from the very beginning.

It's a little bit different from Express in that we care more about configuration: everything that can go in configuration does, so the framework gets out of your way and you can focus on your business logic. It has a plugin architecture, and when you hear "plugins" you think of some huge registry where you can get any plugin you want, but that was not the original intent of plugins for us. We have these giant teams, and if I own checkout, or cart, or lists, I want to work in my own little world. So Hapi has this concept of server partials: you can start up and work in your own repository independently, and then when you need to go to production you glue all these things together, with all the routes, into your actual deployment.

And there are places in Hapi where, if you need to interject your own custom code within the request lifecycle, you can do that at any given point. We didn't really set out to get into the framework business either, but as it turns out, we're kind of glad that we did. So we have Spumko, and we have all these third-party modules. The great thing about the Node community is that you do everything in these small chunks, and you may not need cookies, or you may not like this portion of it. So it's a minimalistic approach: here's the basic functionality that you need, and here are some extra plugins; we maintain a lot of these, and the community maintains part of it. Some of these are utilities, and some are additional plugins to extend Hapi. And as you can see here, since I have so many people to talk to, I got an opportunity to actually mention the word poop, so I'm really excited about that. If you're familiar with the children's book, everybody poops. Well, so does Hapi.

So getting started.
So how did we actually get started? Well, we already had lots of customers, right? It's not like we were starting a new project and slowly ramping it up. We had lots of customers, and we needed to bring Node in, and that's a little bit of a scary prospect. So when we started out with Hapi, we asked: what's the most minimal thing we could do? We used a proxy, so everything routes through the Node layer, and Node still talks to the backend stuff. After that was successful, we started adding analytics. We said, OK, what are all the routes going through here, which are the slow ones, how can we improve them? And then we actually started taking over routes.
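The minimal pass-through stage might have looked something like this catch-all proxy route, written against the Hapi proxy handler of that era. The backend host is invented, and the option names are assumptions from that period's API, so check your Hapi version's docs before copying this.

```javascript
// Sketch of a catch-all route that proxies everything to the legacy
// backend, the "most minimal thing we could do" stage. The host is
// hypothetical and the handler/option names follow the Hapi API of
// that era, which has since changed (the proxy handler later moved
// into the h2o2 plugin).
var route = {
  method: '*',
  path: '/{p*}',                           // match every path
  handler: {
    proxy: {
      host: 'legacy-backend.example.com',  // hypothetical Java backend
      port: 8080,
      passThrough: true                    // forward client headers as-is
    }
  }
};

module.exports = route;
```

Once traffic flowed through this layer, individual slow routes could be replaced one at a time with native Node handlers.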

For example, this taxonomy thing was really, really slow, taking like 20 seconds. We wanted to try putting that in Node, and all of a sudden everything was really fast. We're still continuing to take over routes today. Then we moved our web component in there, our whole web tier, moving it away from Java land and into Node. We were expecting this huge load, thinking "I'd better get some extra servers," and all of a sudden we put the web tier in and, like, nothing happened.

It was pretty amazing. The applications, if you were using your phone or whatever, were screamingly fast, so end users noticed it. But from a system perspective I was like, did we even move it over? Is anything going on here? And then, after the big successes we obviously had with Black Friday, now the whole organization is using it in a lot of different ways.

And so the proxy was our way to get in, and it was the most heavily used feature, so it's been well, well tested for us.

Releases: how do you do a release in Node?
Well, this is a simplified version that's been really successful for us. You basically npm install, you test it out, you package it up into a tarball or something, you copy it over to the server, you untar it, and you start it up.
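The steps above can be sketched as a small script. The app name and directories here are invented stand-ins; the script simulates the build box and the production box as two local temp directories so the flow is visible end to end.

```shell
# Sketch of the "build once, ship a tarball" flow: install and test on
# a build box, package the tested tree, copy it out, unpack, start.
# "myapp" and the temp directories are illustrative stand-ins.
set -e
BUILD=$(mktemp -d)    # stand-in for the build box
SERVER=$(mktemp -d)   # stand-in for a production box

# 1. npm install and the test suite would run here on the build box.
mkdir -p "$BUILD/myapp"
echo '{ "name": "myapp", "version": "1.0.0" }' > "$BUILD/myapp/package.json"

# 2. Package the tested tree exactly once.
tar -czf "$BUILD/myapp.tar.gz" -C "$BUILD" myapp

# 3. In production this would be an scp/rsync to each server.
cp "$BUILD/myapp.tar.gz" "$SERVER/"

# 4. Unpack on the server; normally you would then start the process.
tar -xzf "$SERVER/myapp.tar.gz" -C "$SERVER"
cat "$SERVER/myapp/package.json"
```

Because the artifact is built once and copied everywhere, every server runs byte-identical code, which is what makes the later shrinkwrap-hashing trick meaningful.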

It's pretty straightforward as far as releases go, and with this basic plan I was comfortable in the heat of the Black Friday release. I said, I'm going to push some buttons, and when we actually looked at the graphs, no one noticed that I was doing any release at all. That's a pretty incredible story, I think.

My advice to you for doing releases is: release often. Don't wait until everything is perfect and you've done six months of development and a billion code changes before you say, let's try it out. Start out really small, and release as often as you can possibly stand. The smaller your change sets, the easier it is to go, "oh, I just changed this little thing, it's probably something with that."

We also build once and then push to all the servers. You could use a strategy where you npm install everything right on the server itself, but Node artifacts are so small, just a little 5 MB file, that you can just throw the build out there. Figuring out what to do on each server is often, in my opinion, a lot more work than just pushing it out.

I would also suggest, even though a Node version upgrade seems like it should never break anything, that you always keep upgrades as a whole separate release. Especially at our transaction volume, we will notice the tiniest, most minuscule issue in Node. So for us it's really important that when we do an upgrade, we do just that, and nothing else changes.

And for us, a big problem with Node in general is that configuration ends up all over the place, so we made a module called Confidence. It's not really Hapi specific: based on environment or environment variables, you can manage all your configuration in one place. This has been really key for various projects, so you can check that out if you're interested.
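The idea behind Confidence can be shown with a toy resolver: one document holds every environment's settings, and a criteria object picks the active values. This is my own minimal sketch of the concept, not Confidence's actual implementation; the real module's API (a Store with `$filter`/`$default` keys) is richer than this.

```javascript
// Toy resolver illustrating the Confidence idea: a single manifest
// keyed by $filter/$default, resolved against a criteria object.
// This sketch handles only simple filtering; the real module does more.
function resolve(node, criteria) {
  if (node === null || typeof node !== 'object') {
    return node;                                   // plain value
  }
  if (node.$filter) {
    var branch = node[criteria[node.$filter]];     // pick the matching branch
    if (branch === undefined) {
      branch = node.$default;                      // fall back to the default
    }
    return resolve(branch, criteria);
  }
  var out = {};
  Object.keys(node).forEach(function (key) {
    out[key] = resolve(node[key], criteria);       // recurse into sub-objects
  });
  return out;
}

var manifest = {
  port:     { $filter: 'env', production: 8000,   $default: 3000 },
  logLevel: { $filter: 'env', production: 'warn', $default: 'debug' }
};

console.log(resolve(manifest, { env: 'production' }));
// → { port: 8000, logLevel: 'warn' }
```

The payoff is that development, staging, and production differ only in the criteria you pass, not in which config file you deploy.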

Shrinkwrap was briefly mentioned, and it's probably well known: you run an install, then you run npm shrinkwrap, and now if you ever want to deploy anywhere, you can run npm install and get exactly the versions of everything that you expect.

But I started thinking about using shrinkwrap in some different ways. Chris covered this pretty well: say you've made no code change and you do an npm install, and you're not necessarily using wildcards, but maybe you're on version 5.x of something. Personally, I don't want to pin a specific version like 1.2.3 and keep it there forever; I want to get the latest stuff all the time. It's a tradeoff between how often you want things to update and how often you want to be changing your package.json file. So you can use a shrinkwrap to see what has changed: diff the shrinkwrap between your builds and say, hey, it looks like that changed, is that OK? Maybe I should pay a little more attention to that.

If you're really creative, you could even generate release notes based on it: oh, it's this version, let's look up this git hash and do a whole report on the dependencies you care about. Another trick I've done is to take the npm-shrinkwrap.json and treat it as the thing that really describes exactly what all of my components are. It's not like I have one little project in git where I can grab the git hash; there are hundreds of modules going on here. This file really describes the uniqueness of my artifact, so I take a checksum of it. Now if I build with no code change, build again, and a little dependency changed, I can see that my hash changed. That's been really, really useful for us.

So, how do you scale this? Well, it's pretty boring; it's kind of what you'd imagine. We scale horizontally, and TJ mentioned this a little bit. You can do it a couple of ways: we have projects where you just run multiple processes on the same box with a process manager like PM2, or you can use a container or a small VM, like one CPU and three or four GB, per process.

Then you can treat each one as an entity relative to the load balancer. They can be on the same physical box, but to the load balancer they just look like a bunch of boxes, right? That way every instance uses the same ports, and you don't have to track "I have these four ports on this server." You want to simplify your management.
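The "same port everywhere, let the balancer see boxes" idea can be sketched with any reverse proxy. Here is a hypothetical nginx upstream (the talk doesn't say which load balancer Walmart used, and the hostnames are invented):

```nginx
# Hypothetical nginx config: one small VM per Node process, every
# instance on the same port, so the balancer just sees "boxes".
upstream node_app {
    server node-vm-01:8000;
    server node-vm-02:8000;
    server node-vm-03:8000;
}

server {
    listen 80;
    location / {
        proxy_pass http://node_app;
    }
}
```

Adding capacity is then just adding another identically configured VM to the upstream list.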

It's also always a good idea, especially if you have a lot of customers, to try some load testing if you can, even just basic load testing. You're not going to catch everything before prod; anybody who's worked in a production environment knows that as much preparation as you can do is good, but you won't catch everything. In Node in particular, I've noticed people run into things like weird arrays, or getting things from APIs they weren't expecting; security scans are really good at surfacing this. Then boom, your process crashes, and you really want to make sure you have some sort of auto-restart feature. If you're using SmartOS, you can use SMF; on Linux, you could use Upstart; or again, you can use something like PM2 for these kinds of activities.

As long as you're auto-restarting, that's good, and it can even save you from things like DDoS attacks if you're getting hammered.
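As one concrete flavor of the auto-restart advice, here is a hypothetical Upstart job using the respawn stanza (the service name and paths are invented for illustration):

```upstart
# Hypothetical /etc/init/myapp.conf -- an Upstart job that respawns
# the Node process if it crashes. Names and paths are illustrative.
description "myapp Node service"

start on runlevel [2345]
stop on runlevel [016]

respawn
respawn limit 10 5    # give up after 10 restarts within 5 seconds

exec /usr/bin/node /opt/myapp/server.js
```

SMF manifests on SmartOS and PM2's `--max-restarts` style options express the same policy in their own formats.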

Monitor.
I get this question a lot: what do I monitor? I'm so used to Java apps; what's important to keep track of in Node? Memory is probably the number one thing, for us anyway. Then application data (I'll talk a little more about what that means), application tests, the restarts I talked about, open files, which can occasionally be helpful, and just your general OS/system monitoring that you already know and love.

For RSS memory, about 3 GB was where we started having problems. We had heard 1.7 GB was the limit, and we said, let's take it up a notch, so we actually ran at about 2.8 GB and things worked pretty well. But when we started getting into that 3 GB upper range, we started having performance problems.

Heap is especially important to watch. There's roughly a 1.5 GB upper limit, and you really don't want to wait until you get there, because the garbage collector gets really aggressive and your performance starts hurting as it tries harder and harder to fit your memory in. Node is pretty good at staying up, and in some ways that can be a bit of a detriment, because your customers are suffering while things go slow. So what you really want to do is trend your load and memory, figure out historically what you're used to, and set your thresholds accordingly.
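Sampling the numbers discussed above takes nothing beyond the stock `process.memoryUsage()` API. The limits in this sketch restate the figures from the talk, but as it says, your alert thresholds should come from your own historical baseline, not from these hard ceilings.

```javascript
// Minimal memory sampler: report RSS and heap both in bytes and as a
// fraction of the pain points described above, ready to ship to
// whatever trending system you use. Thresholds are the talk's figures,
// not recommendations for your app.
var RSS_LIMIT = 3 * 1024 * 1024 * 1024;     // ~3 GB: where trouble began
var HEAP_LIMIT = 1.5 * 1024 * 1024 * 1024;  // ~1.5 GB: V8's practical ceiling

function sample() {
  var mem = process.memoryUsage();
  return {
    rss: mem.rss,
    heapUsed: mem.heapUsed,
    rssPct: mem.rss / RSS_LIMIT,        // fraction of the RSS pain point
    heapPct: mem.heapUsed / HEAP_LIMIT  // fraction of the heap ceiling
  };
}

var s = sample();
console.log(s);
```

Emitting this on an interval and graphing `heapPct` over days is exactly the trending that makes slow leaks visible long before the GC starts thrashing.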

Don't look at this slide and go, I'm just going to set my heap threshold to 1.5 GB and alert on that. If you normally run at 200 MB or so, that's what you should set your thresholds around. And Hapi has actual configuration for max sizes for RSS memory, heap memory, and the event loop. For us, we deal with a lot of APIs, as you can imagine in a big company.

Some of those APIs really suck, and we may have to wait a long time for a response. That can be problematic for Node, because while I'm waiting, the event loop can build up. So we actually have a feature in Hapi to say: OK, I've waited 30 seconds, this is way too long for this API, I'm just going to give up. That way your event loop doesn't fill up.
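The protections described above might be configured roughly as follows. The option names reflect the Hapi API of that era (load limits as connection options, a per-route server timeout); they have moved around between Hapi versions, and the byte values here just restate the talk's figures, so verify against your version's docs.

```javascript
// Sketch of Hapi-era limits: shed load before memory or event-loop
// delay gets dangerous, and give up on slow upstream APIs. Option
// names are assumptions based on the API of that period.
var connectionOptions = {
  load: {
    maxRssBytes: 2.8 * 1024 * 1024 * 1024,      // back off under the ~3 GB pain point
    maxHeapUsedBytes: 1.2 * 1024 * 1024 * 1024, // stay below V8's ~1.5 GB ceiling
    maxEventLoopDelay: 30                       // ms of lag before refusing requests
  }
};

var routeOptions = {
  timeout: {
    server: 30000   // stop waiting on a slow upstream API after 30 s
  }
};

module.exports = { connectionOptions: connectionOptions, routeOptions: routeOptions };
```

When a limit is exceeded, the framework can return an overload error immediately instead of queueing work it can never catch up on, which is what keeps the event loop from filling.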

Application data. Application data is generally your requests, your responses, the errors that happen, and ops data related to your instance. You take all that information and send it to some sort of aggregator, whatever you prefer: Splunk, Logstash, your big data store, Cassandra, whatever you want to do.

Then you look at dashboards, at all of this performance information: what are our response times, all those kinds of things. You'll find you can actually see your leaks over time, so it's been really useful for us. Within Hapi we have a plugin called Good that collects the requests, the errors, and the ops data, puts it in a JSON format, and lets you ship it off wherever you want.

For application testing, we actually strive for 100% code coverage. That's very difficult in the real world, but we especially strive for it in our open source stuff; in fact, all of our open source checks require 100%. And we always prioritize integration testing over unit testing.

I know people get kind of crazy with unit testing. We generally say, let's follow the whole flow, and whatever is left over we'll throw into a unit test. It's also really important to test your production. You have all these great integration tests in a test environment: here's the checkout flow, here's the cart flow, all those kinds of things. Run those tests quite regularly in production too, so you can see the overall health of your system, because for us, with tons and tons of dependencies, it's otherwise a nightmare to figure out what's going on.

We also have our own testing plugin called Lab; you can check that out. If you're familiar with Mocha, it's a lot like that, just more stripped down, with some pretty cool features. I just wanted to quickly mention that we have it.

Application restarts, I mentioned earlier, are important.
I've seen a lot of places where it's, "oh, it restarted," and then out of sight, out of mind. I think it's really important to actually check why it restarted. These problems compound if you're lazy about it, so I encourage you to always check that any restart is a known issue, and then actually resolve the issue as quickly as possible, because these things can pile up quick.

Open files: there's not a lot to talk about here, and this is true of pretty much any programming language. It was useful for us a lot in the beginning, when we were playing with things. If you're doing a lot with files, or with UNIX sockets, the open file count can indicate leaking. Generally you don't have to check it every five seconds or anything like that, but it's useful to check once in a while.
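A spot check of open descriptors is a one-liner on Linux. This sketch inspects the shell's own PID as a stand-in for your Node process; `lsof -p <pid>` gives a richer view, and on SmartOS `pfiles` plays the same role.

```shell
# Count a process's open file descriptors via /proc (Linux only).
# $$ (this shell) stands in for your Node process's PID.
PID=$$
COUNT=$(ls /proc/$PID/fd | wc -l)
echo "process $PID has $COUNT open fds"
```

Logging this count once a minute is plenty; a steady upward trend with flat traffic is the leak signal to look for.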

And then your basic OS/system monitoring: anything scoped from the system perspective. A lot of times for us, some Java client doing analytics or something like that is taking up 50% of the CPU while our actual Node application is bored out of its mind.

So just make sure you take a look at that, because it can affect performance. We had heard CPU usage was a potential problem because Node is single threaded, but during Black Friday we were like, did that CPU just get up to 2%? It was really a non-event for us, and really very different from our old Java land.

So now you've done a great job of load testing, but you have issues anyway. What the hell do you do? First, and this may seem obvious: look at your code. I'm kind of a systems guy as well, and the first thing I always hear is, well, it worked fine on my laptop, I don't understand. It's your problem now.

So the first thing we did was actually look at our code, and we solved a ton of stuff when we first went out. As we went to production and tracked everything, it was: this is our fault, this is our fault, we could do this a little better. Pretty soon we got down to a very minimal set, and then, again because we have such high transaction volume, we actually found problems in Node core. The big one people may have seen: if you search for something like "Node Walmart leak," you'll find the write-up fairly trivially.

That's a whole topic in itself that you could talk about forever, and I'm not going to, but it really highlights the worst-case scenario and how we tackled that problem. I'm not going to reiterate it, because it's a story that's already been told. So what are some real basic tools you can use?

--abort-on-uncaught-exception is huge. I mentioned poop before, and I'll mention it again: that was the tool we initially used to get the data when we got uncaught exceptions in Hapi. Now that --abort-on-uncaught-exception exists, it's been really, really useful for us, especially when something happens at two in the morning and no one's around. The process aborted, you automatically restarted it, and later you can take a look at the core: here's what was going on in memory, why did it happen? Sometimes you'll get it in the log, but not always. And say you actually are around, and you're running up against these heap or memory limits: before my release it was only 400 MB, and now it keeps growing every day. Believe me, you'll run into this. You can use gcore to grab that data from the running process, and then use the mdb debugger to take a look at "what are all my objects?" You'll start seeing a huge accumulation of objects, so that's been a really useful tool.

DTrace, again, is really useful if you need to look at things in real time, and you can instrument your own code to use it. And then there's the really simple advice: break down your problem. When we had a problem in Node core, we didn't say, here's Hapi, here's this giant thousand pages, here you go, can you solve my issue?

I'm sure Joyent would have happily done that for us, but what we did was say, OK, let's take Hapi out entirely. Let's reproduce the issue in plain Node. It can be somewhat difficult to do that for real, but we broke the problem down and got it to something simple, and at that point Joyent couldn't ignore it, right? It was: here you go.

I could reproduce it without any code of mine at all, and then they fixed it. It took them a while, but right before Black Friday we got that release, and we stopped leaking, and everything was pretty boring for us, which is really good news.

Some comments about DTrace. Again, some issues are really, really difficult to troubleshoot without DTrace, as TJ mentioned. DTrace is only available on SmartOS, at least as far as I know, and it's not something quick you're going to learn in five minutes, so it can be a real time commitment if you're in a small shop. The DTrace toolkit is helpful for getting started, but be aware that to really get good with DTrace, or the mdb debugger, you have to know some C. That's where we find the Joyent relationship really useful: we don't want to spend a lot of time mastering DTrace. We have some basic stuff we do, but we're really glad these tools exist, because it lets people who have those specialties debug the hard problems. And thanks in large part to TJ, the tooling on Linux is improving too, so Joyent can now support Linux. If you're doing something on Linux, you can pay them money for Linux support, and if you're a large enterprise organization that needs somebody to blame, they're happy to be that for you at a cost.

Monitoring is really, really critical for us, again especially around memory. As I've shown, releases are pretty easy. You can get crazy with them, but I'm more of a "keep it simple" kind of guy, and I don't want to be maintaining very complicated scripts. Compared to Java, development times (and you've heard this from the other speakers) are really fast, especially with our own framework, which again focuses on getting everything into configuration, so your business logic is a lot easier to manage.

npm is just a really great tool, and something I really like about Node: you have these small components, so you don't have to take the whole thing, like enterprise Java with its huge binary and everything else. It's built on the idea of only getting what you need. And Node is just super, super fast, so fast that it made our other stuff look boring. If you have money on the line, that's a really good problem to have. Boring is good, even though it's not sexy. So anyway, that's all I have.

Thank you for your time.