Node.js at Walmart: Going to production, debugging, monitoring


Wyatt Preul, Sr. Software Engineer, Walmart

One of the things we do is try to be as open source as we can, so whenever we're creating a new application, we take a step back and see if we can open source anything first. If we have business secrets, we might pull those out into a configuration just so that we can get a module out there. One of the reasons we do that is that we make sure all of our public modules have 100% code coverage, so one of the great things about open sourcing our modules is that it forces us to have 100% code coverage. We try to do that internally too, but we're definitely better about it when we have a community behind us saying, your build is obviously breaking here in public. And then anytime we are going to production, I've found it really helpful just to use GitHub issues to create our deployment plan, for example to assign somebody to install Mongo for us, so using GitHub issues and milestones has been really helpful with that. And then for our builds, we actually just use Jenkins.

So we use that. It creates a tarball with all of our artifacts, and that's what we deploy. And since we're using Hapi, our actual deployment artifacts are very simple: it's just a configuration and a package.json. I've added a couple of other notes here, because this might be a little different from other production environments.

We tend not to use the cluster module. We just use our load balancers and open up more ports, with processes listening on those ports. We also tend to favor lots of smaller VMs, with maybe only a single Node process running on each, which might be different. It just makes sense for us: if one goes down, it's not impacting a lot of processes.
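A minimal sketch of the port-per-process idea, assuming each process is told its port through an environment variable. The variable name PORT, the default of 8000, and the function names are illustrative assumptions, not details from the talk.

```javascript
// Sketch only: one plain Node process per port, with no cluster module.
// A load balancer (not shown) would spread traffic across the ports.
const http = require('http');

// Read the port from an environment-like object, falling back to a default.
// PORT is an assumed variable name.
function resolvePort(env, fallback) {
  const port = Number.parseInt(env.PORT, 10);
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
}

// Build the server without binding it, so the caller chooses the port.
function createServer() {
  return http.createServer((req, res) => {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end(`handled by pid ${process.pid}\n`);
  });
}

// Usage (hypothetical): start the same file several times on different ports.
//   PORT=8001 node server.js
//   PORT=8002 node server.js
// where server.js ends with:
//   createServer().listen(resolvePort(process.env, 8000));
```

Because each process is independent, one of them crashing leaves the others serving traffic, which matches the reasoning above.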

And then when we're actually getting to production, we just use bash. We use plain old bash to push our code up, and in each of the tarballs we have some bash control scripts that start up the Node instance and can kill it. And if you don't get anything else out of this talk, I want you to write down that third point.

This is something we learned a little bit too late, but I think it's super helpful. Pass the --abort-on-uncaught-exception flag when you start Node up. What that will do is drop a core file for you if you have an uncaught exception, and then you can go in later with MDB and see exactly what was going on, so that's been really helpful. And then also when we run in production, we send all of the console output, everything on standard out, to log files. That's been helpful too if there's an uncaught exception, just to have that information in a log file. I'm sure that's pretty standard, but I thought I'd mention it.
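As a rough illustration of those two points, here is a small Node launcher that passes --abort-on-uncaught-exception and appends the child's output to a log file. The talk describes bash control scripts; this JavaScript version is only a sketch, and the function names and file paths are hypothetical. Whether a core file is actually written also depends on operating system settings such as the core size limit.

```javascript
// Sketch only: start an app with --abort-on-uncaught-exception and
// send its stdout and stderr to a log file.
const { spawn } = require('child_process');
const fs = require('fs');

// Build the argument list for the child Node process.
function buildNodeArgs(entryPoint) {
  return ['--abort-on-uncaught-exception', entryPoint];
}

// Spawn the app; logPath is opened in append mode and shared by stdout and stderr.
function startApp(entryPoint, logPath) {
  const log = fs.openSync(logPath, 'a');
  const child = spawn(process.execPath, buildNodeArgs(entryPoint), {
    stdio: ['ignore', log, log],
  });
  fs.closeSync(log); // the child holds its own copy of the descriptor

  child.on('exit', (code, signal) => {
    // An abort usually surfaces as a signal (for example SIGABRT), not an exit code.
    console.error(`app exited: code=${code} signal=${signal}`);
  });
  return child;
}

// Usage (hypothetical paths):
//   startApp('server.js', '/var/log/myapp/console.log');
```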

And then for monitoring in production, we use a module called good. It's a Hapi plugin, and it allows us to watch our heap memory, our RSS memory, and our CPU use over time, and that's pretty basic. But it also allows us to monitor our garbage collection counts, which can be very helpful. To do that, we use memwatch. I don't know if you guys have tried that out, but since we are using memwatch, you can also pass in a flag to alert on any memory leaks that it finds.

And then, as Eran was saying, we do watch connections and disconnects, and we tie that in with requests: what endpoints are getting hit and how long those are taking to give a response. Something else that has been somewhat helpful, though not as helpful as I thought it would be, is the event loop delay, just seeing if there's a lot of activity on a server whenever memory use is high.

And then the next few slides are just about debugging. If we have a problem in production, we have a few plugins that we use if you want to debug from within the application. We have Reptile, which allows us to open up a REPL on the server and poke around and see what's going on, and then we have a plugin called Poop, which Eran totally named; I did not name that. And it makes perfect sense: the plugin will take dumps. Yeah, on here I call it a heap snapshot; you can call it a heap dump.
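To make the monitoring metrics mentioned above more concrete, here is a sketch that samples heap, RSS, and CPU counters and estimates event loop delay using only Node built-ins. This is not how the good plugin is implemented; the function names and the timer-drift approach are assumptions chosen for illustration.

```javascript
// Sketch only: sample process metrics with Node built-ins.

// Take one snapshot of memory and CPU counters.
function sampleProcess() {
  const mem = process.memoryUsage();
  const cpu = process.cpuUsage(); // cumulative microseconds since process start
  return {
    rss: mem.rss,
    heapUsed: mem.heapUsed,
    heapTotal: mem.heapTotal,
    cpuUserMicros: cpu.user,
    cpuSystemMicros: cpu.system,
  };
}

// Estimate event loop delay: schedule a timer and measure how late it fires.
// A busy event loop makes the timer fire later than requested.
function measureEventLoopDelay(intervalMs, callback) {
  const start = process.hrtime();
  setTimeout(() => {
    const [sec, nano] = process.hrtime(start);
    const elapsedMs = sec * 1000 + nano / 1e6;
    callback(Math.max(0, elapsedMs - intervalMs));
  }, intervalMs);
}

// Usage (hypothetical): log a sample and one delay estimate.
//   console.log(sampleProcess());
//   measureEventLoopDelay(100, (delayMs) => console.log({ delayMs }));
```

Sampling these on an interval and comparing the delay against memory use is one way to check whether high memory coincides with a busy event loop.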
So it's the same as with abort on uncaught exception, except this plugin will drop a heap snapshot whenever that happens. We also always run furball just to see which version of each of these plugins is running in production, and then of course we just look at log files if we have any other issues when we are debugging.

Probably the most helpful resource for debugging any of our production issues has just been the core file: getting that core file down, opening up MDB, looking at the stack, and then digging in to see exactly what was going on whenever you had an exception. In addition to loading that core file in MDB, you can run pfiles on it in SmartOS, and that will tell you the socket counts. That's actually happened a few times, where our socket count has hit a ulimit, and that's caused things to crash. So being able to look at that and say, oh, our sockets are at 1,024, that's kind of a weird number to be at; it's pretty obvious it's a ulimit. And then also with the core file, you can use pmap, and that will tell you, for each memory segment, how much is being used and by what. And I think, yeah, that's all I've got. Alright.
