Node.js on the Road is an event series aimed at sharing Node.js production user stories with the broader community. Watch for key learnings, benefits, and patterns around deploying Node.js.
Ben Acker, Senior Software Engineer
I've been doing a lot of Duolingo, so it's either tu quieres Node or [xx]. So, I've made a few promises for how I would start my talk off. Today's presentation by Ben will be a brief narrative with slight technical overtones discussing how Node has achieved production at Walmart. There are quite a few other talks about Node at Walmart. They get a lot more technical than the one I'm about to give. These are related here from Node Black Friday. I promised I'd use my NPR voice for a little bit; that's about as much as I can do.
There's a lot of them. These are just some of them, and these are all links to them, so when they're posted you'll be able to go in and see them. So, to start off and set the stage: that's a dinosaur, which you can't see really well. Walmart thought that creating some type of mobile presence would be a good idea, and they did about as you'd expect. This was a while ago, and they created this mobile application that was designed by a giant corporate enterprise that focuses on retail, and it was pretty crazy.
The things that were provided as service tie-ins to get into the supply chain were all, like, SOAP services. Later there would be other services, but the SOAP ones were still better than those, and they were kind of crazy. What happened later is there were these Java teams creating all of this, and they decided that they wanted to make stuff better, so they started getting companies to come in and build these things for them. They got folks to build an iPhone app that ended up being very successful, and they ended up acquiring the company that made it. They acquired a company for Android. They started doing native apps; they started doing tablet-specific stuff for all different platforms. So at this point you've got this services team that was doing the services for the original thing, and they are doing OK; they've finally got something decent after this other mobile app. But now they've also got loads and loads and loads of new clients, and they are starting to get a little nervous about this, because the services team didn't expand but is now servicing loads of different people and trying to turn these giant archaic XML objects into something that's readable and makes sense for mobile. XML generated from enterprise Java of eight years ago is definitely not mobile friendly. So you've got to come up with ways to make small things quickly to get to mobile, and there are a lot of other problems that you have to deal with. And the main thing is Walmart.com; mobile was kind of treated as an add-on later.
So the services that were provided, we've already talked about them being bad, but they were also not supported very well, so folks get angry. And then, once you get into stuff like Black Friday and you start doing loads more traffic, and other services are going down, people were focusing on the mobile stuff even less, so teams are going completely crazy. What this led to was a real focus on trying to move towards having better services.
And the way that they did that, was a couple of years ago, they brought on a guy named Eran Hammer to start a team and they brought him over from Yahoo, and gave him the option to use anything he wanted to write with, and so he chose Node, and they built a small team to start coding in Node and start revamping these mobile services.
I'm one of those folks, and immediately we were doing open source stuff, and things started getting better. At this point there were already a few good case studies of using Node: Voxer was around, LinkedIn had started using Node. And one of the problems with Walmart is that, because there are so many users immediately for everything, maintaining legacy services was going to be something that needed to happen. So this is one of the slight technical overtones. This is actually a copy of one of my old slides, but I love the slide because it's a good representation of the Walmart architecture for the Node rollout.
So, we have these old services that are sitting there. On top of that, we've got the old website, which was built on, if you all are familiar with it, the Java framework Wicket, so that's him. Wicket is there, and then on top of everything, we are going to chuck a Node proxy. What that does is get Node into production in a place where it takes a long time just to get hardware requisitioned, much less to get a new operating system or a new technology introduced into this giant corporate architecture. So getting all of that stuff in there is important.
Having it be a reverse proxy so that all the traffic can go through there is important for legacy users and what have you. It also provides a good tie-in for us to put anything in there that we want for analytics; we can easily throw in caching layers and do all that kind of stuff. So this was all part of the plan.
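The proxy layer described above can be sketched with nothing but Node's core `http` module. This is only an illustration under stated assumptions, not the code Walmart ran: the route prefix, hosts, ports, and the `onRequest` hook are all hypothetical names made up for the example. It shows the two ideas from the talk: requests fall through to the legacy backend unless a Node service has taken over that path, and every request passes one place where analytics or caching could be attached.

```javascript
'use strict';
const http = require('node:http');

// Hypothetical routing table: paths already served by new Node services.
// Everything else still goes to the legacy tier. Hosts and ports are made up.
const ROUTES = [
  { prefix: '/m/', target: { host: '127.0.0.1', port: 8001 } },
];
const LEGACY = { host: '127.0.0.1', port: 8080 };

// Pick the upstream for a request path, falling through to the legacy backend.
function pickUpstream(path, routes = ROUTES, fallback = LEGACY) {
  const match = routes.find((r) => path.startsWith(r.prefix));
  return match ? match.target : fallback;
}

// Build (but do not start) the proxy. `onRequest` is a hypothetical hook
// where an analytics call or a caching layer could be plugged in.
function createProxy(onRequest = () => {}) {
  return http.createServer((clientReq, clientRes) => {
    const started = Date.now();
    const upstream = pickUpstream(clientReq.url);
    const proxyReq = http.request(
      {
        host: upstream.host,
        port: upstream.port,
        method: clientReq.method,
        path: clientReq.url,
        headers: clientReq.headers,
      },
      (proxyRes) => {
        // Relay status, headers, and body back to the client unchanged.
        clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
        proxyRes.pipe(clientRes);
        proxyRes.on('end', () => onRequest({
          path: clientReq.url,
          status: proxyRes.statusCode,
          ms: Date.now() - started,
        }));
      }
    );
    // If the upstream is unreachable, answer 502 instead of hanging.
    proxyReq.on('error', () => {
      if (!clientRes.headersSent) clientRes.writeHead(502);
      clientRes.end();
    });
    clientReq.pipe(proxyReq);
  });
}

module.exports = { pickUpstream, createProxy };
```

Starting it would look like `createProxy(console.log).listen(3000)`. Keeping the routing decision in a small pure function makes it easy to move one path at a time from the legacy tier to a Node service.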
One other thing: I've heard a lot of people talk about how they've put Node in, but not as many folks talk about, especially in larger areas, what their build process is, or what their deployment process is, and ours is really, really, really simple. Our build is pretty much npm install, right?
So I've left out some of the lines for brevity, but for the most part, this is the meat and potatoes of our build. Jenkins runs our builds, and it basically does an npm install, runs the tests, removes the install, installs for production, shrinkwraps it, and then tars it up. After that, to deploy, we provide a giant array of all the places that we're going to SCP it to, and that is what we do.
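The build and deploy steps just described can be written down as a short script. This is a sketch, assuming the order given in the talk; the actual Jenkins script is not shown in this transcript, so the exact flags, the artifact path, the remote directory, and the function names are assumptions for illustration.

```javascript
'use strict';
const { execSync } = require('node:child_process');

// Build steps in the order described in the talk. Flags and paths are assumed.
function buildSteps(artifact = '../app.tgz') {
  return [
    'npm install',               // full install, including dev dependencies
    'npm test',                  // run the test suite
    'rm -rf node_modules',       // remove the dev install
    'npm install --production',  // runtime dependencies only
                                 // (newer npm versions spell this --omit=dev)
    'npm shrinkwrap',            // lock the dependency versions
    `tar -czf ${artifact} .`,    // tar up the tree, outside the source dir
  ];
}

// Deploy is one scp per entry in an array of hosts (the host list is hypothetical).
function deploySteps(hosts, artifact = '../app.tgz', remoteDir = '/opt/app') {
  return hosts.map((host) => `scp ${artifact} ${host}:${remoteDir}/`);
}

// Run each command in order; execSync throws on a non-zero exit,
// which stops the build at the first failing step.
function run(steps) {
  for (const cmd of steps) execSync(cmd, { stdio: 'inherit' });
}

module.exports = { buildSteps, deploySteps, run };
```

Calling `run(buildSteps())` and then `run(deploySteps(['app01', 'app02']))` would execute the whole thing; printing the arrays first gives a dry run.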
We do it in stages, and you know, there are load balancers. There's a great description that goes into a little more depth and detail, which Eran Hammer provides in an article that I believe I've linked to on here and that you can check out, but that's it. And this is one of the things that made both the Java team, who were freaking out, and us happy: a lot of the stuff that we were doing in Node was ridiculously simple, and what's more, it was easy to implement, and we just had loads of fun doing it. Throughout the entire time, we've had loads of help. I mean, TJ was talking about the community, and that's one of the things that has made Node super fun to be involved with: being able to go to meetups, being able to go to conferences, being able to use IRC and talk to folks and ask questions, and have folks offer help and advice, or just have folks get excited about projects, because people are doing loads and loads of fun stuff in that. So at Walmart we were always looking for help, and we were always asking for folks' opinions, and Eran's always voicing his opinions about stuff, and helping out in different places.
But some of the folks that we had help out, one of them was Mr. Fontaine here. We had this crazy, crazy, crazy memory leak when we switched up to 0.10, and it was ridiculously difficult to track down, and we would not have done so had it not been for Mr. Fontaine. Another: he was talking about removing the slab allocation in 0.12, and Trevor Norris had come up to talk to us about that, to say, hey, I have ways that you guys can make the Walmart stuff go faster.
This is one of the great things that I love about our community, is that folks are always trying to help each other out, right? So it's something that I really really enjoy. So, how did it go? So far, it's gone pretty good. One of the biggest tests that we had so far was Node Black Friday. That was last year for Black Friday weekend, we had our small team, it was only, I think it was only like six people at this time, five or six people.
We had a call coverage schedule, with the people that we were supposed to call, to cover if stuff went crazy. But what ended up happening was basically from Wednesday, all the way through when people started falling asleep on Black Friday, we were on a Google hangout, and it was a lot of fun. Like, it's one of the most fun times I've had at work, and I don't like ever working on holidays, ever.
And we pretty much just went through and live tweeted everything that was going on. So the stuff that we had set up at this time: that proxy layer that I was talking about was already in place, so all mobile traffic was going through Node. Some of it, like I said before, was still being served by the legacy Java services, but everything at least hit these Node services, and we had transitioned from serving the mobile website from the Java tier to Node, so all of MWEB was served from Node.
We had other Node services that were just written in Node, and then these other ones were proxied through, so all mobile traffic went through Node. In addition to that, we had an analytics system completely written in Node, so there's like some HTTP servers sitting on top of RabbitMQ, sitting on top of Mongo and Splunk. That was taking in all of our analytics data from all of mobile, and services were using it internally, so basically any time there was any type of request, there were two to three analytics calls in addition to all the other analytics calls. There were over 500 million people that went to Walmart.com, and more than 50% of those went to mobile on one day during that, and all of that traffic was going through Node.
So that was pretty fun, and it ended up going really, really, really well. Can you all see that OK? OK, these are, let's see. Did that help at all? It's not getting me anything else. OK, so all it says is that Eran had tweeted "breaking news, CPU on one server touched 2% for 30 seconds." To give some kind of background into what those numbers mean: those Java servers that were providing all these services were still there, so this isn't an apples-to-apples comparison with what was previously serving these services, and Java servers will generally run hot anyway. I mean, the footprint of the Java server when you first start it is going to be like a 2 GB memory footprint, whereas our Node servers, as you will see in a couple of slides, run on average at like 233 megabytes; there's not much there. But ultimately, what this bought us is all of that stuff that I said we provided: that full proxy layer, which gave us some analytics data on the stuff that's going through there, caching for some of the servers in some cases, and serving the mobile websites. And with all that, the servers would spike at 2% CPU usage. That's kind of gnarly.
The latency that the proxy would add was minuscule. It was really rad, so ultimately Node Black Friday was just completely boring. It was like the best kind of coverage to be on, right? Because absolutely nothing went wrong. So, no spoilers from folks: Eran asked a question at one point. Can you all see that those lines basically just go straight across?
Those lines represent all of our servers and the memory in bytes for them, and he asked folks what was going on, because at the end you see they start to fall off there, so if you know what's going on here, don't answer, but if you're guessing, does anybody know what's happening in these? This is on Black Friday.
What was happening? Anyone? Alright, so it kept going on, and this is more like a wish. Man, it's kind of just too tough to see, huh? OK, so, and then all of them dropped further. That's from when we deployed on Black Friday. Yeah. It was pretty rad. So that deploy script that I showed you before: like I showed you, there's basically npm install, we test, and then we SCP it out to everything. We do it in phases to make sure that we can check everything, so that extends the build a little bit, but ultimately, if we had the whole thing automated and were just letting it run, it would happen within just a few minutes.
Normally we like to bring a few down, check them, make sure they are OK, then bring them down in waves and back up. There was a fix that was going to go out for mweb, and we just decided to deploy it in the middle of Black Friday. Lost no traffic; nothing went wrong; everything was great. In fact, this is down from Java deploys, which could literally take an entire day, well over 12 hours, sometimes into 18, and then rolling everything back afterwards.
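The phased rollout described above (a few servers first, a check, then the rest in waves) can be sketched as follows. This is a minimal illustration, not the team's actual tooling: the wave sizes and the `deployHost` and `isHealthy` callbacks are hypothetical and stand in for whatever takes a server out of rotation, updates it, and verifies it.

```javascript
'use strict';

// Split hosts into a small first wave followed by larger waves.
// The default sizes are arbitrary example values.
function planWaves(hosts, canarySize = 2, waveSize = 4) {
  if (canarySize < 1 || waveSize < 1) {
    throw new RangeError('wave sizes must be at least 1');
  }
  const waves = [];
  if (hosts.length > 0) waves.push(hosts.slice(0, canarySize));
  for (let i = canarySize; i < hosts.length; i += waveSize) {
    waves.push(hosts.slice(i, i + waveSize));
  }
  return waves;
}

// Deploy one wave at a time; stop before the next wave if any host in the
// current wave fails its health check. Both callbacks return promises.
async function rollout(waves, deployHost, isHealthy) {
  for (const wave of waves) {
    await Promise.all(wave.map((host) => deployHost(host)));
    const results = await Promise.all(wave.map((host) => isHealthy(host)));
    if (!results.every(Boolean)) {
      throw new Error(`health check failed in wave: ${wave.join(', ')}`);
    }
  }
}

module.exports = { planWaves, rollout };
```

Because only one wave is down at a time, the load balancers can keep sending traffic to the remaining servers, which is what makes a deploy in the middle of a peak day plausible.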
So moving from that kind of deploy to one where we could basically have it automated and happen within a few minutes, that's kind of awesome, so that's one of the really big gains that we got there. I've got all my notes and sketches on graph paper too, by the way, so it's my little set list. So, this pretty much sums up what happened on Black Friday.
One of the most exciting things was Eran started playing One Direction super loud at 2:30 in the morning, and it kind of freaked everybody out. So what happened after this was, well, everything went well, and right now they are completely redoing Walmart.com. They were like, hey, you guys did great on mobile, so now we are expanding to all of Walmart.com. So we're doing the same thing over there, and we are developing a whole bunch of new services there, so the team is expanding and it's doing really great. They combined a whole bunch of mobile teams that were separate previously, well, like three of us, into one team, and I really wanted to come up with a good name for our team, and I was like, 'the away team.' I love Star Trek, and I like to think of some of us in there as the away team, and they were like, 'No.' Eran Hammer is my boss, and for a lot of our open source tools, like the hapi web framework, all of the names are based on John [xx] stuff from the 90s, so I was like, you know what, Hammer Team. Hammer Team is also a great throwback to the 90s, and I would love to be called that, and they were like, no, no dude. And I thought that this was because they had some kind of super, super rad name, and I placed a lot of faith in Eran: this name that you guys have come up with for our services team is going to be so rad and awesome. And we're client services. But it's desktop clients and mobile clients; it's very descriptive.
Yeah, but ultimately it's been, it has been a riot. Like, it's a lot of fun. Now, I came up with the name, my name for this slide is this is my JIFASNIF slide, this is my, Node is fun, Node is a load of fun to use. Ultimately there are enough case studies to show that Node is ready to be used anywhere from doing like a start up to giant enterprise companies, right?
So if you're interested in learning it at all, go ahead and start. Just asking questions anywhere gives us a chance to grow the community, because it gives folks in the community a chance to answer them. And if you're already in the community, you can step up and help answer those questions, you can help out with the kNode project, or help out by providing just whatever you want.
There's stuff on IRC, Stack Overflow, there's loads of stuff that can happen, and there are also loads of communities even within the Node community. There's hardware, there's server-side stuff, like what I do mostly. There's loads of folks on the front-end and there's the meatspace folks, which is like a great community also.
This is another result of a promise that I've made, but ultimately just start up. Join us. Join the Node community. It's ready. That's all I got folks.