Node.js at Walmart: Introduction

Speaker: Eran Hammer, Sr. Architect, Walmart

I'm Eran Hammer. I'm one of the Node engineers at Walmart. I lead the team that is working on most of the tools and the basic foundation that we're using to build all our other services. What we want to do here today is give you a few talks. They're not quite lightning talks, but they'll be pretty short, to give you a taste of the different areas that we work on and our different practices. It's kind of hard to pick exactly what will be the most valuable to people.

So we're going to try to touch a little bit on how we build back-end services, how we're using Node to drive our front-end tools, the methodology we use to actually build components, and how we break our applications into components so separate teams can work on them. We'll talk a little bit about our production experience, going to production. That's always a painful part of writing software.

And we also have what I think is a really cool treat. We have been fighting this memory leak for like six months, and I feel like the childhood story about the boy who put a radish seed in the ground and kept saying, it's going to grow, and his mom came and said, it's not going to grow, and then his dad came. I kept saying there's a memory leak and it's not mine, and people said, no, you have a problem, and I said, this memory leak is not mine. I think I proved it's not mine. I think, because we still don't know where it is. But it's a really cool story about how we went about first of all identifying it and reproducing it (it took four months to reproduce it outside of production), and then TJ from the core team is going to take over and talk about the work that they are doing now to actually try to fix it. It involves some really, really cool stuff: debugging, core dumps, and looking at heatmaps and pretty charts. We'll do that at the end.

So, the first thing I want to talk about is really the most important part of our Node operation, and that's community. That's really the heart of what we do. And it's advice I keep repeating to anybody whose company is building any dependency on Node: you kind of want to do three things.

One, you want to have a member of your team who's an active community member. That's not something that you outsource or bring a consultant in for; you want to have somebody who's hooked in, who goes to meetups, who goes to events, who has a bunch of open source things. You want someone who has a public face. It's really important, especially with a product this young.

Two, if you're in a company like Walmart, and you're investing in a technology that is really critical for your business, where if something goes wrong with it you're screwed, you want to have a support system. For us that support system is basically the Node core team, who are amazing, and we're relying on SmartOS with Joyent to give us the extra benefit of going deeper, analyzing core dumps, and analyzing problems that are beyond just understanding the JavaScript part. And on the core team side, I'll tell you what my approach has been. Not only do we go to meetups, and talk to members of the team, and hang out on the IRC channel, and [xx] as much as we can, but we also build a personal relationship with the people that we depend on. I mean, it's good advice in life in general, but hey, go take people out to dinner. I'm not kidding. Have a meet, come to San Francisco, or if you're in the Netherlands, find the libuv guys, and take them to dinner. Buy them something nice. And it's not a bribe or anything, you are not buying them. If anybody thinks that they can influence what Isaac's priorities are by bribes, they're mistaken.

[audience] Anybody thinks that they are Eran Hammer.

But joking aside, if you're depending on this technology at this young age, you're going to find problems that will require involvement from the core team and from core community members, not necessarily just the people who are core contributors. Forest is right here; if you have domain problems, he's probably the one you need to help you fix them. Those are really the things that require it. Yes, you can pay a consulting company and they will help you out with some bugs, but that's not a sustainable model for you long-term.

Long-term, if you're depending on this technology at this point, you want to embed yourself in the community in some sustainable way, and building a personal relationship is probably the strongest, best thing. You all know that when you meet someone at a meetup and then afterwards you put a pull request on their GitHub project, the chance of that being accepted is like 500 times more than if you just show up out of the blue and say, hey, can you fix this thing.

Because, oh, I remember this guy, now you feel bad if you're ignoring him. That's really important.

So the way we use Node is basically we have a whole bunch... I was really busy preparing for another talk for Real Time conference, that was like a four-month project, and I was like, I'm not going to make it big for this talk, I'm just going to wing it. And last night at 2 A.M. I was like, I need to have some cues for myself, and I was like, pigs, I'm going to put pigs on my slides, and it made sense last night at 2 A.M.

These are mostly for me. There's very little useful information on my slides; these are my notes. The way we use this is we have a large established Java services layer, some of it running on a variety of platforms, consisting primarily of e-commerce APIs from the existing website and their fulfillment infrastructure, and then we have other APIs coming from the brick and mortar stores, and that's the US alone. We have other regions.

And the store APIs are running on mainframes, and the technology on the .com site is slightly more modern, you know, latest 1998 technology. These systems were not designed for mobile, they're not designed for very quick turnaround, and they're not designed for service-oriented architecture. They're really designed to build a web page on the server and then serve that entire page experience, and on mobile basically you're doing a lot of API calls. You have a static client, whether it's installed or a one-page app on your phone, and you serve the data.

So to adjust those existing APIs, we have an existing Java mobile API layer, and that layer is reaching its end of life. Basically it's hard to extend, so we decided two years ago to replace it with Node, because we've figured out the sweet spot for Node is really to be an orchestration layer.

It's not doing a lot of calculation. You're basically building a glorified proxy that's doing some live data manipulation, some adjustment, and lets you take APIs from different teams and make them more uniform and consistent. That's primarily the area where we're using Node. And what we're getting from that is, first of all, a migration platform.

Migrating from an old system to a new system in an enterprise is a pretty painful process. So being able to have a very smooth way to pick whatever services you want to migrate, and just work on those while the existing ones are still there, is really convenient. We have taken a proxy approach where basically all the traffic goes through Node, and then Node asks, do I know how to do this? No? OK, go back to Java. I know how to do this? I'm going to serve it back to the client. And so as long as you keep that interface compatible, you can do in-place migration of individual services, and you can migrate just what is meaningful to you, where you have value. You don't have to go and do a big swoop migrating everything all at once, which is painful and rarely works. If you're interested in more about that, I gave a talk six months ago at the Node meetup, which is on the Joyent blog, about using Node as a proxy and the whole migration strategy. So if it's interesting to you, you can find more details there about that particular aspect.
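[Editor's sketch] The "do I know how to do this?" decision can be pictured as a lookup with a fallback. This is a minimal illustration only, not Walmart's code: the route names, the handler bodies, and the forwardToLegacy stub are all assumptions, and a real proxy would stream the HTTP request and response to the Java layer rather than return plain objects.

```javascript
// Hypothetical sketch: routes, handlers, and the legacy stub are illustrative.

// Handlers for the endpoints that have already been migrated to Node.
const migrated = new Map([
  ['GET /cart', () => ({ servedBy: 'node', body: { items: [] } })],
]);

// Stand-in for forwarding the request unchanged to the existing Java layer.
function forwardToLegacy(req) {
  return { servedBy: 'java', body: `legacy response for ${req.path}` };
}

// Every request enters here. Serve it from Node if a handler exists,
// otherwise hand it back to the legacy service.
function dispatch(req) {
  const handler = migrated.get(`${req.method} ${req.path}`);
  return handler ? handler(req) : forwardToLegacy(req);
}

console.log(dispatch({ method: 'GET', path: '/cart' }).servedBy);
console.log(dispatch({ method: 'GET', path: '/orders' }).servedBy);
```

Migrating one more service then means adding one entry to the map; clients keep calling the same interface and everything else still falls through to Java.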

We're using it for analytics. Because we're pumping everything through Node, we're actually getting the client perspective of our APIs versus what we are getting reported from the actual back-end Java services. The Java services seem to be a little biased about their own performance (and I'm talking about the machines, not the people doing it), and so we're getting more accurate, more reliable results coming from the proxy layer. It's not unique to Node, that's just how we're using it, but it also gives us a perspective on disconnects. Node is very good at alerting you whenever something reconnects or disconnects. The events are all there, so it's easy to know when someone terminates the request in the middle, or when someone stops listening to a response you're sending them. Especially when you're dealing with mobile clients, it's really, really important to get that data, and I can actually show you live data later on.
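[Editor's sketch] One way to count mid-response disconnects relies on the 'finish' and 'close' events that Node's http.ServerResponse emits: if 'close' arrives before 'finish', the client went away early. The trackDisconnects helper and the stats object are hypothetical names, and the responses below are simulated with plain EventEmitters so the example runs without a server.

```javascript
const { EventEmitter } = require('events');

// Counts responses that were fully sent versus ones where the connection
// closed first. `res` is anything that emits 'finish' and 'close' the way
// http.ServerResponse does.
function trackDisconnects(res, stats) {
  let finished = false;
  res.once('finish', () => {
    finished = true;
    stats.completed += 1;
  });
  res.once('close', () => {
    if (!finished) stats.aborted += 1;
  });
}

const stats = { completed: 0, aborted: 0 };

// Simulated response that is fully sent before the connection closes.
const ok = new EventEmitter();
trackDisconnects(ok, stats);
ok.emit('finish');
ok.emit('close');

// Simulated response whose client disconnected mid-flight.
const dropped = new EventEmitter();
trackDisconnects(dropped, stats);
dropped.emit('close');

console.log(stats);
```

In a real server the same helper would be called with the response object inside the request handler.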

Performance: we're using Node as a way to basically work around API limitations, whether it's localized caching or batching requests so the mobile client doesn't have to make multiple round trips. So it's really a classic orchestration layer that we're building on top of existing legacy APIs.
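[Editor's sketch] Batching can be illustrated as a single call that fans out to several back-end requests concurrently and returns one combined payload. The backends table, its payloads, and the batch function are assumptions made for illustration; real back ends would be HTTP calls to the existing services.

```javascript
// Hypothetical back-end calls; names and payloads are illustrative only.
const backends = {
  profile: async (id) => ({ id, name: 'example' }),
  cart: async (id) => ({ id, items: [] }),
  offers: async () => { throw new Error('backend unavailable'); },
};

// Runs every requested call concurrently and returns one combined result,
// so the client makes a single round trip. One failing back end does not
// fail the whole batch.
async function batch(calls) {
  const settled = await Promise.allSettled(
    calls.map(async ({ name, arg }) => {
      const fn = backends[name];
      if (!fn) throw new Error(`unknown call: ${name}`);
      return fn(arg);
    })
  );
  const out = {};
  settled.forEach((result, i) => {
    out[calls[i].name] = result.status === 'fulfilled'
      ? { ok: true, data: result.value }
      : { ok: false, error: result.reason.message };
  });
  return out;
}

batch([{ name: 'profile', arg: 7 }, { name: 'offers' }])
  .then((result) => console.log(result));
```

Reporting each call's outcome separately is a design choice: a mobile client can still render the parts that succeeded.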

Productivity, that's a huge one. We have multiple examples where we were able in hours to take an existing API and write a brand new one, whether a new API or a compatible API, and deploy it to production all within a 24 to 48 hour cycle, because JavaScript is so convenient and there's no build nightmare and none of the problems we're all familiar with, especially when you're dealing with a large company that has a big established release process. By using JavaScript and Node, you can really cut the time on releases. And doing a rollback is really easy, because if you're releasing one JavaScript file, you can just roll back one JavaScript file. With a big Java build, unless you have the exact version that was there before, you can't just switch it and restart, because you might have multiple fixes that depend on each other. It's much easier to do in the modular environment of Node than in the compiled environment of Java.

And last, and that's something that we're just starting: this was like the big promise of Node three years ago, oh, you can use the same code in the back-end and in the front-end. That hasn't really proved itself to be valuable for most people, but we're now starting to see value in it. For example, we have a new A/B test environment that we just hacked together for our mobile web environment, and we can move the processing of the config files, which are in JSON, to the server or to the client. We can put it anywhere we want. Also, Kevin Decker, our lead architect for the mobile web experience, will talk later about how we're using Node as part of the toolchain. We're building our front-end experience using Node, and that will give us a path later if we want to start doing more server-side rendering, for a one-page app, or optimization for search, and stuff like that.

Our stack is pretty straightforward. We're using mostly SmartOS for our Node in production. It's been a pretty nice experience. DTrace and MDB are just magic, and if you have a production issue and you don't have those two tools, you're significantly handicapped. Later, when you see what TJ is doing, most of the magic is basically spending all day inside MDB and really digging in to levels that are otherwise hard to see. If you have a brand new Node application in production, you have a memory leak, guaranteed, and figuring out memory leaks without DTrace and the MDB JavaScript object list is not fun. You actually have to read your code.

Next on our stack, we have hapi. Hapi is our open source Node server framework. It's part of an entire package: we have about 30 open source modules that we support, and we're using most of them in production. And we decided to build hapi as opposed to using Express mostly because of the reason that Ben Acker is going to talk about later, the plugin architecture.

We felt that Express is awesome, and very, very fast, and lean and mean, but it also means that if you have a large team in a distributed organization, everybody has to keep changing the same file that has all the middleware loading, and if somebody squeezes one little word out, the whole thing will blow up. It's not really an environment that's easy for a large distributed team to work in over time. It's great to hack on very fast, but that's kind of like your personal startup, where, you know, the day I'm not there anymore is probably the end of the startup. At a company like Walmart, people move, people change either jobs or priorities. It's a big company, and so we wanted something that's easy to maintain. And on top of that, we have the plugins, which are basically a very simple way to build what we call Server Partials. So instead of having one big server that has all the routes and all the methods in there, you can actually break it into pieces and then load them in a modular way, so some of the plugins we have are specific to QA or specific to production.
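[Editor's sketch] The idea of breaking a server into pieces that are chosen per environment can be shown without any framework. This is deliberately not hapi's API: the plugin names, the route table, and buildServer are hypothetical, and only illustrate configuration deciding which pieces get loaded.

```javascript
// Each plugin adds its own routes to a shared route table.
// All names here are hypothetical; this is not hapi's plugin interface.
const plugins = {
  catalog: (routes) => routes.set('GET /items', () => 'item list'),
  checkout: (routes) => routes.set('POST /checkout', () => 'order placed'),
  qaFixtures: (routes) => routes.set('POST /qa/reset', () => 'fixtures reset'),
};

// Configuration alone decides what each environment loads.
const config = {
  production: ['catalog', 'checkout'],
  qa: ['catalog', 'checkout', 'qaFixtures'],
};

// Builds the route table for one environment from the same code base.
function buildServer(env) {
  const routes = new Map();
  for (const name of config[env] || []) {
    plugins[name](routes);
  }
  return routes;
}

console.log([...buildServer('production').keys()]);
console.log([...buildServer('qa').keys()]);
```

Each team owns its own plugin file, so nobody has to edit one shared list of middleware to add a route.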

Dev can load them separately too, and we don't have to have a build for each environment; it's just a config that says which plugins are loaded in each environment. This is an interesting detail that's worth mentioning: we are now in the process of starting to migrate to a private NPM deployment, and we're working closely with nodejitsu.

Where's Charlie? Right over there. We're working closely with nodejitsu to customize the private NPM package, and everything they are doing for us is part of the open source distribution, so we're not keeping anything proprietary. But if you work for a big company, you'll find yourself in a situation at some point where you have developers who want to just keep downloading stuff from the internet, and you have security people who don't want you to download stuff from the internet. And it's one thing when you first get something going and you deploy it, right?

Nobody here ever had security actually do a review the first time you brought Node into the organization, but a year later they'll say, oh, hold on, why are you getting this new version of this module? Have you checked it? Because once you already have it in, and people know that you are using it in your organization, well, now they can target you. If I knew you were using hapi and I injected some back door into hapi, well, great, now I can go and hack anyone who is using hapi, right?

And open source helps a lot with security, but it doesn't solve everything, so we decided as engineers to pre-emptively prevent the bureaucracy from coming and telling us how to do our work. The solution we came up with is basically this private NPM solution where devs can use whatever they want. There is no network blocking: you go use NPM, the public NPM, or the private one for the modules that we don't want anybody to see, the internal modules. But QA and production can only access the private one, which means that if you as a developer are creating any dependency on outside modules that are not available inside, then somebody will have to review them and approve them. We'll be working with nodejitsu on making sure that the process of doing these reviews, the whitelist and approval, and keeping it in sync is really, really easy to do, so that the engineers can keep doing it themselves. Nobody wants a change committee, right?
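[Editor's sketch] The effect of "QA and production can only access the private one" can be pictured as a check of a project's dependencies against the reviewed list. Everything here is an assumption for illustration: the approved map, the module names, and the version numbers are placeholders, and this is not how nodejitsu's private NPM product is implemented.

```javascript
// Hypothetical approved list: module name -> versions that passed review.
// Names and version numbers are placeholders, not real review data.
const approved = new Map([
  ['hapi', new Set(['1.0.0'])],
  ['internal-cart', new Set(['2.3.1'])],
]);

// Returns the dependencies that would not resolve from the private registry.
// `deps` maps module names to exact versions, like a package.json
// "dependencies" section with pinned versions.
function findUnapproved(deps) {
  return Object.entries(deps)
    .filter(([name, version]) => {
      const versions = approved.get(name);
      return !versions || !versions.has(version);
    })
    .map(([name, version]) => `${name}@${version}`);
}

console.log(findUnapproved({
  'hapi': '1.0.0',          // reviewed version
  'internal-cart': '2.4.0', // newer version, not yet reviewed
  'some-new-module': '0.1.0',
}));
```

A check like this could run in a build step so developers learn about an unreviewed dependency before QA does.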

That's bad. Stability has been a big issue for us. In hindsight, we made a mistake going live with .10. We basically went live with Node 0.10.2 or 3, and I can tell you that we have encountered every bug that was fixed between .3 and .17 in production. I don't have some kind of a policy about it, but I can tell you the lesson I have learned. People expect me to say, oh, so you're going to wait for .12.20 before you move to .12? No, no, no, my solution is I want to move to .11. Now, I'm not moving my entire data center to .11, that'd be crazy. But what I want to do is put one or two machines with production data on .11, and if they blow up, well, we have load balancers, we have other things that'll keep it going. What I'm trying to say is, nobody is using .12 until my use cases are all met. When .11 is stable in my environment, that's when I feel comfortable giving feedback. Of course we're only one data point, but if a large user is really battle-testing it and giving feedback, the community will do the right thing. They'll say maybe it's not really ready, because we don't want to do a .12 that already has known bugs in it. So if you're in a position to actually give .11 a spin, please do it now. If you're going to wait for a stable .12 release, you're basically waiting another six months, and that's a shame because there is a lot of really good stuff in the next release. There are some major performance improvements and some memory improvements that you want.

So go ahead and do it now, that's the lesson learnt, not to wait until now.
