Walmart Node.js Memory Leak Part I

Memory Leak Huntin'


Eran Hammer, Sr. Architect, Walmart

After about a year of waiting to have a production environment that could actually run Node, it turned out that we'd been building Node into our services, and we were like, OK, great, we're ready, let's release. Well, we couldn't compile Node on the Solaris distributions we were running in the data centers. So now we had to wait until new machines, a new operating system, everything was provisioned. And then we were finally ready to go, and we went all in one swoop: new hardware, SmartOS, new version of Node, everything all at once. Put it in production and it goes [explosion sound]. Basically, the very first version had a memory leak that killed the servers within like 10 to 15 minutes. It was crazy.

Alright, we fixed it. We read all the blog posts about leaks and JavaScript and Node; the biggest one being where you attach a handler with on instead of once on a socket, and now that socket will never get cleaned up. There's a bunch of these, you can look them up. But we got everything cleaned up, and now we have a process that is running, doing what it's supposed to do, only we're looking at our memory (this chart is recent, it's from September, but this story is back in June) and it's basically leaking 150 Megs a day. So we're leaking 150 Megs a day, which means we can keep the servers going for about five days, and then we have to restart them or they just go [explosion]. And we're looking at it, and the V8 heap is stable, so it's not a JavaScript memory leak. It's not closures or objects not being released; it's something else.

So we start going on our adventure, and we go to all the experts at that time, whether it's Voxer or Joyent, and we ask them, hey, help us, we can't figure this out. And looking at it, we kind of went through all the obvious things: maybe this is just load. You get to a point where you have too much load, you buffer too many things, and one slowdown later you've already exceeded what the process can safely handle. So maybe increase the V8 memory size, or just put more servers in, or all the typical things. They're basically saying, this is not really a memory leak, it's more of a back pressure problem.
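Going back to that on-versus-once thing for a second, here is a minimal sketch of the pattern (made-up names, not our actual code): every request adds another listener to a long-lived socket, and whatever that listener closes over stays reachable for as long as the socket does.

```javascript
// Hypothetical sketch of the on-vs-once listener leak.
const net = require('net');

// A long-lived socket, e.g. a keep-alive upstream connection.
const socket = net.connect(80, 'example.com');

function handleRequest(requestId) {
  const perRequestState = Buffer.alloc(1024 * 1024); // 1 MB tied to this request

  // LEAK: every request adds another 'error' listener to the same socket and
  // never removes it, so perRequestState stays reachable via the closure for
  // as long as the socket lives. Node even warns after 10 of them
  // ("possible EventEmitter memory leak detected").
  socket.on('error', (err) => console.error(requestId, err, perRequestState.length));

  // FIX: socket.once('error', handler) removes the listener after it fires,
  // or call socket.removeListener('error', handler) when the request is done.
}

// Simulate traffic: RSS climbs by roughly 1 MB per call.
setInterval(() => handleRequest(Date.now()), 100);
```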

By back pressure they meant: you're having too much load, you're not handling it well, and when that happens your server goes into a bad state. So we tried all those things, but I was like, no, no, no, look at this thing. Now, if you're looking at this chart, this is Node, midnight to midnight, in September. We have actually now got it down to like 8 Megs a day, and when you're leaking 8 Megs a day, you stop caring. Because at this point I could run for a pretty long time: we're pretty stable at like 250 on the baseline, so if I just stay at 250 and I lose 8 every day, OK, I'll recycle my servers once a month.

That's good enough for me. It's still better than a .NET installation where you have to restart the entire NT operating system, right? We're all familiar with web services that leak; that's part of the business. But it was driving me nuts. Once we got low enough, we moved on to other priorities, because now we could sustain it: it's in production, it's fine, it's working. But then the numbers kept getting lower and lower, and I was like, we're not actually trying to solve this memory leak, and it's getting better.

Why? And the pattern always looks exactly like this. What the hell is going on here? So we started looking at the numbers, right? We got obsessed with these numbers. We use Splunk quite heavily, so every 30 seconds we're getting all the metrics from every Node process we have, then it goes into Splunk, and we chart it.
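Roughly what that 30-second sampling boils down to, as a sketch (the field names and transport here are made up, not what our plugin actually emits): log RSS and the V8 heap side by side, because the V8 heap staying flat while RSS climbs is exactly what said this wasn't a JavaScript-level leak.

```javascript
// Hypothetical metric sampler: every 30 seconds, emit process memory as a JSON
// line so it can be shipped to a log aggregator (Splunk, in our case) and charted.
const os = require('os');

setInterval(() => {
  const mem = process.memoryUsage();
  console.log(JSON.stringify({
    ts: Date.now(),
    host: os.hostname(),
    pid: process.pid,
    rss: mem.rss,             // the number that kept climbing
    heapTotal: mem.heapTotal, // the V8 heap, which stayed stable
    heapUsed: mem.heapUsed,
    uptime: process.uptime()
  }));
}, 30 * 1000);
```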

This chart is just RSS, which is the only thing we had a problem with. And I'll show you some of our live metrics in a second; there are some really cool charts, like CPU, the most boring chart we've ever seen. Node CPU is just boring. So alright, let's dig deeper. I spent four months on this. Now, we could not reproduce it. For months we just couldn't. Wyatt wrote a recorder that records everything that comes into the server, and we replayed the same traffic.

Nope. Nothing. Only in production. So let's just look at this. Here's one day. And you're like, oh, this looks suspicious, right? What are these jumps? This is a problem. Something is happening right here, where we're jumping up in memory. And this was a complete red herring. If we zoom in, it's not really a straight jump, it's a little more gradual, and if you zoom in even more, well, now it's not even a jump at all.

So this is the most data that we have: 30-second data points, that's all we got, because when we tried to sample more often, we were basically generating way too much data and other problems happened. But clearly memory increased here; the baseline was here, and now the baseline is here. But if I go back out, well, yeah, then we're looking at this guy, right?

But then we went up, and then we went down again, so we don't know which memory is not clearing up. We don't know if this guy here has anything to do with it, if this spot is the bad spot; most likely it's all these little guys that are actually doing it, and this one is just the red herring. But we spent a lot of time on this.

So I stared at these, and I was probably distracted through many, many TV shows, sitting with my laptop trying to figure it out. If you look at what we're monitoring today in the good plugin, there's garbage collector activity, client disconnections, all sorts of things, because we kept coming up with theories, like, oh, it's clients disconnecting. We have too many clients disconnecting and that's causing some kind of leak somewhere. So we added more sensors: nope. Couldn't find any correlation with anything reasonable, except what we already knew.
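One of those extra sensors, sketched hypothetically (the counter and the reporting format are made up): count client aborts per interval and log them next to RSS, so the two series can be charted together and checked for correlation.

```javascript
// Hypothetical "client disconnect" sensor on a plain Node HTTP server.
const http = require('http');

let clientDisconnects = 0;

const server = http.createServer((req, res) => {
  // 'aborted' fires when the client goes away before we finish responding.
  req.on('aborted', () => { clientDisconnects += 1; });
  setTimeout(() => res.end('ok'), 500); // pretend to do some work
});

// Report the counter alongside memory every 30 seconds, then reset it.
setInterval(() => {
  console.log(JSON.stringify({
    ts: Date.now(),
    rss: process.memoryUsage().rss,
    clientDisconnects
  }));
  clientDisconnects = 0;
}, 30 * 1000);

server.listen(8080);
```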

Now, so I started asking myself: what have we done that lowered the memory leak's impact? We tried a different version of Node; it behaved the same. So we did something, and the more I thought about it: OK, we have reduced the number of HTTP client requests we're making. We're operating a proxy, right, so for every request that comes in, we make an HTTP client request upstream, and over time we have improved our proxy's performance. We improved some caching, we put in some optimizations, we added capacity, and so all of a sudden we're pushing fewer requests through, we're making fewer HTTP requests. And there was one thing we remembered: every time we turned on sending analytics directly from Node to our analytics system, versus writing it to disk and then shipping the file over there.

Memory went nuts. So I was like, OK, that's a hint. Something about the HTTP client is a problem, and we knew there's some memory problem with the SSL one, but we were doing straight HTTP, no SSL or anything. Alright, so this is my favorite chart in the whole world; this has been my friend.

So this is a little over two weeks; every bump is a day. The peak is around 6–8 PM, which is when most Americans like to shop. And I said, hey guys, take two servers out, one from each data center (we have two), and reconfigure them to go back to sending the analytics via the HTTP client. And we bring them up. So right here is when we kill them, and then we start them again. And we were having a team argument, where I was right and everybody else was wrong, where basically we were looking here, right, at the first few hours, and everyone was like, see, you're wrong.

It has nothing to do with that. I said, just wait. And then it jumped up, and they're like, no, this is just warming up; give it a day or two, you'll see it's not it, you're looking for things. OK, fine, because that was a big one. Then we're looking at this next one, and I said, see, it's more. No, no, and at this one it was like, see, you're absolutely wrong, it's way slower now.

And I was like, no, and honestly it was hard to tell, but then this happened. These servers had been up for a week or two already, so they'd been building up, and this guy caught up pretty fast. And then if you do the trendlines, well, this trendline is exactly two times the other trendline. Because for every HTTP client request this guy is making, that one is making the same request plus the analytics one.

So we're making twice as many client requests, and, well, we're leaking twice as much memory, and I was like, yes! So now I know where it is. A long time ago, when we were still working on this, we had decided not to use request, the request module, for HTTP requests, because we felt there was so much going on in there that if that's where the memory leak was, we would never be able to find it. So we wrote our own tiny, tiny little 50-line request replacement.

So it's only one file. Within hapi there's a file called client, and that's what we're using: that's the client the proxy is calling, and that's the client the analytics are calling. So I was like, great, I'll take that one file, and I'll write a script that basically bangs the hell out of it, and I'll measure what's going on in memory. Because we doubled the use of that function, and we doubled the memory leak. Awesome.
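For flavor, here is the general shape of a client that small, as a hypothetical sketch (this is not hapi's actual client.js, just a thin wrapper over Node's core http module):

```javascript
// Hypothetical ~25-line stand-in for a tiny HTTP client wrapper.
const http = require('http');
const url = require('url');

function request(method, uri, payload, callback) {
  const parsed = url.parse(uri);

  const req = http.request({
    method: method,
    hostname: parsed.hostname,
    port: parsed.port,
    path: parsed.path
  }, (res) => {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => callback(null, res, body));
  });

  req.on('error', (err) => callback(err));

  if (payload) {
    req.write(payload);
  }
  req.end();
}

module.exports = { request };
```

With something that small there is almost no surface area for a leak to hide in, which was the whole point of writing it.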

I run it for an hour, I plot the results, and it's flat. Then I go and double the speed. Nope, flat. And I was like, screw it, I can't deal with this anymore. I leave the computer and go out to the farm to play with my animals, like, I'm done, I can't handle this anymore, and I forget about it. I come back in the morning, look at it, and I'm like, I forgot to kill this stupid thing; my console has been printing numbers all night. I was like, let's see what happens. So I kill it, copy and paste the whole thing, dump it into Excel, and now I have like 14 hours of data. I plot it, and when you do that in Excel the first thing you get is a tiny little chart, with a big thick line, but flat. I was like, great, but... wait. I make it really big, right-click, add a trendline, and the trendline is going up. I was like, yes!
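A sketch of that kind of soak test (the ./client path and the request signature are placeholders for whatever you're exercising): hammer the function in a loop, log RSS on an interval, and fit a trendline afterwards, because over a single hour the slope is buried in the noise, which is exactly why the short runs looked flat.

```javascript
// Hypothetical soak test: bang on the client for hours and log RSS so a
// trendline can be fitted afterwards (Excel, gnuplot, whatever you like).
const http = require('http');
const { request } = require('./client'); // the tiny wrapper sketched above

// Local target server so the test is self-contained.
const target = http.createServer((req, res) => res.end('ok'));

target.listen(0, () => {
  const uri = 'http://127.0.0.1:' + target.address().port + '/';

  // Keep 50 request loops running continuously.
  const fire = () => request('GET', uri, null, () => setImmediate(fire));
  for (let i = 0; i < 50; i += 1) { fire(); }

  // Print elapsed seconds and RSS every 30 seconds; leave it overnight.
  const start = Date.now();
  setInterval(() => {
    console.log(((Date.now() - start) / 1000) + '\t' + process.memoryUsage().rss);
  }, 30 * 1000);
});
```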

So it turns out that in order to find this damn memory leak, an hour wasn't enough. The increase was so small, but consistent over time. So at this point I had a script, and I was able to go back to the core team and say, hey. My hope was that I'd give them this gist and they'd say, hey idiot, you don't call JavaScript this way, or, here's the closure where you're leaking; I was sure of it. But I was happy either way, because if they could point me at the problem, this thing was gone, right?

So I go to them, and they're like, just turn this on and off, go run it for like 10 hours, and come back. And that's where TJ comes in. And it's not my problem anymore.
