Alright, so Eran decides he's going to send me this Excel spreadsheet, and he's like, look, I've found the leak, I've totally found the leak. And so, Josh Clulow and I sit down, and we plot this leak that he sends us, and here's this wonderful thing. It's like, see look, it's trending up. It's totally trending up there, and I was like, no, no.
I don't see it. It's not happening, it's not happening at all. And then we went ahead and plotted the trendline, and sure enough, there's a certain amount of growth happening there. I was like, OK, alright. Alright, Eran, you deserve my full attention now. So, he sends us his little script. We shall go fullscreen.
No, that's not going to work. [audience inaudible] Yes, please do. OK. What is going on? Anyway, there's an R there, and an E and a Q on that left side. Anyhow, so we're going to start here with the little script that Eran sent. The server's straightforward: we're just going to spin up a server, and when a connection comes in we're going to resume it and then we're going to end it.
We really don't care what the server does. Alright, I closed the wrong one. Alright, so here is this little script from Eran. It's nothing particularly fancy. We're going to call request. When we get a response, or we have an error, we're going to kill it all. If we don't hit that in 60 seconds, we're going to kill it anyway. Then we're going to post a little bit of JSON.stringify'd data to it, keep track of how many requests we have in flight, and then we're going to do 400 of those every second. So nothing particularly fancy there. And then down here at the bottom, we're going to print out the memory usage: we've got RSS, heap total, heap used, and the number of connections that are still pending out there.
I was like, okay, we're going to run this and we're going to see what's going to happen here, Eran. You say this is leaking; let's find out. I did enable the GC, just to be on the safe side, so the GC runs every 500 milliseconds or 50 milliseconds, something absurd, just to make sure. So this little script starts running, and you can't even see my ZOMG new RSS, so it's checking to see whenever we hit a max RSS, and then we get 10 seconds of RSS printing out. All of this RSS is now going to be truncated, which is really bothering me. Okay. ZOMG. Can anybody see? Oh! my God, the RSS is rising. And if you watch this over time, this V8 heap here, and this is the amount of objects being used in the V8 heap, it's staying pretty much the same. It goes up, it comes down; but in the end, the RSS will go up and then come down. But then you keep running this.

Josh and I, we just decided we're going to walk away from it the same way Eran walked away from it. We're just going to stop looking at it; a watched pot never boils. Josh and I keep saying, it's heap fragmentation, something's going on, this is going to flatten out at some point in time, this is really going to flatten out, so we just walk away for four hours. I was like, Josh, it's still happening. He's like, no, it's small, it's a little leak. And it's still happening, and he was like, OK. So one of the great features of running on SmartOS is this wonderful feature that you've heard a bunch about: core dumps and MDB. I was like, okay, fine. Alright. Since we don't see the V8 heap growing, and as a matter of fact we can just go ahead and look at one of these (I stored a bunch of these), we're going to log in and check out one of these core files. This was about two hours in, and when we look at it, you're going to be like, there's hardly, hardly anything.
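The "I did enable the GC" bit earlier refers to running node with --expose-gc, which makes a global gc() function available; forcing collections on a short interval rules out "the GC just hasn't run yet" as an explanation for a rising RSS. A sketch of that, assuming the interval from the talk:

```javascript
// Run with: node --expose-gc script.js
// Forces a full collection on a short interval, so a rising RSS
// can't be blamed on the collector simply not having run yet.
function scheduleForcedGC(intervalMs) {
  if (typeof global.gc !== 'function') {
    console.error('re-run with --expose-gc to make global.gc available');
    return null;
  }
  return setInterval(() => global.gc(), intervalMs);
}

const timer = scheduleForcedGC(500); // every 500ms, as in the talk
```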
Yes. Say when. Okay. Can you see? Alright. So it's going through finding all the objects, and I'm like, oh, well, there's a few thousand objects and arrays there, but we are doing a significant amount. Now, if you remember, we're doing 400 requests at a time. So it's 400, 400, 400, 800, because of request and response objects or whatever, so we're seeing a consistent amount of objects there. I'm like, we aren't leaking.
Straightforward leak. On Linux you would run Valgrind or whatever to find out what this is; on SmartOS we link against libumem, which is a replacement memory allocator, and we can do some excellent auditing of what's going on. So we're going to turn on auditing, and we're going to say: every time malloc is called, I want to see the last 200 frames of the stack.
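Concretely, that setup is two environment variables before starting the process, per the umem_debug(3MALLOC) documentation (the script name here is a placeholder):

```shell
# Preload libumem as the malloc implementation and record, for every
# allocation it hands out, an audit trail of up to 200 stack frames.
LD_PRELOAD=libumem.so UMEM_DEBUG=audit=200 node server.js
```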
So malloc is called, and we're going to go ahead and grab the last 200 frames of what happens. We're just going to let that run in the background, and then we're going to come over here. [do do do] And we're going to say, for the next, we're going to loop from 1 to 100, once an hour. We're just going to go ahead and call pmap -x, gcore that process, and then put all the results in Manta. We'll go over that later, but anyway.
So we're just going to run pmap on this process and gcore on the process to generate a core every hour on the hour, and we're going to do that for 100 hours, or as long as you want to run this script. If we look here, and we grep, and we look at pmap -x, OK? We've got a bunch of information here. You're looking at your RSS in kilobytes and the address spaces of where things are, and here you can see we've got some stack space, some process space, and then here we've got heap, and then a bunch of these anon sections. Now, these anon sections are memory-mapped regions, and that's where V8 is storing its own heap space. So when your V8 objects are being allocated, they're being allocated down here, in the V8 heap in these anon regions, and then here, the regular heap that you're used to with malloc and free…
This is going to be the interesting part here. So we take this every hour on the hour, and we look at a couple of these, and I think I have a couple of these already on here. So here I have pmap.1 and one from 15 hours later, and I diff them, I'm like, oh! goodness, my total RSS here two hours in or one hour in was 58 MB, and now it's 224 MB. And there's hardly—you've got some changes here in the anonymous space but not a whole lot, and when you look at these going around, you've got the exact same memory sizes just being shifted around in address space, so you're like, OK, so all we're left with, if I scroll up here just a little bit, you're like, huh!
Well, look, I've got 160 MB more in my heap than I did before, and there's 160 MB right there, okay. So it's not V8. Okay, alright, so we have a malloc that's not being freed. That's what we know. Now, when you're in MDB, you're like, OK, that's fine. Let's go back to this core from 15 hours ago. So we've got a malloc without a free. On Linux, you'd normally run Valgrind, but you can't really run Node with Valgrind because it's a JIT; it slows everything down, and you get very, very sad very, very fast.
So here we have MDB, and because we ran with umem_debug on, we can run this command called ::findleaks, and this is magic. This is straight up magic. I have this little add-on layer for Node, so you can write Node modules in C, and I make mistakes, and I just keep blowing up memory, and I'm like, OK, it's fine.
I'll run with umem_debug on. I'll come here, I'll type ::findleaks, and sure enough, every time I do that, I can see stack traces of where I've screwed up. I'm like, this is what I'm going to see. I'm going to see here, libuv or Node has just totally, totally dropped the ball. Magic. Oh wait. Down here is where I'm supposed to see all of this information, and all I have is these four spots here, and when I add up the bytes, that's not 160 MB.
So now I've got a problem. It's not a traditional memory leak: we've called malloc, and something still holds a reference to that memory, so ::findleaks doesn't see it. It's still hanging around somewhere. So, let's spin up another one of these and mlogin public/walmart.graphs/core.2. Okay. Wake up, wake up. It needs to, can you make it faster?
Yes, so what we're going to do, whenever I can load up MDB again, is run this command, ::umem_cache. Now, the way umem works is, when you allocate memory, that allocation gets put in one of these caches based on the amount of memory you're asking for, and then you can look at the results and see how many allocations have happened in specific distributions of size.
So I'm going to look at this one first. mdb MANTA_INPUT_FILE, and we type ::umem_cache, and we see the size of the buffer that we have here, and then the total number of objects we've got going on. So, not a lot of numbers here; these numbers look pretty regular, but as the size of the cache increases, the size of the buffer increases, and then here's our count. We're looking at one from two hours in, so let's go ahead and look at the one from 15 hours in and see what kind of information we have.
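What ::umem_cache is showing can be pictured with a toy model: each allocation is rounded up into a fixed size class, and a per-class count that grows without bound fingers the leaking allocation size. (This is an illustration of the idea only; umem's real size classes are not simple powers of two.)

```javascript
// Toy model of slab-style size classes: round each request up to the
// next power of two and count allocations per class. A class whose
// count keeps growing points at the leaking allocation size.
function sizeClass(bytes) {
  let c = 8;                // smallest class in this toy model
  while (c < bytes) c *= 2; // round up to the next power of two
  return c;
}

const counts = new Map();
function recordAlloc(bytes) {
  const c = sizeClass(bytes);
  counts.set(c, (counts.get(c) || 0) + 1);
}
```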
We're like, well, something grew, because we've got 20,000 things in the audit cache. Uh-oh. The 4K buffer has 16,000 objects in it, almost 17,000. And here, two hours in, the 4K buffer has 1,700 objects in it. Okay, so at this point I was like, I've found it guys, don't worry, I know where it is, fear not. Alright.
So how am I going to find the leak in all of this? Now I'm going to have to figure out what's going on here. I'm like, okay. So Brendan Gregg, our wonderful DTrace guide and the person behind all of these visualizations here, came up with this wonderful flame graph idea, which is all about profiling and finding stack traces for your application, which we use nonstop, all the time. And I was like, what would be really great is if I could take all of these stacks and find out who's responsible for the majority of the memory in my heap, okay?
Tell me if I did that wrong. Alright, yes, awesome. I have this hammer and there are nails everywhere, so I uploaded all of these core files into Manta, and I wrote this. We have stackviz that I'm going to pass it to; I've got this symbol table and my little C++ filter for Node that I'm going to post-process everything with. The actual job looks like this, and it may make Bryan slightly nauseous to see that I didn't end up trying harder on this. So for every core file in Manta, I'm going to print out the umem cache that you saw before and grab the 4K buffer.
I'm going to walk that buffer like we saw, grab all the stacks for it, put them in a file, then filter that file to make sure I get good stacks, then run that through stackviz, make an SVG, and put it back into Manta. This is perfect; I'm going to do it for 15 minutes or whatever.
15 hours. [audience] You're OK with that awk? If you want to see this, I have a little job share that I will tweet here. I'm live tweeting, so I just tweeted this job share if you want to see, this is how I ran the job, and you can see that I've got 16 core files here that I've run through and the result is I have 16 flamegraphs.
So I'll go ahead and click the first one, and coloring is not necessarily what I would like on this projector. Why don't…Chrome, don't hate me, OK. Oh! my gosh. It's difficult to see this. OK. So the coloring doesn't work here. Anyway, so this shows you all the stack frames of where those memory allocations are, you're like, OK, so this is one hour in, I'm going to see these, let's look at that, let's look at 15 hours in, like we were looking at it before, like oh!
goodness, it's taking a lot longer to load, but it's still the same basic distribution here. And we're looking through, looking through, boom! Parser execute (this is the HTTP parser handler on there), and I was like, oh! awesome, this is totally a problem in the Node HTTP parser. No, no, not that problem, not at all. Boom, boom, boom, boom, HandleScope::Extend.
Now we're down in the depths of V8, stuck, mired in an awful, awful, dark, terrible, terrible place. OK. Alright, this has been a whirlwind tour; I swear we're nearing the end of this. I promise. OK. What's next? We're still having RSS grow over here. It really does grow, Eran. You were right, I was wrong.
Okay, so we're going to search for /parser_execute. Okay, here we are, somewhere in here, and we're like, okay, we're going to grab this memory location here and ask MDB to find this memory address somewhere out in the address space. And it's like, oh! you've got a bunch of these here, and you're like, what are those things? And they're not all that interesting, okay, well, let's try and find this. Sometimes V8 likes to tweak pointers just a little bit, like changing their lowest bit. So, well, maybe if we just mask off the last bit, we'll find where V8 has changed it and held it somewhere else. But what if we go ahead and be a little bit more aggressive, and ask for 3 bits to be stripped off the bottom?
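The bit-masking trick exists because V8 tags its pointers: type information lives in the low bits of an address, so the same allocation can show up with different low bits. Stripping the bottom bits (three of them here, as in the talk) recovers a common untagged address. A toy illustration of the masking itself, with BigInt standing in for 64-bit addresses (the example address is made up):

```javascript
// V8 tags pointers in their low bits (e.g. heap objects carry a tag
// bit). To search a core for an allocation regardless of tagging,
// strip the bottom 3 bits so tagged and untagged forms collapse to
// the same address.
function untag(addr) {
  return addr & ~7n; // clear the bottom 3 bits
}

// e.g. 0x82f6d1c3 (tagged) and 0x82f6d1c1 (tagged differently)
// both collapse to 0x82f6d1c0.
```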
And then we pipe that to ::whatis. Well, we found a few more extra things, and then, hmm, that's new: umem_oversize. That means V8 asked for an allocation that didn't fit in any of umem's cache sizes, right? Yes. So then we're going to take its address space. This is my cheat sheet, because this just happened about an hour ago. We're going to ask it for this. This is going to happen; I see Bryan is already churning on this. I need to walk the vmem_seg, right? What is it? How come that didn't work, Bryan? Did I grab the wrong address? No, I got the umem_oversize. And I can walk the vmem_seg, that's so beautiful, it's pretty. Right. 400, yeah, that's the address that we saw, and then this guy. One could argue.
So now, an hour ago, we found exactly where V8 is allocating the memory that's holding on to the memory that's being leaked. And unfortunately for me and Josh, we had spent most of the day in this code, and I knew exactly what this was already, and was very, very sad. Because this is basically a slab allocator that V8 has internally, where it keeps track of all the stuff that it's allocated, and it just told us what we already knew: it's forgetting to deallocate it. So, in case you were expecting to find ...profit, we're still on ..., tomorrow, absolutely. You're making a promise on that, Josh? Okay. To do, we'll go. [audience inaudible] Right. How much memory you're actually using. Yeah. [audience inaudible]
Yeah, but now I don't have a useful terminal size.
So this is going to show us all the addresses that were allocated, right? Alright. Okay. So we're going to dump this address space, and if we grab one of those, we're like, did I load it, just to be sure. We're like, okay, fine. What's this address? ::jsprint that, and, oh! undefined. An array of undefined.
Thank you, V8. You're awesome. [inaudible audience] Yes. That thing. That is a V8 problem, not a Node thing. It could be, since this is in the HandleScope stuff. If you don't know what a HandleScope is: a HandleScope just says, all these allocations that happen in the scope, when we leave the scope, deallocate them. And we were thinking that Node just wasn't using this right, but if this were a pure V8 bug, then everybody's Node application would be very, very sad. Instead, it's specifically around how we're exercising V8 in the HTTP client path. Whirlwind tour, I know; that's what happens. And this was not just me working on this; this was Josh and Bryan and Dave and Robert. When the signs go up, I keep getting as far as I can, and then I ask Josh, and Josh is like, oh! we're going to try this, and then we keep going a little bit further, and then we ask Dave, and Dave's like, oh! what if you tried this, and then it's like, hey! Bryan, we have an hour before the talk, what do you think?
Yeah, so this is what happens. Yeah. Yeah, no. Oh! yeah, is that what happens? What do you want to say about this? Right. Yeah, so the handle scope keeps track of all the local scope objects there, and as objects are being added to your handle scope, it decides whether or not it needs to extend that handle scope to add more space.
So it's possible that we're just allocating more objects in this path on the HTTP parser side of things, and that's making V8 decide, when it normally wouldn't, to add more space underneath it. There are four distinct spires in this. You can't see it as well from this; even if I blow it up, it doesn't really get all that interesting.
There are basically four distinct spires. This main path over here is all the HTTP parser side of things, and on this side over here it's all timers, or starting from the timers, but not coming through the parser execute path. So there are interesting things there, and it's not only in the HTTP client path; it's possible there are other places where people are seeing this in Node, but it's so small.
I mean, we're talking 4K every once in a blue moon. So. This is Trevor Norris, by the way, who is also on the core team. In case you want to know.
[Trevor] What if we just call the HTTP parser bindings directly, skip the whole client round trip and just [xx] the http parser?
Why don't you?
[Trevor] I'm going to go do that. Well, you have the knowledge.
[audience] I have a question for Eran. So you originally wrote your own client library because you were afraid that with Request you wouldn't be able to debug a memory leak; why don't you just swap Request back in [xx]?
Well, it's because of this; I begged him to find it without hapi involved in the mix. So this is all pure Node, and even Request is going to suffer from this problem as well.
[audience] So nobody else has noticed this or, nobody cares? [xx] another project? Why is this the first time it's been found?
Well, not a lot of people are doing long-lived processes that make HTTP client calls. Now, Nodejitsu certainly does for their frontend proxy stuff; I'm not sure if they've been noticing it, but it's certainly how Eran's side of things works.
[Eran] Is it on? We're good, alright. So, first of all
we started with Request. We had some problems with Request when we switched to 0.10, because hapi is fully streams2, and whenever you put anything through Request, because Request will pause and resume stuff, it automatically switches to old-mode streams. And it really was annoying me that I had streams1 code running when there was no reason for it. If you're using hapi in your environment and you have some old streams code that you're including, well, that's your decision and it will work, but I just didn't want that in. But we were getting the crazy leaks with Request. And then we got rid of it and nothing changed, so that wasn't the issue, but it did help us, because now the code that was able to reproduce it is tiny. I want to point two things out that, if they're not obvious, I think are really good advice for anybody who's dealing with this kind of stuff.
One, Manta is awesome. So, none of this is possible without SmartOS, and I'm not plugging anything; anybody who knows me knows I'm very hostile to promoting other people's stuff. But you need SmartOS in order to get all the magic that he was just doing. That's SmartOS; that's really the sweet spot for debugging on any platform. And it's still not the nicest environment to install in your home VM.
SmartOS is really designed to be installed as part of the Joyent cloud offering. So, if I want to do all this stuff at home and I don't have SmartOS installed, it's going to be challenging. What is really convenient, and we've done it a lot, is we throw stuff from production into Manta, all those dumps, and then you log in to Manta where your dump is, and you basically get an entire SmartOS shell, and you can run anything you want.
So you're basically SSHing into the computer where your dump is, with the dump locally on the file system, or at least that's what it looks like, and then you can run it. So it's very easy to set up, and that means anywhere you are, you can basically debug all of these. And those dumps can be really, really big, especially with a memory leak.
[TJ] Yeah. So one of the alternatives would be the heapdump module, which produces dumps you can load into Chrome. But if you're in a bad situation where you're hitting a memory leak and you hit the V8 limit of 1 GB, and then you try to load a 1 GB file into Chrome, things get sad there too.
[Eran] And the other thing, if it wasn't already obvious, is that not many people can do this stuff. They're all right there in this room. So I highly recommend, if you're betting your house on Node in production at this scale... and Walmart, the corporation, is big scale, but our Node installation at Walmart is not that big. We're passing all mobile traffic through it, but it's not like we're passing all of Walmart e-commerce or all of Walmart worldwide traffic through this.
So it's big, but Voxer is probably doing a much bigger scale of everything than we are doing with Node, and I think Matt will say the same thing: you want these guys on retainer. So you can call Bryan and say, hey! Bryan, come over here, here's another dump for you, go. So really.
[Bryan] I am actually that easy.
[Eran] But yes, we have a great team in-house that can do a lot of stuff, but this takes specialty. You basically have to spend quite a lot of time, and even then you have to go through places where you have to truly understand how your operating system is allocating memory and all those addresses, and I never wanted to be a kernel developer, so I'm not. But there are people who are great at it, so I would just point out that that's something you should pay attention to.