Node.js on the Road is an event series aimed at sharing Node.js production user stories with the broader community. Watch for key learnings, benefits, and patterns around deploying Node.js.
Dav Glass, Node.js Architect
I want to tell you guys a little bit about what we do at Yahoo with Node. Can anybody guess the first version of Node that Yahoo actually ran in production? No, no, no. Not that old. 0.4. 0.4 was the first version of Node that Yahoo actually ran in production. We actually built our stack on top of 0.3.9, I think. We love Node, I love Node.
I can't tell you how many times I've spawned up a REPL just to do some math because I didn't want to open a calculator. That's just me, I mean, how many of you guys have ever done something like that? You just spawn a REPL up real quick because you're like, I want to know what this is. Or I'm looking at some of the stats that we have, and I have this giant array of JSON that sits in the browser window, right? Cut and paste, open the REPL, paste the thing in there, call `.length`, and it tells me how many items are in the thing. I don't have to do anything anymore. So I love Node. I am very, very passionate about Node. But the one big question everybody asks is: does Yahoo actually run Node in production?
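For anyone who hasn't tried it, the one-liners he's describing work in the `node` REPL exactly as stated; the array contents here are made up for illustration:

```javascript
// Paste a blob of JSON into the REPL and poke at it -- no calculator,
// no separate tooling. (This data is illustrative, not Yahoo's.)
const stats = [
  { page: 'home', hits: 9001 },
  { page: 'mail', hits: 4242 },
];
console.log(stats.length); // how many items are in the thing -> 2

// ...or just do some quick math instead of opening a calculator.
console.log(2 * 1024 * 1024); // -> 2097152
```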
So the question is: what do you consider production? Because our corp instances of all our internal tools, we treat those just like we treat our production instances. So if I have an internal tool that they're using, it runs on the same stack that our production instances do. So not only do we have Node in production, but we run it internally for all of our cool toys. So we have this huge build pipeline, where the developers can write in real time on their local machine, on their Mac, and they can test all of this stuff, and they commit the thing up, and then it runs through a CI system, right?
And then that CI system will build it, run its tests, package it, and deploy it, and when it deploys it, it deploys it into this thing called Manhattan. Manhattan is our hosting infrastructure, Yahoo's global hosting infrastructure, we are talking colos all over the world, and that is how we build our code. But the cool part is that the Manhattan infrastructure is written in Node.
The API that we use to talk to it, that's written in Node. The command line tool that I use to push that, that's written in Node. The CI system runs on Jenkins, but we actually spawn a Node script that runs through and does all that for us. So everything that we do, from our local machine all the way down the production pipeline, is written in Node. Tell me that's not cool, right?
I mean, come on. If somebody at the scale of Yahoo is doing this, that's just cool. You can actually pull up the homepage, and I can point out the little boxes on the homepage that were rendered with Node. You can go to my.yahoo.com, and I can point out the little boxes on that page that were rendered with Node.
Some of those boxes were rendered with Java, some of those boxes were rendered with PHP, and a lot of them are written with Node. There are actually a couple of our sites that you can pull up, and you can open the developer tab, click on a request, look at the headers, and it says `X-Powered-By: Express`. Come on, isn't that cool?
I mean, it's totally cool. But it's not all cool. It's not always right. It doesn't always do everything that we want it to do, but the cool part about Node though, is it's actually hardly ever Node's fault itself. It's some module somewhere in between that does something stupid. That's what we've run into.
The problems we've actually run into are that some of these developers don't understand semver. Yahoo does. We've been built on it for years, but some of these guys don't, so they suddenly change all this API under the hood and ship it as a .1 patch release. And now everything breaks! So the real problem we have is more about teaching the engineers, teaching them the right ways to do things.
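Semver's contract, for reference: given `MAJOR.MINOR.PATCH`, a patch or minor bump must not break the public API; only a major bump may. The range check a package manager does can be sketched like this (a toy that only handles simple `^MAJOR.MINOR.PATCH` caret ranges with MAJOR >= 1 -- the real implementation is npm's `semver` package, which handles far more):

```javascript
// Toy caret-range check: '^1.4.0' means >=1.4.0 and <2.0.0.
// Illustrative only; use the `semver` npm package for real work.
function satisfiesCaret(version, range) {
  const parse = (v) => v.replace(/^\^/, '').split('.').map(Number);
  const [maj, min, pat] = parse(version);
  const [rMaj, rMin, rPat] = parse(range);
  if (maj !== rMaj) return false;      // a major bump is a breaking change
  if (min !== rMin) return min > rMin; // newer minors are additive, so fine
  return pat >= rPat;                  // newer patches are bug fixes, so fine
}

console.log(satisfiesCaret('1.4.2', '^1.4.0')); // true: safe patch bump
console.log(satisfiesCaret('2.0.0', '^1.4.0')); // false: breaking change
```

The breakage he's describing is a package that changes its API but ships the change as a patch release, so every consumer with a `^` range picks it up automatically.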
Some of the problems that we have are like, we have this old deployment system called Wyinst. Anybody ever heard of Wyinst? Well, Isaac actually calls it the grandfather of npm because npm gets a lot of stuff from it. So we package all of our stuff with Wyinst, and Wyinst basically just creates a giant tarball. That's really all it is.
But some of our developers haven't understood the proper ways to do their versioning, so we end up having this module up here at, like, 1.4, and then down here we've got five modules that all require this one but at 1.5. So now I've got one module here and five copies of this module there. Well, those five copies have 10 dependencies apiece, and those 10 dependencies have 15 dependencies apiece, and the next thing you know, my tarball is 950 MB. So those are the types of issues we actually have: training people to do this stuff. It's not actually the performance side of things, because the performance side, as I'm going to show you in a couple of minutes, has been freaking crazy.
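The blowup he's describing is multiplicative; plugging his numbers in shows how fast a tarball balloons when nothing gets deduplicated:

```javascript
// Five duplicated copies of one module, each dragging in 10 dependencies,
// each of those dragging in 15 more -- with no deduplication, every level
// multiplies the package count.
const copies = 5;
const depsPerCopy = 10;
const depsPerDep = 15;

const totalPackages =
  copies +                               // 5 duplicated copies
  copies * depsPerCopy +                 // 50 first-level dependencies
  copies * depsPerCopy * depsPerDep;     // 750 second-level dependencies

console.log(totalPackages); // -> 805 packages bundled for one logical module
```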
The difference we have is that we actually have to fit inside of a giant old infrastructure, so Node never physically touches a customer. The customers never actually touch Node; they touch something in front of it. We have this server called ATS. You guys ever heard of ATS, the Apache Traffic Server?
Okay, well, you should, because it powers like 50% of the static content on the internet. It's what all the edge caches are made of, so like Akamai's edge caches and all that stuff, those things are mostly ATS. They sit out there and they just pump out static data all day long. We actually had this thing called YTS, which was the first version before we open sourced it and gave it away, and it became the Apache Traffic Server. So we actually use that as our barrier in between, because the biggest problem we have is the one that we solve with the one and only use of domains that you will ever see (I say that because he hates domains)…
With our kind of traffic, we can't just kill the whole pid when there's an error, right? That's just stupid. I'll show you these in a minute; you guys see that little chart that was on my first slide? That's one of my production charts, and I'll just tell you the number over here is up in the 30s, but I won't tell you whether it's seconds or minutes or hours. But if there's an error, we can't just stop and say, hey, you 150,000 people don't get shit, right?
You can't be checking your Yahoo mail and suddenly the whole thing dies because something had a syntax error somewhere. So we actually wrap that, and if an error happens, we shut off the TCP connection so it will always get a 500 out of it. That way the edge server will throw it out of rotation.
And we can let the event loop spin down and finish any in-flight requests, and then we can gracefully kill the thing and do whatever we need to do with our logging and all that stuff. We page half a dozen different people and then fire the thing back up, or whatever we want to do.
But that's generally the only problem that we've ever had with Node. I don't think I've ever gotten anything more than a documentation change in, because I haven't needed to. We actually have not needed to make any core changes; the only changes that Yahoo has actually made to Node are Yahoo customizations.
So we've built security into our custom compiled versions, because we need to. I've actually open sourced this module. It's up on GitHub, github.com/yahoo, and it's called fs-lock. When you turn on fs-lock, you can give it a set of directories and say: this is the only place that the fs module can touch. It can't touch anything in the entire system except this set of directories. You can give it another set of directories and say: this is where require can get its stuff. So you can't ever fetch anything anywhere except for where I tell you to. We actually bake that into our binary for security reasons. That way you have to opt out of our security on our system and not be able to circumvent it; that's just wise security. The only way that you can actually get around it is with an environment variable, because that environment variable needs to be set before the process starts. We also don't allow child_process. So you can't spawn anything, you can't start anything, and you can't touch anything that we tell you you can't touch. But with the environment variables, that means I can set those in a bash script and then call my command line tool. So now I get all the cool stuff and it unlocks the security for me. But other than that, the only real pain is trying to keep up, because they are so much faster than we are.
You know, they're constantly churning out new code, so like I said, we have this giant build pipeline. Can anybody tell me what version of Node they think we're running now in production? 0.10 what? 0.10.25. And 0.10.26 goes out in two weeks, because it just went through the pipeline as we were sitting here. Actually, I was talking to TJ earlier and I'm like, hey, look at that, 0.10.26.
It just went through the pipeline, so it's good to go. We also build all of our stuff on 0.11, but we have to do it with, no offense, we have to use nan. I'm going to blame you, man, that's my job. So you guys ever use nan? So you guys must not be compiling your own stuff. We actually compile our stuff because Yahoo is an infrastructure company, so we have to support many, many languages.
All of our libraries have to run in Java, Python, PHP, C and Node. So we have to be able to wrap all of those internal tools and use the same source code from somebody else. We don't have access to that source code; we pull it in from them and then we wrap it the way that we need to in order to make the binary, because they're responsible for that.
Every team is responsible for their own thing, so they have to maintain our compatibility. So by using nan, we can actually run 0.8, 0.10 and 0.11 at the same time, once their tests pass. I will give him a hard time because he's not living up to it yet. But for the most part, those are our big issues, and like I said, it's not really issues with Node itself. It's actually issues with V8, or issues with a module, or something like that.
But the great thing is, we give back. We've actually helped Rod add a couple of things to that, because we needed it. We need to be able to do this stuff. We're constantly adding back to community modules. I think we have 180 on github.com/yahoo that we give away. I mean, my job is to say: if that doesn't have a patent or a trademark on it, it goes.
If there's no proprietary code in it, that thing goes on GitHub just like that. So, I've gone through 11 minutes, I'm good, right? I don't know where she went. So I've got numbers. You guys want to see numbers?
That's what we do per day. So we have over 400 applications and 400 libraries building per day on our continuous integration system.
In this, deploying to Manhattan is either internal or external, so something is always deploying to Manhattan, internally or externally. I have my own npm search that can look for packages. We're a search company; that's what we do. So I have my own, but it indexes my stuff so that my guys can find all of our modules, because we've got 400 modules that these guys are sharing, and they actually share these things back and forth between teams. They do the whole community thing. That's what was so great about it: our engineers build stuff at Yahoo like you build stuff. So you guys use GitHub, right?
We use git. You guys use Express, right? Do you guys use Express? Some of you use Express. We do too; we are looking into a few others, but we use Express. You guys use Travis?
Please tell me, God, tell me that you are using Travis to test your stuff. Come on, it's free. Okay, well, we have a thing called Screwdriver, which is the same. It tests all of our stuff, but we get a little more strict about it. We have rules that say if you don't have so much test coverage, you don't publish. If you don't have so much of a lint report, you don't publish. I should say hint report. We actually have rules for these libraries inside of Yahoo that they must have over 80% code coverage, or you cannot use it.
You cannot share it if you do not prove that you have over 80%. Any of you guys ever heard of the code coverage tool called Istanbul? You know we wrote that? One of our guys wrote that, and it's all written in Node. It's cool, isn't it? It will hurt your feelings bad, but that's what we use. If you thought JSLint was bad, wait till you run Istanbul when you think you're covered and you're really not. So we run that on here. And actually, the primary engineer on Screwdriver, which is our continuous integration system, was the guy that wrote Istanbul, at one of our hack days, in a 24 hour period. Cool, right?
That is really freaking cool. So, I've got some pretty charts for you. 300-500 requests per second. That's kind of cool, right? That's alright; that's one of our new ones. It's an internal thing. What about 1,200 requests a second? That's getting up there just a little bit more. And a little bit higher: about 1,800 requests per second. That's not too bad.
So these are the ones that are powering those modules on yahoo.com and my.yahoo.com, and they are in a 3% bucket. They're in a 3% bucket and they're doing 1,800 requests per second. The server that they talk to does about 3,500 requests per second. Now, you also have to remember that these requests per second are after it's gone through the cache.
So this is the actual Node number under the hood; this is not the ATS cache in front of it. This is the actual physical touching of Node. And this one's kind of cool. That one's actually really interesting. This is a project that I was following for a long time at Yahoo. It uses zero modules.
It is 100% straight Node. It is super cool. Any of you guys ever watched a video on one of Yahoo's web pages? We've got a lot of those just floating around, right? Well, this is our quality of service beacon that beacons back all kinds of information about how well that video played. It gives us frame rates and sound, whether you paused it, whether you played it, whether the data dropped out, whatever.
That thing just sits there and gets the holy hell kicked out of it, because that's all it's doing. It's accepting that beacon and dumping it into a database somewhere. But it does that in real time, so there is no caching in front of these things. So that's pretty slick, right? How about that one?
And like I said, this is about a 2.5-3% bucket of what we're actually testing. This is also the Yahoo network, so it has nothing to do with any of the people we've bought, or Flickr. Any of you guys know that Flickr runs Node a little bit? You guys tried that? You guys seen the new Flickr page, when you click on it, it shows you the brand spanking new image page, right?
The number one page visited on Flickr is that single-image view, right? They had planned for forty machines. They had all forty of these machines, and they were saying, we can do 14,000 requests per second per machine, that's what we're going to have to do to turn these things on. And one of the guys, I don't know who did it, decided to say: 0 to 100% bucket, you all get it.
They didn't even think about it. They didn't even do 1%; they just went 0-100%, and they shut 30 machines off because they didn't need them.
Node in Production
See techniques for deploying a large-scale, high-uptime production cluster.