Node.js in production: runtime log snooping

This post is the one of several about how we operate Node.js in production at Joyent. Most of my experience comes from working on our Manta Storage Service, which currently comprises a few thousand Node processes across three data centers.

Logging is one of the most primitive but most valuable forms of postmortem debugging. Logs let you figure out why a program crashed (or did the wrong thing) after it's already done so. In this post I'll describe a critical feature of the Bunyan logging library that we use daily to help debug failures in production: runtime log snooping.

The problem

In many production systems, there's a nasty tradeoff: the more verbose your logs, the more information you have to debug issues, but the greater the overhead of constructing, recording, transmitting, storing, and analyzing these logs as well. Log frameworks typically manage this with log levels: you use "debug" in development, which spews all kinds of information about what the program is doing, but you use "info" or some higher level in production, which filters only the "important" messages.

But it's often in production that you really need the debug messages that aretoo expensive to emit in production. It's when you've been paged in the middleof the night because some service's error count has exceeded some threshold. Ifyou're lucky, the failure is simple enough that the program can tell you exactlywhat's happened, but more often, particularly in distributed systems, wherefailure modes aren't so crisp, the error will be somewhat vague. You'll needmore information about what the program was doing in order to figure out whatwent wrong.

In many of these cases, restarting the service will bring it back to life, butit will also eliminate any hope of root causing the problem. You can alreadysee yourself getting paged again a few nights later for the same issue...

Bunyan

Trent Mick at Joyent wrote a Node module calledBunyan that we use in all of ourservices to log everything from free-form text intended for humans toprogrammatically consumable events that trip alarms or feed into our billingsystem. Bunyan logs are just newline-separated JSON records. A few fields havespecial meaning, like "time" and "msg", but programs can log whatever other JSONstructures they want for interpretation by humans or other programs. (In Manta,we use Bunyan log records for billing. These log entries include structuredpayloads that describe the billable operation, the user, the result, and soon.1) The "bunyan" command-line tool facilitates easyfiltering, colorizing output, and so on, while programmatic consumers can parseeach line as a JSON object.

There are lots of logging libraries out there, but Bunyan's killer feature isthe ability to snoop debug-level logs in production, without restarting (orotherwise impacting) the service. When an application callslog.debug("something went wrong"), and the log level is "info" (i.e., higherthan "debug"), the bunyan library drops the message, as you'd expect. When youwant to get at these debug messages, you fire up "bunyan -p *", which is justsugar for:

$ dtrace -Z -x strsize=4k -x switchrate=10hz \    -q -n 'bunyan*:::log-*{ printf("%s\n", copyinstr(arg0)); }' | bunyan

When you run this, you're enabling DTrace probes built into Bunyan. Enablingthese probes actually modifies your Node program's behavior to fire the probejust before dropping the log message. Firing the probe traps into the kernel,where the log message is buffered and eventually funneled up to the "dtrace"process and printed out to your terminal.

This is a subtle point, but it's very important: enabling these probes doesn'tcause the debug messages to appear in the regular log, but rather it causesthese messages to be collected, buffered by the kernel, and then made availableto your "dtrace" process. That means your service is still running at near fullspeed, not going through the code paths that write entries to the log.Importantly, if the volume of messages gets too high, DTrace will drop themrather than slow down the process. (You can tune the "switchrate" and buffersizes to control that.)

When you don't need the extra logging any more, CTRL-C the "bunyan" (or"dtrace") process. The program modifications are undone, and your program goeson dropping debug log entries.

Voilá! By using Bunyan as a logging library, you get the ability to walk up toa production service and enable debug logging without restarting it. This hasproven critical for a number of cases where it's too expensive to log debugmessages all the time, but we had a live service behaving badly, and restartingthe service would likely have cleared the issue without giving us a chance toroot-cause it.

Try it yourself

I ran these examples on OS X because it's convenient, but all of our productionservices run on SmartOS (anillumos distribution). This facility works on anysystem that supports DTrace USDT.

If you've got Node installed already, it's ridiculously easy to try this out.First, create a temporary directory and install bunyan locally:

dap@sharptooth $ npm install bunyan

then start a Node REPL, load up bunyan, and create a logger at level "info":

dap@sharptooth $ node> var r = require('bunyan')undefined> var l = new r({ 'name': 'test', 'level': 'info' })undefined

At this point, you can log "info" messages, but "debug" messages will be hidden:

> l.info('test info message'){"name":"test","hostname":"sharptooth.local","pid":1076,"level":30,"msg":"test info message","time":"2013-09-27T18:13:00.820Z","v":0}undefined> l.debug('something went wrong')undefined

Now in another shell, use "bunyan -p":

# In another shell:dap@sharptooth $ sudo bunyan -p '*'   # sudo required for DTrace

That will block waiting for something to happen. Back in the first shell, logthat debug message again:

> l.debug('something went wrong')undefined

and over in the second shell, you should see something like this:

[2013-09-27T18:16:19.679Z] DEBUG: test/1076 on sharptooth.local: something went wrong

Success! Because we used bunyan -p '*', we'll see log messages from allNode processes on the system. You can also specify a specific pid instead of'*'.

Dynamic probes for your application

Bunyan's dynamic probes are implemented using Chris Andrews'snode-dtrace-provider module,which makes it very easy to add custom probes to any application. (The wholeBunyan integration with DTrace is about 20 lines of code.)

Another great example that we also use heavily isnode-restify, which creates DTraceprobes for the beginning and end of each HTTP request, as well as each of thephases of request processing (auth, parsing, logging, and so on). You canenable these probes in production to get a fine-grained look at requestlatency, including what parts of eachrequest are taking a long time.

In this post, I described Bunyan, the Node.js logging library we use at Joyent.I explained how we use DTrace probes built into Bunyan to extract more debuginformation from a production service than we can normally extract withoutdisrupting the service itself. We focus on tools for production because thehardest issues often show up only in production deployments, and the productionenvironment is far more constrained than development. By building toolsdesigned for production use, we're able to debug failures both in the field andin development, in many cases root-causing them the first time we see them.


1 Manta service logs are a good example of both the value ofstructured logs and Manta's ability to process these logs. Manta service logsinclude both debug messages and more structured messages used to generate user'susage reports and bills. These logs are uploaded periodically to Manta. Togenerate each user's report, we run a map-reduce Manta job over these logs thatparses each record and tallies up reports per-user. Similarly, when we want tosee how often some particular error message has been seen, we run a Manta job togrep over the logs (literally just "grep PATTERN"). For more on using Manta forlog analysis, check out Wanelo's post on business data analytics usingManta.



Post written by David Pacheco