Profiling Node.js Processes on SmartOS
Introduction
In this blog post, I will show ways to gather information about Node.js processes running on SmartOS (how to do this on other operating systems is left as an exercise for the reader). Specifically, we'll examine CPU utilization, memory usage, file system I/O, and network I/O, using DTrace plus a few other tools available on SmartOS instances running in the Joyent Public Cloud.
It should be mentioned that as I gathered information to write this post, I found that cloud analytics already does most of what you might want. Also, Brendan Gregg of Joyent has written extensively about performance, both in blog posts and in books. For instance, take a look at Brendan's Blog, specifically the section on Node.js latency. The DTrace scripts and other utilities I show below should work on any SmartOS system (and most should work on any illumos- or Solaris 10/11-based system).
Most (all?) of the DTrace scripts already exist in one form or another in Brendan Gregg's DTraceToolkit. The DTraceToolkit contains a set of commands that use DTrace to measure CPU utilization, file system I/O, network I/O, and more.
Most of the following will work on any system that supports DTrace (with few, if any, modifications), and everything will work with processes other than node.
The D scripts and other tools used in this post could be used as a basis for metering or monitoring your node applications.
CPU Utilization
The simplest command to measure CPU utilization for a given Node.js process is prstat(1M). (Note that top is available and can be installed using pkgin(1).)
Here is example output:
# pgrep node
94508
# prstat -c -p 94508
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M cpu1     1    0   0:26:56  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.47, 2.28, 2.22
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M cpu0     1    0   0:27:01  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.43, 2.26, 2.21
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M cpu1     1    0   0:27:06  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.40, 2.24, 2.21
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M cpu1     1    0   0:27:11  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.37, 2.22, 2.20
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M run      1    0   0:27:16  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.34, 2.20, 2.19
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 94508 root       36M   20M cpu0     1    0   0:27:21  50% node/2
Total: 1 processes, 2 lwps, load averages: 1.31, 2.18, 2.19
^C
#
The above output shows a node process that uses about 50% of the CPU available on the system. The system being used has 2 CPUs, so the process is consuming the equivalent of one full processor. In general, prstat shows recent CPU usage by the process.
What if we wanted a closer look?
# prstat -cmL -p 94508
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 94508 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.3   0 584  14   0 node/1
 94508 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.2 476   1 476   0 node/2
Total: 1 processes, 2 lwps, load averages: 0.93, 1.25, 1.75
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 94508 root      99 0.0 0.0 0.0 0.0 0.0 0.0 0.9   0 629   0   0 node/1
 94508 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.3 478   1 478   0 node/2
Total: 1 processes, 2 lwps, load averages: 0.94, 1.25, 1.75
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 94508 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.1   0 297   0   0 node/1
 94508 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.1 474   0 474   0 node/2
Total: 1 processes, 2 lwps, load averages: 0.95, 1.25, 1.75
The "-L" option shows a row for each thread in the process, and "-m" shows microstate information for each thread. From this output, one thread of node is spending all of its time running user-level code, while the other thread is sleeping (SLP = 100) all of the time. The total CPU usage for a thread is shown as a percentage (the sum of the USR, SYS, TRP, TFL, and DFL fields; see the man page).
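To make that sum concrete, here is a tiny illustrative calculation using the microstate values for thread 1 in the first prstat sample above (this is just arithmetic on the example numbers, not a measurement):

```shell
# Total CPU% for a thread = USR + SYS + TRP + TFL + DFL
# (values taken from the node/1 row in the first sample above).
awk 'BEGIN { printf("thread 1 total: %.1f%%\n", 100 + 0.0 + 0.0 + 0.0 + 0.0) }'
```

For thread 2, the same sum is essentially 0%, which matches a thread that sleeps all of the time.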
While we're at it, let's look at the CPUs on this system.
# psrinfo -vp
The physical processor has 1 virtual processor (0)
  x86 (GenuineIntel 306A9 family 6 model 58 step 9 clock 2392 MHz)
        Intel(r) Core(tm) i7-3635QM CPU @ 2.40GHz
The physical processor has 1 virtual processor (1)
  x86 (GenuineIntel 306A9 family 6 model 58 step 9 clock 2392 MHz)
        Intel(r) Core(tm) i7-3635QM CPU @ 2.40GHz
#
Let's use DTrace to get the exact amount of CPU time that the process is using. The following script (cpu_usage.d) prints, once per second, the amount of time spent on CPU by a given process (the pid is passed in as an argument, and becomes $target in the script). It shows the amount of time spent executing on each processor, as well as the total time. Note that for threads that switch on and off CPU very often, it may be better to use the DTrace profile provider rather than the sched provider, as it may have less overhead, though it won't be as accurate.
#!/usr/sbin/dtrace -qs

sched:::on-cpu
/pid == $target/
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @[cpu] = sum(timestamp - self->ts);
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}

tick-1sec
{
        printf("%12s %12s\n", "CPU ID", "NSECS");
        printa(@);
        printa("Total time in nsecs = %@d\n", @total);
        trunc(@);
        trunc(@total);
}
Running the above for a given Node.js process:
# ./cpu_usage.d -p `pgrep node`
      CPU ID        NSECS
        1              1261605
        0            998760443
Total time in nsecs = 1000022048
      CPU ID        NSECS
        1              6171524
        0            986705066
Total time in nsecs = 992876590
      CPU ID        NSECS
        0              3576080
        1            983075132
Total time in nsecs = 986651212
      CPU ID        NSECS
        1              6374274
        0            997694246
Total time in nsecs = 1004068520
^C
#
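As a quick cross-check, the totals above line up with the 50% that prstat reported earlier: on-CPU nanoseconds divided by total CPU capacity gives the utilization. Here is a small sketch of that conversion (the awk helper is mine, and the numbers are just the first interval from the output above):

```shell
# Convert on-CPU nanoseconds (over a 1-second interval) into a percentage
# of total system CPU capacity. Example numbers from the output above.
nsecs=1000022048   # total on-CPU time for the interval
ncpus=2            # number of CPUs on this system
awk -v n="$nsecs" -v c="$ncpus" \
    'BEGIN { printf("%.1f%% of total CPU\n", n / (c * 1000000000) * 100) }'
```

With 2 CPUs, one second of wall-clock time represents 2000000000 nsecs of capacity, so this prints almost exactly 50%.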
This is for the same compute-bound node program shown earlier. And, of course, this could be for any process, not just node. Note that not all of the work is done on just one CPU. We could break this down further by adding the thread ID (tid) as a key. Here is the cpu_usage_thread.d script.
#!/usr/sbin/dtrace -qs

sched:::on-cpu
/pid == $target/
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @[cpu, tid] = sum(timestamp - self->ts);
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}

tick-1sec
{
        printf("%10s %8s %10s", "CPU ID", "TID", "NSECS");
        printa(@);
        printa("Total time in nsecs = %@d\n\n", @total);
        trunc(@);
        trunc(@total);
}
And the output:
# ./cpu_usage_thread.d -p `pgrep node`
    CPU ID      TID      NSECS
        1        2         962332
        1        1        9746023
        0        1      970480428
Total time in nsecs = 981188783

    CPU ID      TID      NSECS
        1        2         959221
        1        1        4316602
        0        1     1000276644
Total time in nsecs = 1005552467
^C
#
Expressing CPU usage as a percentage can have multiple meanings. It could be the percentage of the total amount of CPU available on the system. On a virtualized SmartOS instance (i.e., a zone), it could mean a percentage of the CPU cap for that zone. For the first meaning, on a system with 2 CPUs, a value of 1005552467 nsecs in 1 second is just over 50% (1005552467/2000000000, where 2000000000 nsecs is 2 seconds' worth of CPU time). To find the cap for a zone, use prctl(1).
# prctl -n zone.cpu-cap -i zone $(zonename)
zone: 63: d6e7ac12-a60c-4a1d-9aea-d8d9983ee6c8
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-cap
        usage               2
        privileged        600       -   deny                                 -
        system          4.29G     inf   deny                                 -
#
So, the CPU cap on this instance is 600, or 6 CPUs' worth. The system considers the zone to be "bursting" if its CPU usage is above zone.cpu-baseline:
# prctl -n zone.cpu-baseline -i zone $(zonename)
zone: 63: d6e7ac12-a60c-4a1d-9aea-d8d9983ee6c8
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-baseline
        usage              40
        privileged         40       -   none                                 -
        system          4.29G     max   none                                 -
#
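Since the zone is considered to be bursting whenever usage exceeds the baseline, one could script a simple check by parsing this prctl output. This is only a sketch: the function name is mine, and the awk field positions are an assumption based on prctl printing the usage and privileged values on separate lines, as above.

```shell
# Hypothetical sketch: report whether this zone is "bursting" by comparing
# the current usage against the privileged (baseline) value in the output
# of `prctl -n zone.cpu-baseline`. Field positions are assumptions.
is_bursting() {
    prctl -n zone.cpu-baseline -i zone "$(zonename)" |
        awk '$1 == "usage"      { u = $2 + 0 }
             $1 == "privileged" { b = $2 + 0 }
             END { exit (u > b) ? 0 : 1 }'
}
```

It could then be used as, e.g., `is_bursting && echo bursting`.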
So, when processes in the zone use more than 40% of one CPU in total, the zone is "bursting". Let's run the compute-bound node process multiple times on a system with the same cap and baseline, but with more (16) CPUs.
# psrinfo
0       on-line   since 02/12/2013 01:05:02
1       on-line   since 02/12/2013 01:05:09
2       on-line   since 02/12/2013 01:05:09
3       on-line   since 02/12/2013 01:05:09
4       on-line   since 02/12/2013 01:05:09
5       on-line   since 02/12/2013 01:05:09
6       on-line   since 02/12/2013 01:05:09
7       on-line   since 02/12/2013 01:05:09
8       on-line   since 02/12/2013 01:05:09
9       on-line   since 02/12/2013 01:05:09
10      on-line   since 02/12/2013 01:05:09
11      on-line   since 02/12/2013 01:05:09
12      on-line   since 02/12/2013 01:05:09
13      on-line   since 02/12/2013 01:05:09
14      on-line   since 02/12/2013 01:05:10
15      on-line   since 02/12/2013 01:05:10
#
Running 5 instances of the process shows the following in prstat(1M).
# prstat -mcLv -p `pgrep -d "," node`
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24382 root     100 0.1 0.0 0.0 0.0 0.0 0.0 0.1   1 227  40   0 node/1
 24386 root     100 0.1 0.0 0.0 0.0 0.0 0.0 0.1   1 217  40   0 node/1
 24389 root     100 0.1 0.0 0.0 0.0 0.0 0.0 0.1   2 202  41   0 node/1
 24321 root     100 0.1 0.0 0.0 0.0 0.0 0.0 0.1   3 219  39   0 node/1
 24245 root     100 0.1 0.0 0.0 0.1 0.0 0.0 0.1   5 225  39   0 node/1
 24321 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.1 499   1 499   0 node/2
 24389 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.1 500   1 500   0 node/2
 24382 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 500   0 500   0 node/2
 24386 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 499   1 499   0 node/2
 24245 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 500   0 500   0 node/2
Total: 5 processes, 10 lwps, load averages: 2.33, 0.59, 0.20
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24386 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0 142   0   0 node/1
 24389 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0 102   0   0 node/1
 24382 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.1   0 218   0   0 node/1
 24321 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.1   3 112   0   0 node/1
 24245 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.1   4 220   0   0 node/1
 24321 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.1 500   1 500   0 node/2
 24245 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 501   2 500   0 node/2
 24386 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 500   1 500   0 node/2
 24382 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.0 501   1 500   0 node/2
 24389 root     0.0 0.1 0.0 0.0 0.0 0.0 100 0.1 500   0 500   0 node/2
Total: 5 processes, 10 lwps, load averages: 2.54, 0.66, 0.23
...
^C
#
From this output, each of the 5 processes is spending 100% of its time in user mode (executing code in the application). Let's see what the current CPU usage is with respect to the CPU cap.
# prctl -n zone.cpu-cap -i zone $(zonename)
zone: 63: d6e7ac12-a60c-4a1d-9aea-d8d9983ee6c8
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
zone.cpu-cap
        usage             356
        privileged        600       -   deny                                 -
        system          4.29G     inf   deny                                 -
#
So, even though there are 5 processes running non-stop, the system only counts 356 towards the cap of 600 (i.e., roughly 3.5 CPUs' worth). Note that you may see the numbers being more erratic. One would expect the current usage to be slightly more than 500: each of the node processes has 2 threads, and while one thread is doing almost all of the work, the other thread is also doing some work (500 system calls per second). Using DTrace on the node processes gives exact times.
# ./cpu_usage_name.d node
      CPU ID        NSECS
        6             17358262
       14            120929836
        8            138903983
       13            170075625
       11            209486653
        2            214565944
        4            227744020
       10            293542515
       12            338535596
        5            365796020
        1            377194770
       15            444786533
        7            470248592
        0            496330717
        3            656849191
        9            838369383
Total time in nsecs = 5380717640

      CPU ID        NSECS
        8             27498575
        1             74287799
        4             77530043
       14             77689710
       10            180125042
       15            207094470
       11            284097209
       12            290378415
        0            294754013
        2            301221145
        9            316360663
        6            337518185
        7            367458998
        5            395776929
       13            478533303
        3            497458312
Total time in nsecs = 4207782811
^C
#
So sometimes the processes are over the cap, and sometimes under it. The cpu_usage_name.d script is a slight modification of the scripts shown earlier.
#!/usr/sbin/dtrace -qs

sched:::on-cpu
/execname == $$1/
{
        self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
        @[cpu] = sum(timestamp - self->ts);
        @total = sum(timestamp - self->ts);
        self->ts = 0;
}

tick-1sec
{
        printf("%12s %12s", "CPU ID", "NSECS");
        printa(@);
        printa("Total time in nsecs = %@d\n\n", @total);
        trunc(@);
        trunc(@total);
}
The CPU usage calculation used to enforce the CPU cap is performed once every 10 msecs (i.e., every clock tick). According to a comment (and the code itself), if a thread uses multiple clock ticks without going off CPU, it will only be charged 1 tick towards its cap. The relevant code is in cpucaps.c; note specifically the comment starting at line 39, and the caps_charge_adjust() function in that file. For compute-bound threads, the time slice for which they are scheduled will often be longer than 1 clock tick. We could check this further with DTrace, but we would need to do that in the global zone. Using DTrace to show that the discrepancy can be explained by this is left as an exercise.
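A back-of-the-envelope calculation suggests how big the effect is here. We expected usage slightly above 500 but the cap accounting reported 356; the ratio of the two hints at what fraction of each long time slice actually gets charged. This is illustrative arithmetic on the example numbers above, not a measurement:

```shell
# Illustrative arithmetic only: with 5 compute-bound node processes one
# would expect charged usage slightly above 500, but prctl reported 356.
awk -v charged=356 -v expected=500 \
    'BEGIN { printf("charged fraction: %.2f\n", charged / expected) }'
```

A fraction well below 1 is consistent with compute-bound threads holding the CPU across multiple clock ticks while being charged for only one.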
In this example, we only examined CPU-bound node processes. We could spend more time here using DTrace and flame graphs. Flame graphs are a visualization that lets you see where your app is spending its time.
File System Utilization
To "see" file system utilization system-wide within a SmartOS virtualized instance, you can use vfsstat(1M). The following gives output once a second. The first line should be ignored; it is an average since the system was booted. See the man page for a description of the columns.
# vfsstat 1
  r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 10.2   0.0   7.8   0.1   0.0   0.0    0.0    0.0   0   0   0.0   32.0 d6e7ac12 (63)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 d6e7ac12 (63)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 d6e7ac12 (63)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 d6e7ac12 (63)
^C
#
To examine file system I/O by node processes, we'll watch the system calls that access files. We'll start with a simple example that reports the number of file system I/O calls per second made by a node app.
Here's the script:
#!/usr/sbin/dtrace -qs

syscall::read:entry, syscall::readv:entry, syscall::pread:entry
/pid == $target && fds[arg0].fi_fs == "zfs"/
{
        @r[pid] = count();
}

syscall::write:entry, syscall::writev:entry, syscall::pwrite:entry
/pid == $target && fds[arg0].fi_fs == "zfs"/
{
        @w[pid] = count();
}

tick-1s
{
        printf("\nIOPS\n");
        printa("Read operations: pid = %d, #io = %@d\n", @r);
        printa("Write operations: pid = %d, #io = %@d\n", @w);
        trunc(@r);
        trunc(@w);
}
Running the script on some process:
# ./node_fsrw.d -p 7767

IOPS

IOPS
Write operations: pid = 7767, #io = 7

IOPS

IOPS
Write operations: pid = 7767, #io = 11
...
^C
#
That tells us about IOPS. What about throughput, i.e., the number of bytes read and written?
#!/usr/sbin/dtrace -qs

syscall::read:entry, syscall::readv:entry, syscall::pread:entry
/pid == $target && fds[arg0].fi_fs == "zfs"/
{
        self->inr = 1;
        self->fd = arg0;
}

syscall::read:return, syscall::readv:return, syscall::pread:return
/self->inr && pid == $target && fds[self->fd].fi_fs == "zfs" && arg1 >= 0/
{
        @r = sum(arg1);
        self->inr = 0;
}

syscall::write:entry, syscall::writev:entry, syscall::pwrite:entry
/pid == $target && fds[arg0].fi_fs == "zfs"/
{
        self->inw = 1;
        self->fd = arg0;
}

syscall::write:return, syscall::writev:return, syscall::pwrite:return
/self->inw && pid == $target && fds[self->fd].fi_fs == "zfs" && arg1 >= 0/
{
        @w = sum(arg1);
        self->inw = 0;
}

tick-1s
{
        printf("\nThroughput in bytes/second\n");
        printa("Read: %@d\n", @r);
        printa("Write: %@d\n", @w);
        trunc(@r);
        trunc(@w);
}
The above script watches for read/write-related system calls made on files in ZFS file systems. For each read/write, it sums the number of bytes read/written in aggregations and prints them out once a second. Note that on other operating systems and file systems (for instance, on BSD), the value to match against in place of "zfs" will be different.
Here is output from the script.
# ./node_rwthruput.d -p 7767

Throughput in bytes/second

Throughput in bytes/second

Throughput in bytes/second

Throughput in bytes/second

Throughput in bytes/second
Write: 3056

Throughput in bytes/second
Write: 6112
^C
#
Using DTrace, we can drill down to find out which file(s) are being read/written, as well as time distributions showing how long the operations take. We can also get stack backtraces showing where in the code the reads/writes occur.
Network Utilization
For network utilization, we'll look at the number of packets sent/received, and the number of bytes sent/received. To see this for each network device system-wide, use netstat(1M):
# netstat -I net1 -i 1
    input   net1      output          input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls
22055630 0     8531    0     0     29403866 0     316911  0     0
19      0     0       0     0     27      0     1       0     0
10      0     0       0     0     17      0     1       0     0
8       0     0       0     0     14      0     1       0     0
14      0     0       0     0     21      0     1       0     0
11      0     0       0     0     16      0     1       0     0
7       0     0       0     0     10      0     1       0     0
^C
#
Again, the first line can be ignored (it shows statistics since the system was booted). If you are running in the global zone, you can use the nicstat(1M) command to get throughput in both packets and data sent/received per second.
To watch network traffic for node processes, we'll use DTrace to watch I/O-related system calls on sockets. Note that this is not necessarily the number of packets sent/received, but is a reasonable approximation. An accurate count requires access to the kernel in the global zone.
#!/usr/sbin/dtrace -qs

syscall::read:entry, syscall::readv:entry, syscall::recv:entry
/pid == $target && fds[arg0].fi_fs == "sockfs"/
{
        @r = count();
}

syscall::write:entry, syscall::writev:entry, syscall::send:entry
/pid == $target && fds[arg0].fi_fs == "sockfs"/
{
        @w = count();
}

tick-1sec
{
        printf("%13s %14s\n", "INPUT PACKETS", "OUTPUT PACKETS");
        printa("%@13d %@14d\n", @r, @w);
        trunc(@r);
        trunc(@w);
}
And here is output from the script for a given node process.
# ./net_iops.d -p 82463
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
           14             22
INPUT PACKETS OUTPUT PACKETS
          129            128
INPUT PACKETS OUTPUT PACKETS
           57             50
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
INPUT PACKETS OUTPUT PACKETS
^C
#
To see the throughput (i.e., the number of bytes read/written), we'll use:
#!/usr/sbin/dtrace -qs

syscall::read:entry, syscall::readv:entry, syscall::recv:entry, syscall::recvfrom:entry
/pid == $target && fds[arg0].fi_fs == "sockfs"/
{
        self->rfd = arg0;
}

syscall::read:return, syscall::readv:return, syscall::recv:return, syscall::recvfrom:return
/pid == $target && fds[self->rfd].fi_fs == "sockfs" && arg0 > 0/
{
        @r = sum(arg0);
}

syscall::write:entry, syscall::writev:entry, syscall::send:entry, syscall::sendto:entry
/pid == $target && fds[arg0].fi_fs == "sockfs"/
{
        self->wfd = arg0;
}

syscall::write:return, syscall::writev:return, syscall::send:return, syscall::sendto:return
/pid == $target && fds[self->wfd].fi_fs == "sockfs" && arg0 > 0/
{
        @w = sum(arg0);
}

tick-1sec
{
        printf("%13s %14s\n", "INPUT BYTES", "OUTPUT BYTES");
        printa("%@13d %@14d\n", @r, @w);
        trunc(@r);
        trunc(@w);
}
And here is output for the script.
# ./net_throughput.d -p 82463
  INPUT BYTES   OUTPUT BYTES
  INPUT BYTES   OUTPUT BYTES
  INPUT BYTES   OUTPUT BYTES
         4785          12412
  INPUT BYTES   OUTPUT BYTES
         7920          20544
  INPUT BYTES   OUTPUT BYTES
         3795           9844
  INPUT BYTES   OUTPUT BYTES
  INPUT BYTES   OUTPUT BYTES
^C
#
Note that there are DTrace probes built into node itself that give you information about garbage collection, client and server requests and responses, and more. See DTrace Probes for node for more information. You can also add your own DTrace probes to your node application (see node-dtrace-provider). And if you are using node-restify, there are built-in DTrace probes for each route that your app specifies.
Memory Utilization
For system-wide memory utilization, you can use vmstat(1M). For memory utilization of a specific zone (a SmartOS virtualized instance), use zonememstat(1M). For memory utilization by a specific node process, use pmap(1). For instance:
# while true; do pmap -x 82463 | egrep 'Address|total'; sleep 1; done
         Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
        total Kb   51400   42048   32920       -
         Address  Kbytes     RSS    Anon  Locked Mode   Mapped File
        total Kb   51400   42056   32920       -
^C
#
The important columns in the pmap(1) output are Kbytes and RSS. These are the virtual address space size (in kilobytes) and the resident set size (the amount, in kilobytes, that is physically present in memory), respectively. If Kbytes grows over time, it may be a sign that you have a memory leak. But it could also simply mean that your node process is busy.
Note that for processes with a large RSS, sampling like this every second adds a lot of overhead. It is probably better to do it once every 5 to 30 seconds.
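A small wrapper along the following lines could do that sampling at a gentler interval, timestamping each sample so growth can be tracked over time. This is a hypothetical sketch: the script name, arguments, and the awk field positions (which assume pmap -x output like that shown above) are all my own assumptions.

```shell
#!/bin/bash
# Hypothetical sketch: sample a process's total Kbytes and RSS every
# `interval` seconds (default 10), prefixed with an epoch timestamp.
# Field positions assume pmap -x "total" lines like the output above.
pid=$1
interval=${2:-10}
while kill -0 "$pid" 2>/dev/null; do
    pmap -x "$pid" | awk -v t="$(date +%s)" \
        '/^total/ { print t, $3, $4 }'    # epoch, Kbytes, RSS
    sleep "$interval"
done
```

Logging that output to a file gives a time series that makes steady Kbytes growth (a possible leak) easy to spot.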
To use DTrace to watch heap growth for a process, you can run:
# dtrace -n 'syscall::brk:entry /pid == $target/ {@[jstack(500,8000)] = count();} tick-60s {printa(@); trunc(@);}' -p `pgrep node`
Note that node (and other applications) can grow heap space using brk(2), but can also grow space using mmap(2). See Brendan's blog post Memory Leak (and Growth) Flame Graphs and Bruning Questions: Where's My Memory Going? for details.
Conclusions
Using DTrace to get information about Node.js applications allows you to examine the entire software stack. It works at a macro level (seeing the entire system), and also lets you drill down to individual processes at the instruction level.
Note: vmstat has been extensively covered by Brendan Gregg in a series of short videos.
Post written by Mr. Max Bruning