Bruning Questions: prstat(1M) output

April 22, 2013 - by Mr. Max Bruning

Two weeks ago, I wrote about the /proc tools, but didn't talk about one of the most useful tools that use /proc on SmartOS: prstat(1M). This tool is analogous in many ways to top(1) on Linux. Note that top(1) does work on SmartOS/illumos/Solaris based systems. Brendan Gregg wrote an article comparing the performance impact of prstat(1M) versus top(1) a few years back, available here. Rather than concentrate on performance, I want to spend some time going into the details of prstat(1M) output and how it can be useful. This blog will also show some DTrace, because it is such a useful tool, and I like it so much.

First, if you want to use top(1) on a SmartOS SmartMachine, do the following:

# pkgin install top-3.6.1nb1
calculating dependencies... done.

nothing to upgrade.
1 packages to be installed: top-3.6.1nb1 (77K to download, 0B to install)

proceed ? [Y/n] Y
downloading packages...
top-3.6.1nb1.tgz                    100%   37KB  37.5KB/s  37.5KB/s   00:00    
installing packages...
installing top-3.6.1nb1...
pkg_install warnings: 0, errors: 0
reading local summary...
processing local summary...
updating database: 100%
marking top-3.6.1nb1 as non auto-removable
#

Now, on to prstat(1M).

In its simplest form, prstat reports statistics about processes currently on the system, sorted by cpu usage and updated every 5 seconds. By default, it will show as many processes as fit in the window or terminal where it is being run.

# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
  7160 root     3864K 2772K cpu0     1    0   0:00:00 0.2% prstat/1
  6919 root     3176K 2180K sleep    1    0   0:00:00 0.0% bash/1
  3708 root     4020K 1520K sleep   59    0   0:00:00 0.0% sshd/1
  3680 root     2376K 1424K sleep   56    0   0:00:00 0.0% ttymon/1
  3571 root     5500K 3500K sleep   56    0   0:00:18 0.0% nscd/30
  3675 root     2120K 1324K sleep    1    0   0:00:00 0.0% sac/1
  3683 root     4076K 2624K sleep   56    0   0:00:00 0.0% inetd/3
  3676 root     3276K 2352K sleep   59    0   0:00:00 0.0% rsyslogd/5
  3681 root     1632K  920K sleep    1    0   0:00:01 0.0% utmpd/1
  3094 root     6648K 5072K sleep   59    0   0:00:09 0.0% svc.startd/12
  3107 root     8136K 6976K sleep   56    0   0:00:18 0.0% svc.configd/14
  3674 root     1860K 1212K sleep    1    0   0:00:00 0.0% cron/1
  3056 root     2136K 1536K sleep    1    0   0:00:02 0.0% init/1
  3697 root     1932K 1372K sleep   59    0   0:00:00 0.0% ttymon/1
  3446 root     2528K 1644K sleep   59    0   0:00:00 0.0% pfexecd/3
Total: 18 processes, 81 lwps, load averages: 0.00, 0.00, 0.00

The above output shows the process ID, the user, the size of the process's virtual address space (SIZE), the resident set size (RSS, the portion currently in memory), the state (the CPU on which the process is running, or "sleep" if the process is waiting), the scheduling priority (higher values run before lower ones), the nice value, the cumulative CPU time used by the process, the percentage of CPU the process is using, the process "name", and the number of threads (lwps) in the process. For instance, nscd, process ID 3571, has 30 threads.

Unlike top, prstat has few interactive commands. Typing 'q' causes prstat to quit. Other input causes it to refresh the display. Also, you can specify an interval (called "delay" in top) of 0 seconds, which causes prstat to continuously update the output (Try prstat 0 to see this). The default interval is 5 seconds.

prstat is useful not only for determining which processes and threads are using memory and cpu resources on your system; you can also use it to focus on specific users (-u), processes (-p), zones (-z, or -Z for a per-zone summary), and tasks (-k). (Tasks on SmartOS are not the same as processes. See newtask(1) for details.)

Like top, prstat redraws the screen every interval. To get "continuous" output instead, with new reports printed below previous ones, use the -c option. Unlike top, prstat can also report microstate process accounting information (the -m option). This reports the percentage of time spent in user mode, in kernel mode, and sleeping, along with the number of voluntary and involuntary context switches, the number of system calls, and the number of signals received. In addition, it reports the time spent processing traps, text page faults, and data page faults, waiting for user-level locks, and waiting for a CPU (shown as LAT).

Let's look at an example.

# prstat -c
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
 76798 root       10M 9380K sleep   59    0   0:05:05 3.2% tserver/1001
 16145 103        56M   46M sleep   59    0   0:43:11 0.0% beam.smp/51
    95 root        0K    0K sleep   99  -20   0:47:33 0.0% zpool-zones/162
  5879 root     9680K 4976K sleep   57    0   0:23:24 0.0% zoneadmd/5
 16790 nobody     58M   36M sleep   59    0   0:11:53 0.0% node/4
 16610 nobody     61M   42M sleep   59    0   0:22:39 0.0% node/3
 17119 907        45M   24M sleep   59    0   0:14:41 0.0% node/4
 15050 root       58M   39M sleep   59    0   0:11:56 0.0% node/4
 15119 nobody    141M  104M sleep    1    0   0:25:05 0.0% node/4
  3570 root       26M   18M sleep  100    -   0:05:40 0.0% node/6
 14710 root       22M   17M sleep    1    0   0:01:26 0.0% node/3
 14489 root       43M   20M sleep    1    0   0:03:30 0.0% node/5
 14661 root       45M   23M sleep   59    0   0:03:25 0.0% node/5
 14107 root       18M   14M sleep   58    0   0:06:32 0.0% node/6
 14022 root       22M   17M sleep    1    0   0:01:35 0.0% node/3
Total: 462 processes, 3132 lwps, load averages: 0.91, 0.75, 0.50
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
 76798 root       10M 9380K sleep   59    0   0:05:08 3.4% tserver/1001
    95 root        0K    0K sleep   99  -20   0:47:33 0.0% zpool-zones/162
 16145 103        56M   46M sleep   59    0   0:43:11 0.0% beam.smp/51
 16610 nobody     62M   43M sleep   59    0   0:22:39 0.0% node/3
 16790 nobody     58M   36M sleep   59    0   0:11:53 0.0% node/4
 17119 907        45M   24M sleep   59    0   0:14:41 0.0% node/4
 15050 root       58M   39M sleep   59    0   0:11:56 0.0% node/4
  5879 root     9680K 4976K sleep   57    0   0:23:24 0.0% zoneadmd/5
 15119 nobody    141M  104M sleep    1    0   0:25:05 0.0% node/4
  3180 root       53M   30M sleep   59    0   0:11:54 0.0% node/7
 14489 root       43M   20M sleep    1    0   0:03:30 0.0% node/5
  3570 root       26M   18M sleep  100    -   0:05:40 0.0% node/6
 16055 root       45M   24M sleep    1    0   0:03:26 0.0% node/5
 14107 root       18M   14M sleep   58    0   0:06:32 0.0% node/6
 14022 root       22M   17M sleep    1    0   0:01:35 0.0% node/3
Total: 462 processes, 3132 lwps, load averages: 0.91, 0.75, 0.50
...

From the above output, the tserver process is using 3.2% of the cpu. tserver is a multithreaded process (1001 threads) that accepts connection requests and immediately closes each connection. It was written specifically for this blog, along with a client that runs on one or more other machines. Let's take a look at the microstate information, adding -m for microstate accounting, -L to report each lwp (thread) separately, and -p to examine only the tserver process. (There is a detailed description of microstate accounting in msacct.c.)

# prstat -cmLp 76798
Please wait...
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 76798 root     7.8 0.7 0.0 0.0 0.0 3.8  88 0.1  1K 124  8K   0 tserver/1
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 131   0 133   0 tserver/661
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 103   0 104   0 tserver/271
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 108   0 109   0 tserver/460
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 107   0 108   0 tserver/374
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 182   0 185   0 tserver/487
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 132   0 133   0 tserver/526
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0  84   0  84   0 tserver/663
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 102   0 103   0 tserver/604
 76798 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 160   0 163   0 tserver/710
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0  95   0  96   0 tserver/229
 76798 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0  99   0 100   0 tserver/202
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0  91   0  91   0 tserver/389
 76798 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0  89   0  89   0 tserver/288
 76798 root     0.0 0.0 0.0 0.0 0.0 100 0.0 0.0 129   0 131   0 tserver/507
Total: 1 processes, 1001 lwps, load averages: 1.08, 1.07, 0.96
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID 
 76798 root      12 0.7 0.0 0.0 0.0 6.9  80 0.1  2K 212  8K   0 tserver/1
 76798 root     0.2 0.0 0.0 0.0 0.0 100 0.0 0.0 121   0 121   0 tserver/487
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 121   0 121   0 tserver/712
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 121   0 121   0 tserver/460
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 118   0 118   0 tserver/661
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 122   1 122   0 tserver/374
 76798 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 318   3 356   0 tserver/788
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 120   0 120   0 tserver/670
 76798 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 383   0 408   0 tserver/682
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 120   0 120   0 tserver/710
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 118   1 119   0 tserver/648
 76798 root     0.0 0.1 0.0 0.0 0.0 100 0.0 0.0 337   6 353   0 tserver/542
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 117   1 118   0 tserver/271
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 113   1 114   0 tserver/202
 76798 root     0.1 0.0 0.0 0.0 0.0 100 0.0 0.0 107   0 108   0 tserver/518
Total: 1 processes, 1001 lwps, load averages: 1.09, 1.07, 0.96
...

From the above output, lwpid 1 is spending 12% of its time in user mode (USR) and 0.7% in kernel mode (SYS). It is not processing traps, text page faults, or data page faults (TRP, TFL, and DFL respectively). Lwpid 1 is waiting on user-level locks 6.9% of the time and sleeping 80% of the time (LCK and SLP), while the other lwps (i.e., threads) in the process are spending essentially all of their time waiting on user-level locks. There appears to be almost no contention for the cpus on the system (LAT is 0, except for 0.1% for lwpid 1). Lwpid 1 has made ~2000 voluntary context switches; the others made between ~100 and ~400 (VCX). Voluntary context switches occur when a thread gives up the cpu because it cannot do further work (for instance, it is waiting for a lock or for I/O). Involuntary context switches occur when a thread is preempted or has used up its time slice; these are close to 0 for every thread except lwpid 1 (ICX). Each thread is making a few hundred system calls, except for lwpid 1, which is making ~8000 (SCL). No thread is handling any signals (SIG).

Perhaps the most interesting thing to note in the output is that, except for lwpid 1, the threads (lwpids) shown are different in each interval.

Let's use DTrace to see scheduling activity for the process's threads.

# dtrace -n 'sched:::on-cpu/execname == "tserver"/{@[tid]=count();} tick-1sec{printa(@); clear(@);}'
dtrace: description 'sched:::on-cpu' matched 4 probes
CPU     ID                    FUNCTION:NAME
  8  67387                       :tick-1sec
...  <-- almost 1000 lines omitted
      498              126
      271              128
      799              128
      458              132
       78              133
      482              136
      644              143
      206              158
      931              162
        1             2236

From this, every thread in the process gets on cpu during each 1-second interval (the total number of lines of output every second is 1001, the total number of threads), yet prstat shows almost all of these threads spending 100% of their time waiting for user-level locks.

As mentioned above, tserver accepts connection requests and immediately closes the connection. Let's see how many accept(3SOCKET) calls it handles per second.

# dtrace -qn 'syscall::accept:return/pid == 76798 && arg0 >= 0/{@ = count();} tick-1sec{printa(@); clear(@);}'

              351

              765

              729

              747

              587

             1164

              737

              839

             1074

              554
...

When tserver accepts a connection, the main thread (lwpid 1) places the new connection on a job queue and wakes up a worker thread to handle it (again, handling the connection simply means closing it). The "worker" threads are the other 1000 threads; each one removes jobs from the queue, closing each connection as it does so. Let's see how many close(2) calls are made per worker thread.
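
Before counting them, here is a rough sketch of the worker side of this pattern. This is not the actual tserver source (that is linked at the end of this post); the job_t structure and the names conn_q, q_lock, q_cv, and worker are made up for illustration. It uses the illumos libc synchronization routines (mutex_lock, cond_wait, and friends) that will show up in the pid provider output later in the post.

#include <synch.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct job {
    int fd;                             /* accepted connection */
    struct job *next;
} job_t;

static job_t *conn_q;                   /* head of the job queue */
static mutex_t q_lock = DEFAULTMUTEX;   /* protects conn_q */
static cond_t q_cv = DEFAULTCV;         /* waited on when the queue is empty */

/* Each of the 1000 worker threads runs this loop. */
void *
worker(void *arg)
{
    for (;;) {
        mutex_lock(&q_lock);
        while (conn_q == NULL)
            cond_wait(&q_cv, &q_lock);  /* blocked here == LCK time in prstat -m */
        job_t *j = conn_q;              /* take one job off the queue */
        conn_q = j->next;
        mutex_unlock(&q_lock);

        (void) close(j->fd);            /* "handling" the connection is just closing it */
        free(j);
    }
    /* NOTREACHED */
    return (NULL);
}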

# dtrace -qn 'syscall::close:return/pid == 76798 && arg0 >= 0/{@[tid] = count();} tick-1sec{printa(@); trunc(@);}'

        3                1
        6                1
       17                1
       24                1
       25                1
...
      957               12
      270               19
      778               23
      493               41
      411               52

Again, almost all of the threads call close at least once. Let's use the pid provider to take a closer look at how threads are being woken up.

# dtrace -qn 'pid76798:::entry/tid == 1/{@[probefunc] = count();} tick-1sec{printa(@); trunc(@);}'

  __lwp_park                                                        5
...
  setsockopt                                                      560
  __so_accept                                                     561
  _so_accept                                                      561
  accept                                                          561
  cond_broadcast                                                  561
...
  enqueue                                                       19494
  queue_unlink                                                  19765

Every time an accept call is made, the main thread uses cond_broadcast to do the wakeup. This causes all of the worker threads to wake up, even though only one of them can handle the new connection. It is a classic case of the thundering herd. We'll change the code to wake up only one worker (using cond_signal) to see how that affects the behavior.
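
In sketch form, the difference between the two versions amounts to one line in the accept loop. Again, this is not the actual tserver code; it reuses the made-up job_t, conn_q, q_lock, and q_cv declarations (and headers) from the worker sketch above.

#include <sys/socket.h>

/* Main thread (lwpid 1): accept connections and queue them for the workers. */
void
accept_loop(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            continue;

        job_t *j = malloc(sizeof (job_t));
        if (j == NULL) {
            (void) close(fd);
            continue;
        }
        j->fd = fd;

        mutex_lock(&q_lock);
        j->next = conn_q;
        conn_q = j;
        /*
         * Broadcast version (tserver): wake *all* 1000 workers, even
         * though only one of them will find a job on the queue. This
         * is the thundering herd seen in the DTrace output above.
         */
        cond_broadcast(&q_cv);
        /*
         * Signal version (tserver_sig): wake exactly one worker instead:
         *
         *      cond_signal(&q_cv);
         */
        mutex_unlock(&q_lock);
    }
}

Here is prstat -c output with the cond_signal version (tserver_sig) running: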

# prstat -c
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
 98575 root       10M 9380K sleep   59    0   0:00:02 0.2% tserver_sig/1001
    95 root        0K    0K sleep   99  -20   0:48:09 0.1% zpool-zones/162
 16145 103        56M   46M sleep   59    0   0:43:44 0.0% beam.smp/51
  5879 root     9680K 4976K sleep   57    0   0:24:08 0.0% zoneadmd/5
 15119 nobody    128M  100M sleep   59    0   0:25:46 0.0% node/4
 16610 nobody     68M   48M sleep   59    0   0:23:00 0.0% node/3
 17119 907        45M   25M sleep   59    0   0:14:56 0.0% node/4
 15050 root       54M   33M sleep    1    0   0:12:13 0.0% node/4
 16790 nobody     59M   38M sleep   59    0   0:12:10 0.0% node/4
  3180 root       51M   29M sleep    1    0   0:12:03 0.0% node/7
 99141 root     4540K 3432K cpu1    59    0   0:00:00 0.0% prstat/1
 14837 root       43M   20M sleep    1    0   0:03:30 0.0% node/5
 14320 root       42M   21M sleep   56    0   0:03:26 0.0% node/5
 13422 root       46M   23M sleep   59    0   0:03:29 0.0% node/5
 14107 root       18M   14M sleep   58    0   0:06:36 0.0% node/6
Total: 462 processes, 3132 lwps, load averages: 0.16, 0.41, 0.80
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP       
 98575 root       10M 9380K sleep   59    0   0:00:02 0.1% tserver_sig/1001
    95 root        0K    0K sleep   99  -20   0:48:09 0.1% zpool-zones/162
 16145 103        56M   46M sleep   59    0   0:43:44 0.0% beam.smp/51
 16610 nobody     69M   49M sleep   59    0   0:23:00 0.0% node/3
 15119 nobody    128M  100M sleep   59    0   0:25:46 0.0% node/4
  5879 root     9680K 4976K sleep   57    0   0:24:08 0.0% zoneadmd/5
 16790 nobody     59M   38M sleep   59    0   0:12:10 0.0% node/4
 17119 907        45M   25M sleep   59    0   0:14:56 0.0% node/4
 15050 root       54M   33M sleep    1    0   0:12:13 0.0% node/4
  3180 root       51M   30M sleep    1    0   0:12:03 0.0% node/7
 99141 root     4684K 3568K cpu1    59    0   0:00:00 0.0% prstat/1
 14837 root       43M   20M sleep    1    0   0:03:30 0.0% node/5
 14320 root       42M   21M sleep   56    0   0:03:26 0.0% node/5
 14107 root       18M   14M sleep   58    0   0:06:36 0.0% node/6
 14537 root       43M   22M sleep    1    0   0:03:32 0.0% node/5
Total: 462 processes, 3132 lwps, load averages: 0.15, 0.41, 0.79

The main thing to note in the above output is that the server is now using about 0.1% of the cpu, as opposed to ~3% in the broadcast version, a reduction of more than an order of magnitude in cpu usage.

The source code for the client and the servers is here. Of course, use at your own risk.
