A 2000x Performance Win with DTrace Analytics

Note: This post originally appeared on the DTrace blog of @brendangregg

    

I recently helped analyze a performance issue in an unexpected but common place, where the fix improved performance of a task by around 2000x (two thousand times faster). As this is short, interesting and useful, I’ve reproduced it here in a lab environment to share details and screenshots.

Issue

In a production SmartOS cloud environment, a script is used to count entries in an active log. This script is executed from cron(1M), and the grep(1) command it uses is running very slowly indeed:

# time grep done log | wc -l1492751real    8m56.062suser    8m12.275ssys     0m0.218s# ls -l log-rw-r--r-- 1 root root 18M 2011-12-08 21:59 log

That took almost nine minutes. The grep(1) command is processing this file at about 34 Kbytes/sec, which seems awfully slow. Dropping the “wc -l” and using “grep -c” instead didn’t make much difference.

Hypothesis

I have a method for attacking performance issues (which I’ll explain in a later article); it starts with checking for errors, because they do cause performance issues and are a fast area to check.

The slow time sounds like it could be disk I/O related: either slow reads caused by disk errors, or random disk reads. I first checked the error counts:

# iostat -En | grep Hardsd0              Soft Errors: 0 Hard Errors: 0 Transport Errors: 0sd1              Soft Errors: 0 Hard Errors: 0 Transport Errors: 0

Which were fine, and then for random reads:

# iostat -xnz 1            extended device statisticsr/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device0.2    0.3    1.3    1.0  0.0  0.0    0.0    0.0   0   0 ramdisk112.3   81.6  355.4 3005.9  0.0  0.1    0.0    0.7   0   2 sd1            extended device statisticsr/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device0.0    7.0    0.0  245.7  0.0  0.0    0.0    0.1   0   0 sd1            extended device statisticsr/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device            extended device statisticsr/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device0.0    4.0    0.0   59.5  0.0  0.0    0.0    0.1   0   0 sd1[...]

Which are not present. This doesn’t look disk related.

Analysis

A great way to see what threads are doing is with “prstat -mLc 1″, which uses thread microstate accounting (only available on Solaris-related systems):

# prstat -mLc 1[...]PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID14566 root     100 0.0 0.0 0.0 0.0 0.0 0.0 0.0   0  13   0   0 app1/112740 root      98 0.0 0.0 0.0 0.0 0.0 0.0 2.1   1  26   8   0 grep/118998 root     3.0  12 0.0 0.0 0.0 0.0  85 0.0  45   0  5K   0 app2/117347 root     0.1 0.2 0.0 0.0 0.0 0.0 100 0.0  16   0 170   0 prstat/129693 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0  13   0 128   0 sshd/112741 root     0.0 0.0 0.0 0.0 0.0 0.0 100 0.0   7   0   7   0 wc/1[...]

The output shows grep(1) spending 98% of its time in USR (running in user-land), and 2.1% in LAT (CPU dispatcher queue latency – ready to run). There is no high percentage of SLP (sleep) time for disk I/O, which we already suspected from using iostat(1) earlier. (This lab system is also running a couple of dummy applications to simulate production load.)

Hot CPU issues are some of the easiest to solve, usually through profiling. Here, I’ve used DTrace to sample grep(1) user-land stack traces at 997 Hertz. I ran this one-liner for several seconds, then hit Ctrl-C to stop sampling and print the report:

# dtrace -n 'profile-997 /execname == "grep"/ { @[ustack()] = count(); }'dtrace: description 'profile-997 ' matched 1 probe^C[...]      libc.so.1`_UTF8_mbrtowc      grep`check_multibyte_string+0x85      grep`EGexecute+0xa6      grep`grepbuf+0xf2      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c      406      libc.so.1`mbrtowc+0x1d      grep`EGexecute+0xa6      grep`grepbuf+0xf2      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c      498      grep`check_multibyte_string+0xd5      grep`EGexecute+0xa6      grep`grepbuf+0xf2      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c     1023

This shows codepaths that were sampled. At the top is the on-CPU function, then its ancestory. The number is the count of samples for that stack trace.

The hot functions included check_multibyte_string(), mbrtowc() and _UTF8_mbrtowc(). check_multibyte_string() is declared in grep(1) (prefixed with “grep`”), and mbrtowc() is from libc. I’ll start with mbrtowc() – given that it’s in libc, it probably has a man page.

# man mbrtowcmbrtowc(3C)              Standard C Library Functions              mbrtowc(3C)NAMEmbrtowc - convert a character to a wide-character code (restartable)[...]

Wide character conversion?

Hypothesis #2

The team I was working with suggested this could be locale related:

# echo $LANGen_US.UTF-8

I’d also punched “grep mbrtowc” into google and found grep – Bugs: bug #14472, grep is slow in multibyte locales [Savannah]“. But that’s the GNU version of grep(1), which we didn’t think we were using.

# which grep/opt/local/bin/grep# grep --versionGNU grep 2.5.3[...]

Ah. So we are using GNU grep(1), and we have LANG set to en_US.UTF-8.

Experiment & Fix

# time LANG=C grep done log | wc -l1492751real    0m0.304suser    0m0.361ssys     0m0.057s

That’s over 1700x faster.

And using the SmartOS grep(1):

# time /usr/bin/grep done log | wc -l1492751real    0m0.268suser    0m0.318ssys     0m0.059s

That’s 2000x faster.

On an idle system, these improvements are between 100x – 200x (still huge). It seems that with production load, the long running grep(1) process takes even longer than usual. I’d suspect this was due in part to thread priorities and the kernel scheduler, giving the longer-running grep(1) a lower priority and less CPU time per-second than the short-lived grep(1). (If I needed to confirm this, I can examine the kernel scheduling events and priorities using DTrace.)

Cause

Both PATH and LANG had changed in the new systems, causing the GNU grep to be executed with a non-C locale. The environment was set is these locations:

# grep LANG= /etc/default/initLANG=en_US.UTF-8# grep PATH= ~/.profile MANPATH=/opt/local/man:/usr/share/manPATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin# grep PATH= /etc/default/cronPATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin:/usr/sfw/binSUPATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/usr/sbin:/usr/sfw/bin

Happy grep(1)

In case you are curious, here’s what GNU grep is doing when LANG=C, also profiled with DTrace:

[...]      grep`prtext+0x230      grep`grepbuf+0x51      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c        7      grep`EGexecute+0x4e5      grep`grepbuf+0xf2      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c        7      libc.so.1`mutex_lock_impl+0xf8      libc.so.1`mutex_lock+0xc      libc.so.1`cancel_safe_mutex_lock+0x9      libc.so.1`_flockget+0x1b      libc.so.1`fwrite+0x67      grep`prline+0x39a      grep`prtext+0x21d      grep`grepbuf+0x51      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c       15      libc.so.1`clear_lockbyte+0x2c      libc.so.1`mutex_unlock_queue+0x43      libc.so.1`mutex_unlock+0x12c      libc.so.1`cancel_safe_mutex_unlock+0x1f      libc.so.1`fwrite+0xa2      grep`prline+0x39a      grep`prtext+0x21d      grep`grepbuf+0x51      grep`grepfile+0x4f3      grep`main+0x1157      grep`_start+0x6c       18

Much lower sample counts (since it ended faster), and no mbrtowc(). The hottest two stacks show lock activity (which is where I’d start if I wanted to improve GNU grep(1) performance further).

A bit faster

Using LANG=C and grep(1)’s inbuilt “-c” to do counting instead of “wc -l”, makes it run even faster. Usually faster than 0.2s, and as fast as 0.12s:

# time LANG=C grep -c done log1492751real    0m0.120suser    0m0.110ssys     0m0.010s

Over 4000x.

Conclusion

Picking a non-C locale can hurt performance – something that has been known for many years. In this case, a GNU grep(1) bug inflated the translation overhead to slow down performance by a huge degree: up to 2000x. In short:

Leave LANG=C.

Aim DTrace at everything – even grep(1).



Post written by alexsalkever