DTracing Hardware Cache Counters

Carlos Cardenas has been working on porting a program from Linux that is CPU bound. He found some hotspots using DTrace. He asked someone what else he could do to maximize performance and was asked in turn if the program was cache friendly. Carlos went on to ask if I could provide some insight into the cpc provider in DTrace. I'll do that, concentrating on examining cache utilization.

Jon Haslam explains a bit about CPU performance counters (what the cpc provider traces) at blogs.oracle.com/jonh/entry/performance_counter_generic_events, and gives a good introduction to the cpc provider at blogs.oracle.com/jonh/entry/finally_dtrace_meets_the_cpu. There are some one-liner examples in the DTrace book (see dtracebook.com/index.php/Applications), but there are not many examples of using the provider, as Carlos mentioned.

The cpustat(1M) command can be used to find out which cpc counters are available on the system.

# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [-T d|u] [interval [count]]
...
        CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 45
...
        Generic Events:

        event0:  PAPI_tot_cyc PAPI_tot_ins
        event1:  PAPI_tot_cyc PAPI_tot_ins
        event2:  PAPI_tot_cyc PAPI_tot_ins
        event3:  PAPI_tot_cyc PAPI_tot_ins
        event4:  PAPI_tot_ins
...

But here is the same command on another system:

# cpustat -h
Usage:
        cpustat [-c events] [-p period] [-nstD] [-T d|u] [interval [count]]
...
        CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 37
...
        Generic Events:

        event0:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dca
                 PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica
                 PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm
                 PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca
                 PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw
                 PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins
                 PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im

        event1:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dca
                 PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica
                 PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm
                 PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca
                 PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw
                 PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins
                 PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im

        event2:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dcm
                 PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich
                 PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm
                 PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch
                 PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca
                 PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm
                 PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im

        event3:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dcm
                 PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich
                 PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm
                 PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch
                 PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca
                 PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm
                 PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im

        See generic_events(3CPC) for descriptions of these events
...

So, two different machines, and very different output. The first machine showed counters only for total cycles and total instructions; nothing about caches. The second machine had many more counters, including counters for cache misses, accesses, hits, and stores, for both the data and instruction caches. See generic_events(3CPC) for a description of the PAPI_xxx events.

The first machine has an Intel Sandy Bridge processor. It turns out the cpc kernel modules for this chip are missing. At Joyent, we have an internal bug filed for the missing modules.

Here is output from a third machine:

# cpustat -h
...
        CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 26
...
        event0:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int
                 PAPI_tot_iis PAPI_l1_dca PAPI_l1_dcm PAPI_l1_dcr
                 PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm
                 PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm
                 PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm
                 PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm
                 PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins
                 PAPI_tlb_dm PAPI_tlb_im

        event1:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int
                 PAPI_tot_iis PAPI_l1_dca PAPI_l1_dcm PAPI_l1_dcr
                 PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm
                 PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm
                 PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm
                 PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm
                 PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins
                 PAPI_tlb_dm PAPI_tlb_im

        event2:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int
                 PAPI_tot_iis PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw
                 PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr
                 PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm
                 PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr
                 PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins
                 PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm
                 PAPI_tlb_im

        event3:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int
                 PAPI_tot_iis PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw
                 PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr
                 PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm
                 PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr
                 PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins
                 PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm
                 PAPI_tlb_im

        event4:  PAPI_tot_ins
        event5:  PAPI_tot_cyc
        event6:

        See generic_events(3CPC) for descriptions of these events

You can identify the processor from this line in the output:

        CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 26

Note that the "Model" number is different on all 3 systems (45, 37, and 26).

We'll concentrate on the third system, specifically "Family 6 Model 26". The psrinfo command gives a bit more information.

# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (1 3 5 7 9 11 13 15)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (7 15)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU           E5530  @ 2.40GHz
The physical processor has 4 cores and 8 virtual processors (0 2 4 6 8 10 12 14)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (6 14)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU           E5530  @ 2.40GHz
#

More information on the processors can be found at: ark.intel.com/products/37103?wapkw=e5530.

There are several CPU performance counters on this machine that deal with caching. These include:

     PAPI_l1_dca     Level 1 data cache accesses
     PAPI_l1_dcm     Level 1 data cache misses
     PAPI_l1_dcr     Level 1 data cache reads
     PAPI_l1_dcw     Level 1 data cache writes
     PAPI_l1_ica     Level 1 instruction cache accesses
     PAPI_l1_ich     Level 1 instruction cache hits
     PAPI_l1_icm     Level 1 instruction cache misses
     PAPI_l1_icr     Level 1 instruction cache reads
     PAPI_l1_ldm     Level 1 cache load misses
     PAPI_l1_tcm     Level 1 cache misses
     ...

There are also counters for level 2 and level 3 caches, and for TLB utilization.
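Before turning to DTrace, note that a counter can be sampled directly with cpustat(1M). As a sketch (check cpustat(1M) for the exact eventspec syntax on your system), something like this would report level 1 instruction cache misses once per second, five times:

        # cpustat -c PAPI_l1_icm 1 5

The related cputrack(1M) command does the same for a single process. DTrace, however, lets us tie the counters to what the code is doing, so we'll use the cpc provider.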

To demonstrate the use of these counters with DTrace, we'll use them to examine instruction cache utilization for a simple program.

/* loop.c */
int
main(int argc, char *argv[])
{
    volatile int i;

    for (i = 0; i < 1000000; i++)
        asm("nop");
}

As you can see, this is a simple program that executes 1 million "nop" instructions. We'll compile it with gcc, using no optimization flags.

# gcc loop.c -o loop
#

And run it with ptime(1):

# ptime ./loop

real        0.007110594
user        0.004870653
sys         0.001894433
#

Let's take a quick look at the assembler code produced by the compiler for this program.

# gcc -S loop.c
# cat loop.s
...
main:
.LFB0:
        pushq   %rbp
.LCFI0:
        movq    %rsp, %rbp
.LCFI1:
        movl    %edi, -20(%rbp)
        movq    %rsi, -32(%rbp)
        movl    $0, -4(%rbp)
        jmp     .L2
.L3:
/APP
/ 9 "loop.c" 1
        nop
/ 0 "" 2
/NO_APP
        movl    -4(%rbp), %eax
        addl    $1, %eax
        movl    %eax, -4(%rbp)
.L2:
        movl    -4(%rbp), %eax
        cmpl    $999999, %eax
        jle     .L3
        popq    %rbp
.LCFI2:
        ret
...

So, the 7 instructions starting at the "nop" and ending at the "jle .L3" are executed 1 million times.

What if we "unwind" the loop (see wikipedia.org/wiki/Loop_unwinding ) for a description of loop unwinding (also called loop unrolling). This might be done via compiler optimization, or by hand. By hand, the code might look like:

    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    nop
    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    nop
    ...

Here, the 4 instructions (movl, addl, movl, and nop) are repeated a million times. The version with the "for" loop executes 7 million instructions; with the loop unwound, only 4 million instructions are executed.
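One possible way to produce such a source file is with a small helper program; this is an illustrative sketch (the file names genunwound.c and unwound.c are hypothetical), and compiling the generated million-statement main will take a while:

/* genunwound.c - one possible generator for unwound.c (illustrative sketch) */
#include <stdio.h>

int
main(void)
{
    FILE *fp = fopen("unwound.c", "w");
    int n;

    if (fp == NULL)
        return (1);
    /* emit a main() containing a million "increment, nop" pairs */
    fprintf(fp, "int\nmain(int argc, char *argv[])\n{\n");
    fprintf(fp, "    volatile int i = 0;\n\n");
    for (n = 0; n < 1000000; n++)
        fprintf(fp, "    i++; asm(\"nop\");\n");
    fprintf(fp, "    return (0);\n}\n");
    fclose(fp);
    return (0);
}

Compiled without optimization, the generated unwound.c should produce roughly the movl/addl/movl/nop sequence shown above for each repetition. Let's time the unwound version: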

# ptime ./unwound

real        0.015658850
user        0.008557959
sys         0.006777156

So, the looping version runs about twice as fast as the unwound version, even though it executes almost twice as many instructions. (In case you hadn't noticed, the extra instructions in the looping version are the load of i, the comparison, and the conditional branch that decide whether the loop should terminate.)

Why is the version with fewer instructions slower? We'll assume it is due to cache utilization. Most likely, the version with the for loop almost always hits in the level 1 instruction cache (its loop body is only a few dozen bytes), whereas the unrolled version, with roughly 10MB of straight-line code, frequently misses the L1 instruction cache. We'll also assume that both programs fit into the ZFS ARC (the loop version is 10,384 bytes, and the unwound version is 10,010,368 bytes). If we were missing in the ARC, the difference in times would be significantly greater than 200%.

Let's DTrace instruction cache misses for the two versions.

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m = count();} END{printa("misses = %@d\n", @m);}' -c "./loop"
misses = 1
#

The PAPI_l1_icm-user-5000 probe fires once for every 5000 level 1 instruction cache misses that occur while executing user-level code. Let's do the same for the unwound version.

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m = count();} END{printa("misses = %@d\n", @m);}' -c "./unwound"
misses = 61
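For anything longer than a one-liner, the same enabling can be written as a D script. Here is a sketch (icm.d is a hypothetical file name; run it with dtrace -qs icm.d -c ./loop):

/* icm.d - count L1 instruction cache miss probe firings in the target process */

cpc:::PAPI_l1_icm-user-5000
/pid == $target/
{
        @m = count();
}

END
{
        printa("misses = %@d\n", @m);
}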

You can experiment with the other counters. The PAPI_br_cn probe fires much more often for the loop version, as should be expected (this counter counts conditional branches).
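A sketch of checking that, using the same 5000 overflow value as above (run it against both ./loop and ./unwound and compare the counts):

        # dtrace -q -n 'cpc:::PAPI_br_cn-user-5000/pid == $target/{@b = count();} END{printa("branch probe firings = %@d\n", @b);}' -c "./loop"

Total instruction counts can be compared the same way with the PAPI_tot_ins event, preferably with a larger overflow value (for example, cpc:::PAPI_tot_ins-user-1000000) so the probe doesn't fire too often.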

What if we want to see where in the code the misses are occurring?

As with the profile provider, arg0 is the kernel program counter and arg1 is the user-level program counter at the time the probe fires. We could also aggregate using ustack() as a key.
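A sketch of the ustack() variant (same probe and predicate, aggregating on the user stack instead of an address):

        # dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@[ustack()] = count();}' -c "./unwound"

To see the individual addresses where the misses are being sampled, we can aggregate on arg1: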

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[arg1] = count();} END{printa("misses at %p = %@d\n", @m);}' -c "./unwound"
misses at 415a67 = 1
misses at 43cc43 = 1
misses at 462547 = 1
misses at 48a9a7 = 1
...
misses at d3bdcb = 1
misses at d63360 = 1
misses at d89244 = 1
#

Using mdb, we can find where we are in the code:

# mdb unwound
> d63360/ai
main+0x962498:  nop
>
# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[arg1] = count();} END{printa("misses at %p = %@d\n", @m);}' -c "./loop"
misses at fffffd7fff3b0441 = 1
#
# mdb loop
> fffffd7fff3b0441/i
mdb: failed to read instruction at 0xfffffd7fff3b0441: no mapping for address
>
> main::bp
> :r
mdb: stop at main
mdb: target stopped at:
main:           pushq  %rbp
> fffffd7fff3b0441/i
ld.so.1`elf_rtbndr+1:           movq   %rsp,%rbp
>

So the misses in the unwound version land on nops in main, while the single sampled miss in the loop version is in the run-time linker (ld.so.1`elf_rtbndr), not in the loop itself.

Or use the ufunc() function to find the routine.

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[ufunc(arg1)] = count();}' -c "./unwound"

  unwound`main                                                     61
#

Can the loop be made faster? Probably, with loop unwinding; but instead of unwinding the entire loop, we can unwind just a section of it:

/* fastloop.c */
int
main(int argc, char *argv[])
{
    volatile int i;

    for (i = 0; i < (1000000/10); i++) {
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
    }
}

Compiling and running this results in:

# ptime ./fastloop

real        0.003061182
user        0.000777798
sys         0.001979295

This version is over twice as fast as the original loop version. It does one tenth as many loop iterations, but with 10 times as many nops per iteration. What if we reduced the iteration count by a factor of 100 instead? That is left as an experiment. At some point, depending on the cache line size, we'll start to see performance drop again due to cache misses.

Cache line information can be retrieved using prtpicl:

# prtpicl -v | egrep -i 'cache|tlb'
              :dtlb0-2M-4M-associativity        0x4
              :dtlb0-2M-4M-size                 0x20
              :dtlb-4K-associativity            0x4
              :dtlb-4K-size                     0x40
              :itlb-2M-4M-associativity         0
              :itlb-2M-4M-size                  0x7
              :l3-cache-associativity           0x10
              :l3-cache-line-size               0x40
              :l3-cache-size                    0x800000
              :itlb-4K-associativity            0x4
              :itlb-4K-size                     0x40
              :l1-dcache-associativity          0x8
              :l1-dcache-line-size              0x40
              :l1-dcache-size                   0x8000
              :shared-l2-tlb-4k-associativity   0x4
              :shared-l2-tlb-4k-size            0x200
...

The exact output will depend on the processor you are using. On this machine, the L1 data cache has a line size of 0x40 (64 decimal) bytes. The nop instruction is a 1-byte instruction, so 64 nops could fit in one cache line. Of course, there is no guarantee that the nops are aligned in the code such that the first nop is at the beginning of a line. Regardless, 64 nops per iteration might be a good starting point for getting the best cache performance (adjusting the loop count accordingly). If the loop contained other code, you could use mdb to get an idea of how much loop unwinding might help: disassemble the loop and look at the addresses where the loop begins and ends. (Left as an exercise...)
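As a sketch of that starting point (fastloop64.c is a hypothetical variant; 1000000/64 is exactly 15625, so the total nop count stays at 1 million):

/* fastloop64.c - 64 nops per iteration, a sketch of the "64 nops" starting point */
#define NOP8    asm("nop"); asm("nop"); asm("nop"); asm("nop"); \
                asm("nop"); asm("nop"); asm("nop"); asm("nop");
#define NOP64   NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8 NOP8

int
main(int argc, char *argv[])
{
    volatile int i;

    for (i = 0; i < (1000000/64); i++) {
        NOP64
    }
}

Compile it the same way and compare its ptime output and its PAPI_l1_icm probe counts to the earlier versions.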

I should also mention that using the cpc counters for performance analysis should be done only after you have exhausted the other possibilities. That means the disk and network are not involved, the process is compute bound and not waiting for or spinning on mutexes or waiting for other threads, and you have already done normal profiling of your application.
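As an example of that normal profiling, a quick user-level profile with DTrace's profile provider might look like this sketch (sampling at 997Hz and aggregating on user stacks; ./yourprogram is a placeholder):

        # dtrace -q -n 'profile-997/pid == $target/{@[ustack()] = count();}' -c "./yourprogram"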

We are opening this and future blogs to comments, so hopefully I'll hear what people think, and I can answer questions.




Post written by Mr. Max Bruning