Bruning Questions: DTracing Hardware Cache Counters

March 01, 2013 - by Mr. Max Bruning


Carlos Cardenas has been working on porting a CPU-bound program from Linux. He found some hotspots using DTrace, asked someone what else he could do to maximize performance, and was asked in turn whether the program was cache-friendly. Carlos went on to ask if I could provide some insight into the cpc provider in DTrace. I'll do that here, concentrating on examining cache utilization.

Jon Haslam explains a bit about CPU performance counters (what the cpc provider traces) at https://blogs.oracle.com/jonh/entry/performance_counter_generic_events, and also gives a good introduction to the cpc provider at https://blogs.oracle.com/jonh/entry/finally_dtrace_meets_the_cpu. There are some one-liner examples in the DTrace book (see http://www.dtracebook.com/index.php/Applications), but there are not many examples of using the provider, as Carlos mentions.

The cpustat(1M) command can be used to find out which cpc counters are available on the system.

# cpustat -h
Usage:
   cpustat [-c events] [-p period] [-nstD] [-T d|u] [interval [count]]
...
      CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 45
...
       Generic Events:

     event0:  PAPI_tot_cyc PAPI_tot_ins 

     event1:  PAPI_tot_cyc PAPI_tot_ins 

     event2:  PAPI_tot_cyc PAPI_tot_ins 

     event3:  PAPI_tot_cyc PAPI_tot_ins 

     event4:  PAPI_tot_ins 
...

But here is the same command on another system:

# cpustat -h
Usage:
 cpustat [-c events] [-p period] [-nstD] [-T d|u] [interval [count]]
...
      CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 37
...
       Generic Events:

        event0:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dca 
                 PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica 
                 PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm 
                 PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca 
                 PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw 
                 PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins 
                 PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im 

        event1:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dca 
                 PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica 
                 PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm 
                 PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca 
                 PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw 
                 PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins 
                 PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im 

        event2:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dcm 
                 PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich 
                 PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm 
                 PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch 
                 PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca 
                 PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm 
                 PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im 

        event3:  PAPI_br_cn PAPI_hw_int PAPI_tot_iis PAPI_l1_dcm 
                 PAPI_l1_dcr PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich 
                 PAPI_l1_icm PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm 
                 PAPI_l2_ldm PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch 
                 PAPI_l2_tcm PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca 
                 PAPI_l3_tcm PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm 
                 PAPI_sr_ins PAPI_tlb_dm PAPI_tlb_im 

    See generic_events(3CPC) for descriptions of these events
...

So, two different machines, and very different output. The first machine showed counters only for total cycles and total instructions, nothing about caches. The second machine had many more counters, including counters for cache misses, cache accesses, cache hits, and cache stores, for both data and instruction caches. See generic_events(3CPC) for a description of the PAPI_xxx events.

The first machine has an Intel Sandy Bridge processor. It turns out the cpc kernel modules for this chip are missing. At Joyent, we have an internal bug filed for the missing modules.

Here is output from a third machine:

# cpustat -h
...
      CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 26
...
     event0:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int 
                 PAPI_tot_iis PAPI_l1_dca PAPI_l1_dcm PAPI_l1_dcr 
                 PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm 
                 PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm 
                 PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm 
                 PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm 
                 PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins 
                 PAPI_tlb_dm PAPI_tlb_im 

     event1:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int 
                 PAPI_tot_iis PAPI_l1_dca PAPI_l1_dcm PAPI_l1_dcr 
                 PAPI_l1_dcw PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm 
                 PAPI_l1_icr PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm 
                 PAPI_l2_stm PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm 
                 PAPI_l2_tcr PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm 
                 PAPI_ld_ins PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins 
                 PAPI_tlb_dm PAPI_tlb_im 

     event2:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int 
                 PAPI_tot_iis PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw 
                 PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr 
                 PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm 
                 PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr 
                 PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins 
                 PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm 
                 PAPI_tlb_im 

     event3:  PAPI_tot_cyc PAPI_tot_ins PAPI_br_cn PAPI_hw_int 
                 PAPI_tot_iis PAPI_l1_dcm PAPI_l1_dcr PAPI_l1_dcw 
                 PAPI_l1_ica PAPI_l1_ich PAPI_l1_icm PAPI_l1_icr 
                 PAPI_l1_ldm PAPI_l1_tcm PAPI_l2_ldm PAPI_l2_stm 
                 PAPI_l2_tca PAPI_l2_tch PAPI_l2_tcm PAPI_l2_tcr 
                 PAPI_l2_tcw PAPI_l3_tca PAPI_l3_tcm PAPI_ld_ins 
                 PAPI_lst_ins PAPI_prf_dm PAPI_sr_ins PAPI_tlb_dm 
                 PAPI_tlb_im 

     event4:  PAPI_tot_ins 

     event5:  PAPI_tot_cyc 

     event6:  

       See generic_events(3CPC) for descriptions of these events

You can see from the output, specifically the line:

CPU performance counter interface: Intel Arch PerfMon v3 on Family 6 Model 26

that the "Model" number is different on each of the 3 systems (45, 37, and 26).

We'll concentrate on the third system, specifically "Family 6 Model 26". The psrinfo command gives a bit more information.

# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (1 3 5 7 9 11 13 15)
  The core has 2 virtual processors (1 9)
  The core has 2 virtual processors (3 11)
  The core has 2 virtual processors (5 13)
  The core has 2 virtual processors (7 15)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU           E5530  @ 2.40GHz
The physical processor has 4 cores and 8 virtual processors (0 2 4 6 8 10 12 14)
  The core has 2 virtual processors (0 8)
  The core has 2 virtual processors (2 10)
  The core has 2 virtual processors (4 12)
  The core has 2 virtual processors (6 14)
    x86 (GenuineIntel 106A5 family 6 model 26 step 5 clock 2400 MHz)
      Intel(r) Xeon(r) CPU           E5530  @ 2.40GHz
#

More information on the processors can be found at: http://ark.intel.com/products/37103?wapkw=e5530.

There are several CPU performance counters on this machine that deal with caching. These are:

     PAPI_l1_dca     Level 1 data cache accesses

     PAPI_l1_dcm     Level 1 data cache misses

     PAPI_l1_dcr     Level 1 data cache reads

     PAPI_l1_dcw     Level 1 data cache writes

     PAPI_l1_ica     Level 1 instruction cache accesses

     PAPI_l1_ich     Level 1 instruction cache hits

     PAPI_l1_icm     Level 1 instruction cache misses

     PAPI_l1_icr     Level 1 instruction cache reads

     PAPI_l1_ldm     Level 1 cache load misses

     PAPI_l1_tcm     Level 1 cache misses
...

There are also counters for the level 2 and level 3 caches, and for TLB utilization.

To demonstrate the use of these counters with DTrace, we'll use them to examine instruction cache utilization for a simple program.

/* loop.c */
int
main(int argc, char *argv[])
{
    volatile int i;

    for (i = 0; i < 1000000; i++)
        asm("nop");
}

As you can see, this is a simple program which executes 1 million "nop" instructions. We'll compile this with gcc and with no optimization flags.

# gcc loop.c -o loop
#

And run it with ptime(1):

# ptime ./loop

real        0.007110594
user        0.004870653
sys         0.001894433
#

Let's take a quick look at the assembler code produced by the compiler for this program.

# gcc -S loop.c
# cat loop.s
...
main:
.LFB0:
    pushq    %rbp
.LCFI0:
    movq    %rsp, %rbp
.LCFI1:
    movl    %edi, -20(%rbp)
    movq    %rsi, -32(%rbp)
    movl    $0, -4(%rbp)
    jmp    .L2
.L3:
/APP
/ 9 "loop.c" 1
    nop

/ 0 "" 2
/NO_APP
    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
.L2:
    movl    -4(%rbp), %eax
    cmpl    $999999, %eax
    jle    .L3
    popq    %rbp
.LCFI2:
    ret
...

So, the 7 instructions starting at the "nop" and ending at the "jle .L3" are executed 1 million times.

What if we "unwind" the loop? (See http://www.wikipedia.org/wiki/Loop_unwinding for a description of loop unwinding, also called loop unrolling.) This might be done via compiler optimization, or by hand. By hand, the code might look like:

    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    nop
    movl    -4(%rbp), %eax
    addl    $1, %eax
    movl    %eax, -4(%rbp)
    nop
...

Where the 4 instructions (movl, addl, movl, and nop) are repeated a million times. The version with the "for" loop executes 7 million instructions. With the loop unwound, there are 4 million instructions executed. Let's time the unwound version.

# ptime ./unwound

real        0.015658850
user        0.008557959
sys         0.006777156

So, the looping version runs about twice as fast as the non-looping version, yet executes almost twice as many instructions. (In case you hadn't noticed, the extra instructions are the comparison and branch that check whether the loop should terminate.)

Why is the version with fewer instructions slower? We'll assume that it is due to cache utilization. Most likely, the version with the for loop hits the level 1 instruction cache almost every time, whereas the unrolled version probably misses the L1 instruction cache frequently (its roughly 10MB of code is far larger than any L1 instruction cache, while the small loop body fits easily). We'll also assume that both programs fit into the ZFS ARC (the loop version is 10384 bytes in size, and the unwound version is 10010368 bytes). If we were missing the ARC, the difference in times would be significantly greater than 200%.

Let's DTrace instruction cache misses for the two versions.

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m = count();} END{printa("misses = %@d\n", @m);}' -c "./loop"
misses = 1

#

The PAPI_l1_icm-user-5000 probe fires once for every 5000 level 1 instruction cache misses that occur while executing user-level code. Let's do the same for the unwound version.

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m = count();} END{printa("misses = %@d\n", @m);}' -c "./unwound"
misses = 61

You can experiment with the other counters. The PAPI_br_cn counter fires much more often in the loop version, as should be expected (this is a counter for conditional branches).
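The probe name is simply the event name from the cpustat output, the mode in which to count (user, kernel, or all), and the overflow count, so any of the generic events can be traced the same way. Here is a minimal sketch that counts conditional-branch overflows in the loop version (the reported count will vary from run to run):

# dtrace -q -n 'cpc:::PAPI_br_cn-user-5000/pid == $target/{@m = count();} END{printa("branch overflows = %@d\n", @m);}' -c "./loop"

Running the same command against ./unwound should show far fewer overflows, since the unwound version has no loop branch to take.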

What if we want to see where in the code the cache misses are occurring?

As with the profile provider, arg0 is the kernel program counter and arg1 is the user-level program counter. We could also aggregate using ustack() as a key (see the sketch after the ufunc() example below).

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[arg1] = count();} END{printa("misses at %p = %@d\n", @m);}' -c "./unwound"
misses at 415a67 = 1
misses at 43cc43 = 1
misses at 462547 = 1
misses at 48a9a7 = 1
...
misses at d3bdcb = 1
misses at d63360 = 1
misses at d89244 = 1
#

Using mdb, we can find where we are in the code:

# mdb unwound
> d63360/ai
main+0x962498:  nop    
>

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[arg1] = count();} END{printa("misses at %p = %@d\n", @m);}' -c "./loop"
misses at fffffd7fff3b0441 = 1
#

# mdb loop
> fffffd7fff3b0441/i
mdb: failed to read instruction at 0xfffffd7fff3b0441: no mapping for address
>
> main::bp
> :r
mdb: stop at main
mdb: target stopped at:
main:           pushq  %rbp
> fffffd7fff3b0441/i
ld.so.1`elf_rtbndr+1:           movq   %rsp,%rbp
>

Or use DTrace's ufunc() function to find the routine:

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[ufunc(arg1)] = count();}' -c "./unwound"
  unwound`main                                                      61
#
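As mentioned earlier, we could instead aggregate on ustack() to see the full user stack leading to each miss. A minimal sketch (output not shown; the stacks for these tiny test programs are not very interesting, but for a real application they can point at the offending code path):

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m[ustack()] = count();}' -c "./unwound"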

Can the loop be made faster? Probably, with loop unwinding, but instead of unwinding the entire loop, we'll try unwinding just a section of it:

/* fastloop.c */

int
main(int argc, char *argv[])
{
    volatile int i;

    for (i = 0; i < (1000000/10); i++) {
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
    }
}

Compiling and running it results in:

# ptime ./fastloop

real        0.003061182
user        0.000777798
sys         0.001979295

This version is over twice as fast as the original loop version. It works better doing 1/10th the loop iterations, but with 10 times more nops per iteration. What if we reduced the iteration count by a factor of 100 instead? That is left as an experiment. At some point, depending on cache line size, we'll start to see a reduction in performance due to cache misses.
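To check that this version is still cache friendly, the same instruction cache miss one-liner used earlier can be pointed at it; the count should stay small, as with the original loop version (the exact number will vary):

# dtrace -q -n 'cpc:::PAPI_l1_icm-user-5000/pid == $target/{@m = count();} END{printa("misses = %@d\n", @m);}' -c "./fastloop"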

Cache line information can be retrieved using prtpicl:

# prtpicl -v | egrep -i 'cache|tlb'
              :dtlb0-2M-4M-associativity        0x4 
              :dtlb0-2M-4M-size     0x20 
              :dtlb-4K-associativity       0x4 
              :dtlb-4K-size         0x40 
              :itlb-2M-4M-associativity    0 
              :itlb-2M-4M-size        0x7 
              :l3-cache-associativity       0x10 
              :l3-cache-line-size  0x40 
              :l3-cache-size       0x800000 
              :itlb-4K-associativity   0x4 
              :itlb-4K-size         0x40 
              :l1-dcache-associativity     0x8 
              :l1-dcache-line-size  0x40 
              :l1-dcache-size      0x8000 
              :shared-l2-tlb-4k-associativity    0x4 
              :shared-l2-tlb-4k-size        0x200 
...

The exact output will depend on the processor you are using. On this machine, the L1 data cache has a line size of 64 (decimal) bytes. The nop instruction is a 1 byte instruction, so 64 nops could fit in one cache line. Of course, there is no guarantee that the nops are aligned in the code in such a way that the first nop is at the beginning of a line. Regardless, 64 nops per iteration might be a good starting point for getting the best cache performance (adjusting the loop count accordingly). If the loop contained other code, you could use mdb to get an idea of how much loop unwinding might help: disassemble the loop and see the addresses where it begins and ends. (Left as an exercise...)

I should also mention that using the cpc counters to do performance analysis should only be done when you have exhausted all other possibilities. This means that the disk and network are not involved, the process is compute bound and not waiting/spinning for mutexes or waiting for other threads, and you have done normal profiling of your application.

We are opening this and future blogs to comments, so hopefully I'll hear what people think, and I can answer questions.

You can also submit your DTrace, MDB, or SmartOS questions to be answered in a future column by emailing MrBruning@joyent.com, ask at #BruningQuestions, or attend one of my upcoming courses.

