Bruning Questions: DTrace Caller Builtin

July 30, 2013 - by Mr. Max Bruning

This post answers a question I occasionally have when I am DTracing code. How do I trace a function but only if it is called from a specific other function? This post shows how to do it when tracing the kernel. Doing the same thing for user level code will be covered in a future blog post.

I am working on a blog post about ZFS transactions. As part of learning more about transactions, I wanted to see arguments passed to txg_list_add(). This function basically adds objects to a transaction group. The function can be found here.

Here it is:

/*
* Add an entry to the list (unless it's already on the list).
* Returns B_TRUE if it was actually added.
*/
boolean_t
txg_list_add(txg_list_t *tl, void *p, uint64_t txg)
{
    int t = txg & TXG_MASK;
    txg_node_t *tn = (txg_node_t *)((char *)p + tl->tl_offset);
    boolean_t add;

    mutex_enter(&tl->tl_lock);
    add = (tn->tn_member[t] == 0);
    if (add) {
        tn->tn_member[t] = 1;
        tn->tn_next[t] = tl->tl_head[t];
        tl->tl_head[t] = tn;
    }
    mutex_exit(&tl->tl_lock);

    return (add);
}

So, the first argument is the list, the second argument is the object being added to the list, and the third argument is the transaction group id. Depending on the list, the type of the second argument varies (hence the declaration to be void). A problem I ran into was how to tell what the second argument is when I am DTrace-ing the function. First, let's take a look at where the function gets called from during a creat(2) system call.

Here's a D script to do that:

#!/usr/sbin/dtrace -qs
BEGIN
{
    txg = 0;
}

zfs_create:entry /* creat(2) calls zfs_create() for zfs file creation */
{
    self->in=1;
}

fbt::txg_list_add:entry
/self->in && txg == 0/
{
    txg = arg2;
    printf("%s: called from %a\n", probefunc, caller);
}

txg_list_add:entry
/arg2 == txg/
{
    printf("%s: called from %a\n", probefunc, caller);
}

zfs_create:return
/self->in/
{
    self->in = 0;
}

(Note: the first txg_list_add will get printed twice). Why not use self->txg = arg2? Because this function is called from other threads for the same transaction group id. I want to capture all of them from the creat(2) call onward. Note that it is possible that some objects are added to the same transaction group prior to the creat(2) call, but I should catch all added after the create.

Here's the output from the script:

# chmod +x txg.d
# ./txg.d

in another window, I run _touch foo_.

txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`zilog_dirty+0x53
txg_list_add: called from zfs`zilog_dirty+0x53
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dir_dirty+0x2e
txg_list_add: called from zfs`dsl_dataset_dirty+0x41
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_dirty+0x92
txg_list_add: called from zfs`vdev_dirty+0x68
txg_list_add: called from zfs`vdev_sync+0x91
txg_list_add: called from zfs`vdev_sync+0x91
txg_list_add: called from zfs`vdev_sync+0x91
txg_list_add: called from zfs`vdev_sync+0xdd
(^c)
#

So, for a given transaction group on a creat(2) system call, txg_list_add is called from zfs`dsl_dir_dirty+0x2e, zfs`dsl_dataset_dirty+0x41, zfs`zilog_dirty+0x53, zfs`vdev_dirty+0x92, zfs`vdev_dirty+0x68, zfs`vdev_sync_0x91, and zfs`vdev_sync+0xdd.

Note that the +hex_value is the byte offset in the function of where txg_list_add() will return when it is finished. To see that, here is the disassembled zfs`dsl_dir_dirty() function:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs mpt idm sd crypto random cpc logindmux ptm sppp nfs ]
> dsl_dir_dirty::dis
dsl_dir_dirty:         pushq %rbp
dsl_dir_dirty+1:        movq  %rsp,%rbp
dsl_dir_dirty+4:        subq  $0x20,%rsp
dsl_dir_dirty+8:        movq  %rbx,-0x18(%rbp)
dsl_dir_dirty+0xc:       movq  %rsi,-0x10(%rbp)
dsl_dir_dirty+0x10:       movq  %rdi,%rbx
dsl_dir_dirty+0x13:       movq  %rdi,-0x8(%rbp)
dsl_dir_dirty+0x17:       movq  0x18(%rdi),%rdi
dsl_dir_dirty+0x1b:       movq  0x38(%rsi),%rdx
dsl_dir_dirty+0x1f:       movq  %rbx,%rsi
dsl_dir_dirty+0x22:       addq  $0x258,%rdi
dsl_dir_dirty+0x29:       call  +0x32062 
dsl_dir_dirty+0x2e:       testl %eax,%eax
dsl_dir_dirty+0x30:       je   +0xc   
dsl_dir_dirty+0x32:       movq  0x10(%rbx),%rdi
dsl_dir_dirty+0x36:       movq  %rbx,%rsi
dsl_dir_dirty+0x39:       call  -0x167fe 
dsl_dir_dirty+0x3e:       movq  -0x18(%rbp),%rbx
dsl_dir_dirty+0x42:       leave 
dsl_dir_dirty+0x43:       ret  
> 

You can see that the caller of txg_list_add() is dsl_dir_dirty+0x2e, which is the location of the testl %eax,%eax immediately after the call to txg_list_add(), which is at dsl_dir_dirty_0x29. Let's take a look at dsl_dir_dirty() (from here).

void
dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx)
{
    dsl_pool_t *dp = dd->dd_pool;

    ASSERT(dd->dd_phys);

    if (txg_list_add(&dp->dp_dirty_dirs, dd, tx->tx_txg)) {
        /* up the hold count until we can be written out */
        dmu_buf_add_ref(dd->dd_dbuf, dd);
    }
}

In this case, the code is adding a dsl_dir_t to the list headed by dp_dirty_dirs, which is a member of the dsl_pool_t structure. In other places, txg_list_add() is called with different lists and objects. I want to print the object (the dsl_dir_t) that is passed into the function. I first tried:

# dtrace -n 'txg_list_add:entry/caller == zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'
dtrace: invalid probe specifier txg_list_add:entry/caller == zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}: in predicate: operands have incompatible types: "function pointer" + "int"
#

Then I tried:

# dtrace -n 'txg_list_add:entry/caller == (uintptr_t)zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'
dtrace: description 'txg_list_add:entry' matched 1 probe

But this probe never fired. I then tried printing the value of (uintptr_t)zfs`dsl_dir_dirty+0x2e.

# dtrace -n 'txg_list_add:entry{printf("caller = %x, dsl_dir_dirty+0x2e = %x\n", caller, (uintptr_t)zfs`dsl_dir_dirty+0x2e);}'
dtrace: description 'txg_list_add:entry' matched 1 probe
CPU   ID          FUNCTION:NAME
 0 48419        txg_list_add:entry caller = fffffffff7cd3dae, dsl_dir_dirty+0x2e = 20ec8348e5894883

 0 48419        txg_list_add:entry caller = fffffffff7cd3dae, dsl_dir_dirty+0x2e = 20ec8348e5894883

...

Of course, caller might not have been dsl_dir_dirty+0x2e, but none of the callers matched (which explains why the preceding DTrace one-liner did not give any output). Why didn't they match? And what is 20ec8348e5894883?

After thinking about this for a short time, I realized that zfs`dsl_dir_dirty+0x2e is being interpreted as an address, and DTrace was going to that address and picking up the contents. In this case, the contents are code.

# mdb -k
> dsl_dir_dirty,8/B
dsl_dir_dirty:
dsl_dir_dirty: 55   48   89   e5   48   83  ec   20
> 

The bytes are swapped (Intel is little endian). And when you add 0x2e plus 0x55, you get 0x83. So, instead of getting the contents, let's try the address.

# dtrace -n 'txg_list_add:entry/caller == (uintptr_t)&zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'
dtrace: description 'txg_list_add:entry' matched 1 probe
CPU   ID          FUNCTION:NAME
 0 48419        txg_list_add:entry dsl_dir_t {
  uint64_t dd_object = 0x4a
  dsl_dir_phys_t *dd_phys = 0xffffff014a61d400
  dmu_buf_t *dd_dbuf = 0xffffff014a61f650
  dsl_pool_t *dd_pool = 0xffffff014a262340
  txg_node_t dd_dirty_link = {
    struct txg_node *[4] tn_next = [ 0, 0, 0, 0 ]
    uint8_t [4] tn_member = [ 0x1, 0, 0, 0 ]
  }
  dsl_dir_t *dd_parent = 0xffffff014a296e00
  kmutex_t dd_lock = {
    void *[1] _opaque = [ 0 ]
  }
  list_t dd_prop_cbs = {
    size_t list_size = 0x30
    size_t list_offset = 0
    struct list_node list_head = {
      struct list_node *list_next = 0xffffff014ac59310
      struct list_node *list_prev = 0xffffff014926a968
    }
  }
  timestruc_t dd_snap_cmtime = {
    time_t tv_sec = 0x51f67b0d
    long tv_nsec = 0x1f8701b4
  }
  uint64_t dd_origin_txg = 0
  uint64_t [4] dd_tempreserved = [ 0, 0, 0, 0 ]
  int64_t [4] dd_space_towrite = [ 0x606000, 0, 0, 0 ]
  char [256] dd_myname = [ "var" ]
}

This works. What if I wanted to DTrace a function, but only if it was called from another specific function?

For instance, txg_list_add() is called from more than one place in zfs`vdev_dirty() (actually, if you look at the source code (see here), you'll see that vdev_dirty() has three different calls to txg_list_add()).

# dtrace -n 'txg_list_add:entry \
/caller >= (uintptr_t)&zfs`vdev_dirty && caller < (uintptr_t)&zfs`vdev_dirty+sizeof(zfs`vdev_dirty)/ \
{printf("txg_list_add called from %a\n", caller);}'
dtrace: description 'txg_list_add:entry' matched 1 probe
CPU   ID          FUNCTION:NAME
 0 48419        txg_list_add:entry txg_list_add called from zfs`vdev_dirty+0x92

 0 48419        txg_list_add:entry txg_list_add called from zfs`vdev_dirty+0x68
...

Figuring out which two of the three calls to txg_list_add() are being made from vdev_dirty() will have to be the topic of another blog. Suffice it to say, that the two shown by DTrace may not correspond to the first and second calls in the C source code (and, in fact, may be the third and second or some other combination). Careful study of the assembler code is required (or compile the code using the -S option to get annotated assembler output).

I wanted to post more about ZFS transactions, but I am still interpreting the results of tracing all of the calls to txg_list_add(). I have a script which shows the objects added to the transaction group list, but I am (so far), underwhelmed with what they tell me about the transactions. Consider this a work in progress. One thing I can see is that most of the calls from a specific caller are passing in almost identical objects each time. Since I wasn't sure what I expected, I need to do more digging.

:

Sign up Now for Instant Cloud Access

Get Started

View PricingSee Benchmarks