DTrace Caller Builtin

This post answers a question I occasionally have when I am DTracing code. How do I trace a function but only if it is called from a specific other function? This post shows how to do it when tracing the kernel. Doing the same thing for user level code will be covered in a future blog post.

I am working on a blog post about ZFS transactions. As part of learning more about transactions, I wanted to see arguments passed to txg_list_add(). This function basically adds objects to a transaction group. The function can be found here.

Here it is:

/** Add an entry to the list (unless it's already on the list).* Returns B_TRUE if it was actually added.*/boolean_ttxg_list_add(txg_list_t *tl, void *p, uint64_t txg){    int t = txg & TXG_MASK;    txg_node_t *tn = (txg_node_t *)((char *)p + tl->tl_offset);    boolean_t add;    mutex_enter(&tl->tl_lock);    add = (tn->tn_member[t] == 0);    if (add) {        tn->tn_member[t] = 1;        tn->tn_next[t] = tl->tl_head[t];        tl->tl_head[t] = tn;    }    mutex_exit(&tl->tl_lock);    return (add);}

So, the first argument is the list, the second argument is the object being added to the list, and the third argument is the transaction group id. Depending on the list, the type of the second argument varies (hence the declaration to be void). A problem I ran into was how to tell what the second argument is when I am DTrace-ing the function. First, let's take a look at where the function gets called from during a creat(2) system call.

Here's a D script to do that:

#!/usr/sbin/dtrace -qsBEGIN{    txg = 0;}zfs_create:entry /* creat(2) calls zfs_create() for zfs file creation */{    self->in=1;}fbt::txg_list_add:entry/self->in && txg == 0/{    txg = arg2;    printf("%s: called from %a\n", probefunc, caller);}txg_list_add:entry/arg2 == txg/{    printf("%s: called from %a\n", probefunc, caller);}zfs_create:return/self->in/{    self->in = 0;}

(Note: the first txg_list_add will get printed twice). Why not use self->txg = arg2? Because this function is called from other threads for the same transaction group id. I want to capture all of them from the creat(2) call onward. Note that itis possible that some objects are added to the same transaction group prior to the creat(2) call, but I should catch all added after the create.

Here's the output from the script:

# chmod +x txg.d# ./txg.d

in another window, I run touch foo.

txg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`zilog_dirty+0x53txg_list_add: called from zfs`zilog_dirty+0x53txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dir_dirty+0x2etxg_list_add: called from zfs`dsl_dataset_dirty+0x41txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_dirty+0x92txg_list_add: called from zfs`vdev_dirty+0x68txg_list_add: called from zfs`vdev_sync+0x91txg_list_add: called from zfs`vdev_sync+0x91txg_list_add: called from zfs`vdev_sync+0x91txg_list_add: called from zfs`vdev_sync+0xdd(^c)#

So, for a given transaction group on a creat(2) system call, txg_list_add is called from zfs`dsl_dir_dirty+0x2e, zfs`dsl_dataset_dirty+0x41, zfs`zilog_dirty+0x53, zfs`vdev_dirty+0x92, zfs`vdev_dirty+0x68, zfs`vdev_sync_0x91, and zfs`vdev_sync+0xdd.

Note that the +hex_value is the byte offset in the function of where txg_list_add() will return when it is finished. To see that, here is the disassembled zfs`dsl_dir_dirty() function:

# mdb -kLoading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs mpt idm sd crypto random cpc logindmux ptm sppp nfs ]> dsl_dir_dirty::disdsl_dir_dirty:         pushq %rbpdsl_dir_dirty+1:        movq  %rsp,%rbpdsl_dir_dirty+4:        subq  $0x20,%rspdsl_dir_dirty+8:        movq  %rbx,-0x18(%rbp)dsl_dir_dirty+0xc:       movq  %rsi,-0x10(%rbp)dsl_dir_dirty+0x10:       movq  %rdi,%rbxdsl_dir_dirty+0x13:       movq  %rdi,-0x8(%rbp)dsl_dir_dirty+0x17:       movq  0x18(%rdi),%rdidsl_dir_dirty+0x1b:       movq  0x38(%rsi),%rdxdsl_dir_dirty+0x1f:       movq  %rbx,%rsidsl_dir_dirty+0x22:       addq  $0x258,%rdidsl_dir_dirty+0x29:       call  +0x32062 dsl_dir_dirty+0x2e:       testl %eax,%eaxdsl_dir_dirty+0x30:       je   +0xc   dsl_dir_dirty+0x32:       movq  0x10(%rbx),%rdidsl_dir_dirty+0x36:       movq  %rbx,%rsidsl_dir_dirty+0x39:       call  -0x167fe dsl_dir_dirty+0x3e:       movq  -0x18(%rbp),%rbxdsl_dir_dirty+0x42:       leave dsl_dir_dirty+0x43:       ret  > 

You can see that the caller of txg_list_add() is dsl_dir_dirty+0x2e, which is the location of the testl %eax,%eax immediately after the call to txg_list_add(), which is at dsl_dir_dirty_0x29. Let's take a look at dsl_dir_dirty() (from here).

voiddsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx){    dsl_pool_t *dp = dd->dd_pool;    ASSERT(dd->dd_phys);    if (txg_list_add(&dp->dp_dirty_dirs, dd, tx->tx_txg)) {        /* up the hold count until we can be written out */        dmu_buf_add_ref(dd->dd_dbuf, dd);    }}

In this case, the code is adding a dsl_dir_t to the list headed by dp_dirty_dirs, which is a member of the dsl_pool_t structure. In other places, txg_list_add() is called with different lists and objects. I want to print the object (the dsl_dir_t) that is passed into the function. I first tried:

# dtrace -n 'txg_list_add:entry/caller == zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'dtrace: invalid probe specifier txg_list_add:entry/caller == zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}: in predicate: operands have incompatible types: "function pointer" + "int"#

Then I tried:

# dtrace -n 'txg_list_add:entry/caller == (uintptr_t)zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'dtrace: description 'txg_list_add:entry' matched 1 probe

But this probe never fired. I then tried printing the value of (uintptr_t)zfs`dsl_dir_dirty+0x2e.

# dtrace -n 'txg_list_add:entry{printf("caller = %x, dsl_dir_dirty+0x2e = %x\n", caller, (uintptr_t)zfs`dsl_dir_dirty+0x2e);}'dtrace: description 'txg_list_add:entry' matched 1 probeCPU   ID          FUNCTION:NAME 0 48419        txg_list_add:entry caller = fffffffff7cd3dae, dsl_dir_dirty+0x2e = 20ec8348e5894883 0 48419        txg_list_add:entry caller = fffffffff7cd3dae, dsl_dir_dirty+0x2e = 20ec8348e5894883...

Of course, caller might not have been dsl_dir_dirty+0x2e, but none of the callers matched (which explains why the preceding DTrace one-liner did not give any output). Why didn't they match? And what is 20ec8348e5894883?

After thinking about this for a short time, I realized that zfs`dsl_dir_dirty+0x2e is being interpreted as an address, and DTrace was going to that address and picking up the contents. In this case, the contents are code.

# mdb -k> dsl_dir_dirty,8/Bdsl_dir_dirty:dsl_dir_dirty: 55   48   89   e5   48   83  ec   20> 

The bytes are swapped (Intel is little endian). And when you add 0x2e plus 0x55, you get 0x83. So, instead of getting the contents, let's try the address.

# dtrace -n 'txg_list_add:entry/caller == (uintptr_t)&zfs`dsl_dir_dirty+0x2e/{print(*(dsl_dir_t *)arg1);}'dtrace: description 'txg_list_add:entry' matched 1 probeCPU   ID          FUNCTION:NAME 0 48419        txg_list_add:entry dsl_dir_t {  uint64_t dd_object = 0x4a  dsl_dir_phys_t *dd_phys = 0xffffff014a61d400  dmu_buf_t *dd_dbuf = 0xffffff014a61f650  dsl_pool_t *dd_pool = 0xffffff014a262340  txg_node_t dd_dirty_link = {    struct txg_node *[4] tn_next = [ 0, 0, 0, 0 ]    uint8_t [4] tn_member = [ 0x1, 0, 0, 0 ]  }  dsl_dir_t *dd_parent = 0xffffff014a296e00  kmutex_t dd_lock = {    void *[1] _opaque = [ 0 ]  }  list_t dd_prop_cbs = {    size_t list_size = 0x30    size_t list_offset = 0    struct list_node list_head = {      struct list_node *list_next = 0xffffff014ac59310      struct list_node *list_prev = 0xffffff014926a968    }  }  timestruc_t dd_snap_cmtime = {    time_t tv_sec = 0x51f67b0d    long tv_nsec = 0x1f8701b4  }  uint64_t dd_origin_txg = 0  uint64_t [4] dd_tempreserved = [ 0, 0, 0, 0 ]  int64_t [4] dd_space_towrite = [ 0x606000, 0, 0, 0 ]  char [256] dd_myname = [ "var" ]}

This works. What if I wanted to DTrace a function, but only if it was called from another specific function?

For instance, txg_list_add() is called from more than one place in zfs`vdev_dirty() (actually, if you look at the source code (see here), you'll see that vdev_dirty() has three different calls to txg_list_add()).

# dtrace -n 'txg_list_add:entry \/caller >= (uintptr_t)&zfs`vdev_dirty && caller < (uintptr_t)&zfs`vdev_dirty+sizeof(zfs`vdev_dirty)/ \{printf("txg_list_add called from %a\n", caller);}'dtrace: description 'txg_list_add:entry' matched 1 probeCPU   ID          FUNCTION:NAME 0 48419        txg_list_add:entry txg_list_add called from zfs`vdev_dirty+0x92 0 48419        txg_list_add:entry txg_list_add called from zfs`vdev_dirty+0x68...

Figuring out which two of the three calls to txg_list_add() are being made from vdev_dirty() will have to be the topic of another blog. Suffice it to say, that the two shown by DTrace may not correspond to the first and second calls in the C source code (and, in fact, may be the third and second or some other combination). Careful study of the assembler code is required (or compile the code using the -S option to get annotated assembler output).

I wanted to post more about ZFS transactions, but I am still interpreting the results of tracing all of the calls to txg_list_add(). I have a script which shows the objects added to the transaction group list, but I am (so far), underwhelmed with what they tell me about the transactions. Consider this a work in progress. One thing I can see is that most of the calls from a specific caller are passing in almost identical objects each time. Since I wasn't sure what I expected, I need to do more digging.



Post written by Mr. Max Bruning