What's in the ARC?

February 13, 2014 - by Mr. Max Bruning

This post assumes some knowledge of ZFS internals. You can obtain this knowledge by reading the source code and/or by taking a ZFS Internals course. The post also makes extensive use of mdb(1) (http://www.illumos.org/man/1/mdb). A full description of mdb is well beyond the scope of this blog post.

When teaching the ZFS Internals course, I often give students the following lab:

"For an application that reads data from a file, find the data in the ARC."

The ARC (Adaptive Replacement Cache) is an in-memory cache of recently and/or frequently accessed data/metadata from disk. ZFS file system (and volume) data and metadata are read/written via the ARC. A good description can be found in the source code at usr/src/uts/common/fs/zfs/arc.c. A more general (and possibly more useful) task is to determine how much of the ARC a given file, file system, or volume is using.

To do the lab, we'll set up a simple ZFS pool using a file, then we'll put some (known) data into a file in the pool. Then we'll run a program to read the data, then we'll look for the data in ARC. For this lab, you'll need a system running SmartOS (illumos, OpenIndiana, and probably Solaris 10 and 11 variants should also work). The system should not be "busy". If there is a lot of file system activity, the data for the file may not stay cached for very long.

Here are the first steps:

# mkfile 100m /var/tmp/zfsfile  <-- create a file to be used for the pool.
# zpool create testpool /var/tmp/zfsfile
# cp /usr/dict/words /testpool/words  <-- our file with known data
# zpool export testpool
# zpool import -d /var/tmp testpool

We export the pool to clear the ARC of any data left from the cp(1).

Now we'll read in the words file and find it in ARC (or not, if the system is very busy). First we'll just go through the steps, then we'll go through some explanation.

# dd if=/testpool/words of=/dev/null bs=128k
1+1 records in
1+1 records out
# ls -i /testpool/words
      2040 /testpool/words
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> 0t2040=K  <- convert inumber (object id) to hex
> ::dbufs -o 7f8 -n testpool| ::dbuf
        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00cf43d9f8      7f8 0         1  0 testpool
ffffff00ce7a31b0      7f8 1         0  2 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0  <-- beginning of file
    uint64_t db.db_size = 0x20000  <-- 128k
    void *db.db_data = 0xffffff0043402000  <-- location of arc data buffer
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
> ffffff0043402000,10/c
                1st   <-- this is the beginning of the "words" file

> ffffff00ccdb0ee0::print -t arc_buf_t
arc_buf_t {
    arc_buf_hdr_t *b_hdr = 0xffffff00cfb0c708
    arc_buf_t *b_next = 0
    kmutex_t b_evict_lock = {
        void *[1] _opaque = [ 0 ]
    void *b_data = 0xffffff0043402000
    arc_evict_func_t *b_efunc = dbuf_do_evict
    void *b_private = 0xffffff00cf433e80
> ffffff00cfb0c708::print -t arc_buf_hdr_t
arc_buf_hdr_t {
    dva_t b_dva = {
        uint64_t [2] dva_word = [ 0x100, 0x24400 ]
    uint64_t b_birth = 0x1e596
    uint64_t b_cksum0 = 0x2f6c9bcce37c
    ... <-- output omitted
    arc_buf_hdr_t *b_hash_next = 0xffffff00c82e8a90
    arc_buf_t *b_buf = 0xffffff00ccdb0ee0
    uint64_t b_size = 0x20000
    uint64_t b_spa = 0x1fc28bd029207b7b
    arc_state_t *b_state = ARC_mfu  <-- buffer is in MFU list

The data structures used to maintain the ARC are arc_buf_hdr_t and arc_buf_t. These data structures are used to determine if a buffer is in the ARC, and, if so, where (mru, mfu, mru ghost, mfu ghost, l2arc). (The ghost lists are used to determine when the mru or mfu cache is too small.) However, they do not identify which object the cached data/metadata belongs to. For this, the dmu_buf_impl_t structure (hereafter referred to as a "dbuf") can be used. Note that not everything in the ARC is mapped by dbufs.

The following diagram shows the data structures used by the DMU to manage data and metadata in the ARC.

DBUF_HASH(objset, objid, level, blkid)
|   ______       ------> hash chain of dmu_buf_impl_t structs
|   |-----|0     |          _              ___
|   |-----|      | |------>|_|------------>|  | dnode_t    __
|   |-----|    __|_|  dnode_handle_t _____ |__|----------->| |dnode_phys_t
|-->|-----|-->|    |--------------->|    |                 |_|(in metadata)
    |-----|   |____|dmu_buf_impl_t  |    |  data/metadata
    |-----|      |                  |    |   (or bonus buffer)
    |-----|      |-----------       |    |
    |-----|                  |      |    |
        |_____|hash_table_mask+1 |  /-->|____|
dbuf_hash_table.hash_table   |  |
                          __ V__|_
                              |      |
buf_hash(spa, dva, birth) |______|arc_buf_t (NULL for bonus buffer)
|                             ^
|  _____                      |
|  |----|0                 ___V___
|->|----|----------------->|      | arc_buf_hdr_t
   |----|                  |______|
   |----|                       |------> hash chain of arc_buf_hdr_t

The following describes the mdb commands that were used to find the data.

> ::dbufs -o 7f8 -n testpool| ::dbuf

The ::dbufs dcmd walks the dmu_buf_impl_t cache of allocated dbufs. The "-o 7f8" only displays entries with object id 0x7f8, the "inumber" of the words file, and the "-n testpool" only shows those entries in the testpool object set. The "::dbuf" dcmd displays a summary of the dmu_buf_impl_t.

The output of the above command shows the address of the dmu_buf_impl_t, the object id, the level of indirection, the block id, the number of holds on the buffer, and the object set name. ZFS can use up to 6 levels of indirect blocks.

The object id will either be a number (for instance, 0x7f8), or "mdn" (meta dnode), which is used for objset_phys_t structures which are in memory. The objset_phys_t data structure contains information about the meta object set (the MOS), which describes the root of a pool, child datasets, clones, snapshots, dedup table, volumes, and the space map for a pool, among other things. There are also objset_phys_t structures for each dataset, clone, volume, child dataset, and snapshot, which locate the objects (files, directories) within the object set.

The block id identifies which block in the object is referenced by the dmu_buf_impl_t, or the block id contains the string "bonus". The bonus buffer (a field in the dnode_phys_t) contains attributes (ownership, timestamps, permissions, etc.) of an object. Note that entries marked "bonus" have a NULL value for the arc_buf_t * field in the dmu_buf_impl_t. The bonus buffer is in the ARC, but is there as part of the dnode_phys_t for the object. The bonus DMU buffers are copies of the data from the corresponding dnode_phys_t. And the dnode_phys_t that contains the bonus buffer is also in the DMU cache (and ARC).

The "holds" value says how many things are currently using the DMU buffer. The buffer cannot be freed if the hold count is non-zero.

Let's look at the output of "::dbufs -o 7f8 -n testpool | ::dbuf":

        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00cf43d9f8      7f8 0         1  0 testpool
ffffff00ce7a31b0      7f8 1         0  2 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool

For object id 0x7f8, there are 4 dbufs. The first is for the first block (128k bytes), and the second is for the second block in the file. The third one is for a level 1 indirect block. It contains the block pointers that point to the blocks described by the first two entries. The last one is for the bonus buffer for the file. The dnode_phys_t describing the file is in an "mdn" dbuf.

At this point, we get more information about the first dbuf.

> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf db_dnode_handle
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0
    uint64_t db.db_size = 0x20000
    void *db.db_data = 0xffffff0043402000
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
struct dnode_handle *db_dnode_handle = 0xffffff00ccf520c8

The db member describes the buffer. The db.db_data field is the address of where the buffer starts in memory. Going to that address shows the first 128k of data for the words file.

> ffffff0043402000,20000/c
                1st   <-- this is the beginning of the "words" file

The arc_buf_t contains a pointer to the arc_buf_hdr_t for the buffer, which in turn shows that the buffer is in the ARC_mfu cache. Note that the address of the buffer in the arc_buf_t (b_data) matches the db_data field in the dmu_buf_impl_t. The b_private field in the arc_buf_t is a pointer back to the dmu_buf_impl_t.

Now let's look at the dnode_t for the file.

> ffffff00cf433e80::print -t dmu_buf_impl_t db_dnode_handle | ::print -t dnode_handle_t dnh_dnode | ::print -t dnode_t
dnode_t {
    list_node_t dn_link = {   <-- linked list of all dnodes on system
    struct objset *dn_objset = 0xffffff00cd063040
    uint64_t dn_object = 0x7f8
    struct dmu_buf_impl *dn_dbuf = 0xffffff00ce7ae268  <-- dbuf for dnode_phys_t
    struct dnode_handle *dn_handle = 0xffffff00ccf520c8
    dnode_phys_t *dn_phys = 0xffffff00cf8e8000
    dmu_object_type_t dn_type = 0t19 (DMU_OT_PLAIN_FILE_CONTENTS)
    uint16_t dn_bonuslen = 0xa8
    uint8_t dn_bonustype = 0x2c
    uint32_t dn_dbufs_count = 0x4
    refcount_t dn_holds = {
        uint64_t rc_count = 0x4
    list_t dn_dbufs = {  <-- dbufs with this dnode_t
    struct dmu_buf_impl *dn_bonus = 0xffffff00ce7b8860

The dnode_t contains a pointer to a dmu_buf_impl_t. Let's look at this:

> ffffff00ce7ae268::dbuf
        addr object lvl blkid holds os
ffffff00ce7ae268      mdn 0        3f  1 testpool

So, the dbuf that contains the dnode_phys_t for the words file is a meta dnode object, at indirect level 0 and at block id 0x3f. Let's take a closer look.

> ffffff00ce7ae268::print -t dmu_buf_impl_t
dmu_buf_impl_t {
    dmu_buf_t db = {
        uint64_t db_object = 0
        uint64_t db_offset = 0xfc000
        uint64_t db_size = 0x4000
        void *db_data = 0xffffff00cf8e5000
    struct objset *db_objset = 0xffffff00cd063040
    struct dnode_handle *db_dnode_handle = 0xffffff00cd063060
    struct dmu_buf_impl *db_parent = 0xffffff00ce795e48
    struct dmu_buf_impl *db_hash_next = 0
    uint64_t db_blkid = 0x3f
    blkptr_t *db_blkptr = 0xffffff00cf8f2f80
    dbuf_states_t db_state = 4 (DB_CACHED)
    refcount_t db_holds = {
        uint64_t rc_count = 0x1
    arc_buf_t *db_buf = 0xffffff00ccdb0d30
    void *db_user_ptr = 0xffffff00ccf51e80

This dbuf is for a 16k (0x4000) byte block at offset 0xfc000. Note that the blkid (0x3f) times the block size (0x4000) gives the offset of 0xfc000. This is a block of dnode_phys_t structures.

> ::sizeof dnode_phys_t
sizeof (dnode_phys_t) = 0x200
> 4000%200=K  <-- block size is 0x4000, dnode_phys_t size is 0x200
                20              <-- 32 dnode_phys_t / block
> 7f8%20=K  <-- 0x7f8 is the object id for words
                3f  <-- matches the db_blkid
> 3f*20=K   <-- where does block containing 7f8 begin?
> 7f8-7e0=K  <-- get offset from beginning of block
> ffffff00cf8e5000+(18*200)=K  <-- get address of dnode_phys_t
                                   for words file
            ffffff00cf8e8000 <-- matches dn_phys in dnode_t above
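
The same arithmetic can be written out as a short Python sketch. The locate_dnode() helper is made up for illustration; the constants and addresses are the ones from the mdb session above, and the sketch assumes (as on this system) 0x200-byte dnode_phys_t structures packed into 0x4000-byte blocks.

```python
DNODE_SIZE = 0x200    # from "::sizeof dnode_phys_t"
BLOCK_SIZE = 0x4000   # db_size of the "mdn" dbuf

def locate_dnode(object_id, db_data):
    """Return (blkid, block offset, address of the dnode_phys_t).

    object_id -- object id (inumber) of the file
    db_data   -- db_data address of the mdn dbuf holding that block
    """
    per_block = BLOCK_SIZE // DNODE_SIZE          # 0x20 dnodes per block
    blkid, index = divmod(object_id, per_block)   # which block, which slot
    return blkid, blkid * BLOCK_SIZE, db_data + index * DNODE_SIZE

blkid, offset, addr = locate_dnode(0x7f8, 0xffffff00cf8e5000)
print(hex(blkid), hex(offset), hex(addr))
# 0x3f 0xfc000 0xffffff00cf8e8000
```

The three values line up with db_blkid, db_offset, and dn_phys in the output shown earlier.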

If the ARC buffer is evicted, a callback (dbuf_do_evict()) will clean up the dmu_buf_impl_t. See the comment before dbuf_clear() in uts/common/fs/zfs/dbuf.c for some details. Here is the same ::dbufs command run as before, but after some ARC/dbuf evictions.

> ::dbufs -o 7f8 -n testpool | ::dbuf
        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00ce7a31b0      7f8 1         0  1 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool

So, one of the buffers (the one containing the second block of the file) is no longer cached.

An interesting question to ask may be: For a given file, how much of the file data/metadata is in ARC?

To do this, we'll use a file I have been intermittently looking at over time.

# ls -i /var/tmp/foo.out
      1337 -rw-r--r--   1 root     root     13328871 Jan  6 09:56 /var/tmp/foo.out
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic
      uppc apix scsi_vhci ufs ip hook neti sockfs arp usba
      stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc
      logindmux ptm sppp nfs ]
> ::dbufs -o 0t1337 | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
6832128  <-- decimal, about 6.8MB
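
For readers less comfortable with sed and awk, the tail end of that pipeline can be sketched in Python. This only models the text processing that happens after mdb has printed the b_size lines; the sum_b_size() helper and the sample lines are invented for illustration, and the line format is assumed from the sed expression used above.

```python
import re

def sum_b_size(lines):
    """Sum the decimal b_size values from '::print -d' style output."""
    total = 0
    for line in lines:
        # "-d" makes mdb print decimal values with a "0t" prefix.
        m = re.search(r"b_size = 0t(\d+)", line)
        if m:
            total += int(m.group(1))
    return total

# Made-up sample: two 128k data blocks and one 16k metadata block.
sample = [
    "uint64_t b_size = 0t131072",
    "uint64_t b_size = 0t131072",
    "uint64_t b_size = 0t16384",
]
print(sum_b_size(sample))   # 278528
```

Lines that do not contain a b_size value are simply skipped, which is slightly more forgiving than the sed/awk version.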

Maybe more interesting is how much ARC space a given dataset or volume is using. The following shows total space in ARC used by the testpool dataset.

> ::dbufs -n testpool | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'

Volumes are used in the Joyent Public Cloud (JPC) for kvm-based virtual machines. Let's get the space used by a Fedora instance. First, the list of virtual machines.

# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
8d02acc6-a9cc-e033-f196-aa6841702872  OS    768      running           vm1
9d3ccd6c-511b-60c5-db07-d742941bb62b  KVM   1024     running           ubuntu-1
a5528066-171a-694f-85ef-cac9928c9fd3  OS    2048     running           vm1
dd6fb539-cfac-c84a-f336-d1232a6f673e  OS    2048     running           -
62293978-3947-eb5a-dcdf-a6b4728b39bf  KVM   8192     running           maxfedora
ed3f45b1-833a-438d-8214-3876a58d9371  OS    8192     running

Volumes are more difficult, as we cannot use the volume name with the "-n" option of "::dbufs". The following command line walks through the set of all dnode_t on the system. For each one, it checks the type of the dnode_t, looking for a type value of 0x17 (DMU_OT_ZVOL, i.e., a volume). For each dnode_t of that type, it prints out the name of the volume.

> ::walk dnode_t d | ::print dnode_t dn_phys | ::print
      dnode_phys_t dn_type | ::grep ".==17" | ::eval '

This is a long one-liner. Briefly, it walks the list of dnode_t in memory. For each dnode_t, the walker stores the address of the dnode_t in an mdb variable ("d") and prints the dn_phys value (dn_phys is the address of a dnode_phys_t, which is an in-memory copy of the same data structure that is on disk). The dn_type field refers to the "type" of the dnode_phys_t. If the type is 0x17, the dnode_t (and corresponding dnode_phys_t) is for a ZFS volume. The "::eval" gets the value of the "d" variable (the dnode_t with type equal to 0x17) and prints the dn_objset for that dnode. At the end, this one-liner will list the names of all of the ZFS volumes currently on the system.

To find the amount of ARC space consumed by the "maxfedora" virtual machine (uuid = 62293978-3947-eb5a-dcdf-a6b4728b39bf-disk1), we can find all dbufs whose dnode handle takes us to the dnode_t for the volume. Using the above command, we want the first dnode_t.

> ::walk dnode_t d | ::print dnode_t dn_phys | ::print
      dnode_phys_t dn_type | ::grep ".==17" | ::eval '

The dnode_t contains a list of all dbufs that are used for it. We'll walk the list of dbufs, and for each one that has a non-NULL arc buf pointer, we'll get the size from the arc buf header and add them up. To walk the list of dbufs in the dnode_t, we need to know the address of the list.

> ::offsetof dnode_t dn_dbufs
offsetof (dnode_t, dn_dbufs) = 0x248, sizeof (...->dn_dbufs) = 0x20

Adding the offset of the dn_dbufs member to the address of the dnode_t for the "62293978-3947-eb5a-dcdf-a6b4728b39bf-disk0" volume, we'll walk the list of dbufs for the volume. This volume is the system disk for the "maxfedora" image.

> ffffff0dc5845710 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
216375296  <-- number of bytes mapped in ARC for the volume

And here is the data disk (/data) in the Fedora instance.

> ffffff11c639ebe0 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
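
In the two commands above, the address handed to "::walk list" is just the dnode_t address plus the offset of the dn_dbufs member. A small Python sketch of that address arithmetic follows; the dbuf_list_addr() helper is made up for illustration, and the 0x248 offset is the one reported by ::offsetof on this particular system, so it may well differ on other builds.

```python
DN_DBUFS_OFFSET = 0x248   # from "::offsetof dnode_t dn_dbufs" on this system

def dbuf_list_addr(dnode_addr):
    """Address of the dn_dbufs list_t embedded in a dnode_t."""
    return dnode_addr + DN_DBUFS_OFFSET

print(hex(dbuf_list_addr(0xffffff0dc5845710)))   # system disk volume
# 0xffffff0dc5845958
print(hex(dbuf_list_addr(0xffffff11c639ebe0)))   # data disk volume
# 0xffffff11c639ee28
```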

As mentioned earlier, not all of what is in the ARC is mapped by dbufs. First of all, not all dbufs refer to arc buffers (the db_buf field in the dmu_buf_impl_t can be NULL). All but 2 of these dbufs exist to give quick access to bonus buffers. The bonus buffers are in the ARC, but as part of the dnode_phys_t which contains them. There are also many arc buffers that do not have a pointer back to a dbuf (b_private in the arc_buf_t is NULL). I have looked at some of these arc buffers and have found a few different types of metadata, but also some buffers which contain data. One possible reason for this is prefetch.

Here is a way to see all of ARC that is mapped by dbufs.

> ::walk dmu_buf_impl_t d| ::print dmu_buf_impl_t db_buf |
::grep ".!=0" | ::eval "

And here is a way to see space in ARC.

> ::walk arc_buf_t | ::print -t arc_buf_t b_hdr | ::print -t
      -d arc_buf_hdr_t b_size !sed -e 's/uint64_t b_size = 0t//' |
      awk '{sum+=$1} END{print sum}'

Note that this number matches closely with the size shown by:

> ::arc !grep size
size                      =      1200 MB

Determining the cause of the difference (1181588480 for dbufs, and 1236963840 for arc buffers) is left as an exercise for the reader.

All of this is very interesting, but also quite a few steps. It would be nice to have an lsarc command that lists what files are in the arc, how much data/metadata is in arc for a given file/dataset/volume, a breakdown between data and metadata, and even which arc cache the data is on (MRU or MFU). Once you understand that the dbufs provide a map of (most of) the arc, this command becomes possible.