What's in the ARC?

This post assumes some knowledge of ZFS internals. You can obtain this knowledge by reading the source code and/or by taking a ZFS Internals course. The post also makes extensive use of mdb(1) (http://www.illumos.org/man/1/mdb). A full description of mdb is well beyond the scope of this blog post.

When teaching the ZFS Internals course, I often give students the following lab:

"For an application that reads data from a file, find the data in the ARC."

The ARC (Adaptive Replacement Cache) is an in-memory cache of recently and/or frequently accessed data and metadata from disk. ZFS file system (and volume) data and metadata are read and written via the ARC. A good description can be found in the source code at usr/src/uts/common/fs/zfs/arc.c. A more general (and possibly more useful) question is how much of the ARC a given file, file system, or volume is using.

To do the lab, we'll set up a simple ZFS pool backed by a file, then put some (known) data into a file in the pool. Next we'll run a program to read the data, and then look for the data in the ARC. For this lab, you'll need a system running SmartOS (illumos, OpenIndiana, and probably Solaris 10 and 11 variants should also work). The system should not be "busy". If there is a lot of file system activity, the data for the file may not stay cached for very long.

Here are the first steps:

# mkfile 100m /var/tmp/zfsfile  <-- create a file to be used for the pool.
# zpool create testpool /var/tmp/zfsfile
# cp /usr/dict/words /testpool/words  <-- our file with known data
# zpool export testpool
# zpool import -d /var/tmp testpool

We export the pool to clear the ARC of any data left from the cp(1).

Now we'll read in the words file and find it in the ARC (or not, if the system is very busy). First we'll just go through the steps, then we'll go through some explanation.

# dd if=/testpool/words of=/dev/null bs=128k
1+1 records in
1+1 records out
# ls -i /testpool/words
      2040 /testpool/words
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> 0t2040=K  <-- convert inumber (object id) to hex
                7f8
> ::dbufs -o 7f8 -n testpool| ::dbuf
        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00cf43d9f8      7f8 0         1  0 testpool
ffffff00ce7a31b0      7f8 1         0  2 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0  <-- beginning of file
    uint64_t db.db_size = 0x20000  <-- 128k
    void *db.db_data = 0xffffff0043402000  <-- location of arc data buffer
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
> ffffff0043402000,10/c
                1st   <-- this is the beginning of the "words" file
                2nd
                3rd
> ffffff00ccdb0ee0::print -t arc_buf_t
arc_buf_t {
    arc_buf_hdr_t *b_hdr = 0xffffff00cfb0c708
    arc_buf_t *b_next = 0
    kmutex_t b_evict_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *b_data = 0xffffff0043402000
    arc_evict_func_t *b_efunc = dbuf_do_evict
    void *b_private = 0xffffff00cf433e80
}
> ffffff00cfb0c708::print -t arc_buf_hdr_t
arc_buf_hdr_t {
    dva_t b_dva = {
        uint64_t [2] dva_word = [ 0x100, 0x24400 ]
    }
    uint64_t b_birth = 0x1e596
    uint64_t b_cksum0 = 0x2f6c9bcce37c
    ... <-- output omitted
    arc_buf_hdr_t *b_hash_next = 0xffffff00c82e8a90
    arc_buf_t *b_buf = 0xffffff00ccdb0ee0
    ...
    uint64_t b_size = 0x20000
    uint64_t b_spa = 0x1fc28bd029207b7b
    arc_state_t *b_state = ARC_mfu  <-- buffer is in MFU list
    ...
}

The data structures used to maintain the ARC are arc_buf_hdr_t and arc_buf_t. These data structures are used to determine if a buffer is in the ARC, and, if so, where (mru, mfu, mru ghost, mfu ghost, l2arc). (The ghost lists are used to determine when the mru or mfu cache is too small.) But they do not identify what object the data/metadata belongs to. For this, the dmu_buf_impl_t structure (hereafter referred to as a "dbuf") can be used. Note that not everything in the ARC is mapped by dbufs.

The following diagram shows the data structures used by the DMU to manage data and metadata in the ARC.

DBUF_HASH(objset, objid, level, blkid)
|
|   ______       ------> hash chain of dmu_buf_impl_t structs
|   |-----|0     |          _              ___
|   |-----|      | |------>|_|------------>|  | dnode_t    __
|   |-----|    __|_|  dnode_handle_t _____ |__|----------->| |dnode_phys_t
|-->|-----|-->|    |--------------->|    |                 |_|(in metadata)
    |-----|   |____|dmu_buf_impl_t  |    |  data/metadata
    |-----|      |                  |    |   (or bonus buffer)
    |-----|      |-----------       |    |
    |-----|                  |      |    |
    |_____|hash_table_mask+1 |  /-->|____|
dbuf_hash_table.hash_table   |  |
                          __ V__|_
                          |      |
buf_hash(spa, dva, birth) |______|arc_buf_t (NULL for bonus buffer)
|                             ^
|  _____                      |
|  |----|0                 ___V___
|->|----|----------------->|      | arc_buf_hdr_t
   |----|                  |______|
   |----|                       |------> hash chain of arc_buf_hdr_t
   |____|ht_mask+1
   buf_hash_table.ht_table

The following describes the mdb commands that were used to find the data.

> ::dbufs -o 7f8 -n testpool| ::dbuf

The ::dbufs dcmd walks the dmu_buf_impl_t cache of allocated dbufs. The "-o 7f8" only displays entries with object id 0x7f8, the "inumber" of the words file, and the "-n testpool" only shows those entries in the testpool object set. The "::dbuf" dcmd displays a summary of the dmu_buf_impl_t.
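The hex object id passed to "-o" is simply the inode number from ls -i written in base 16. As a minimal sketch (Python is used here purely for illustration, and the helper name dbufs_command is made up for this post), the conversion and the resulting dcmd line could be built like this:

```python
def dbufs_command(inumber: int, objset: str) -> str:
    """Build the ::dbufs pipeline for a file's inode number.

    Hypothetical helper: it only formats the string typed at the mdb
    prompt above; it does not talk to mdb itself.
    """
    object_id = format(inumber, "x")  # same conversion as "0t2040=K" in mdb
    return f"::dbufs -o {object_id} -n {objset} | ::dbuf"


print(dbufs_command(2040, "testpool"))
```

The only flags used are the "-o" and "-n" options already shown in the session above.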

The output of the above command shows the address of the dmu_buf_impl_t, the object id, the level of indirection, the block id, the number of holds on the object, and the object set name. ZFS can use up to 6 levels of indirect blocks.

The object id will either be a number (for instance, 0x7f8), or "mdn" (meta dnode), which is used for objset_phys_t structures which are in memory. The objset_phys_t data structure contains information about the meta object set (the MOS), which describes the root of a pool, child datasets, clones, snapshots, dedup table, volumes, and the space map for a pool, among other things. There are also objset_phys_t structures for each dataset, clone, volume, child dataset, and snapshot, which locate the objects (files, directories) within the object set.

The block id identifies which block in the object is referenced by the dmu_buf_impl_t, or it contains the string "bonus". The bonus buffer (a field in the dnode_phys_t) contains attributes (ownership, timestamps, permissions, etc.) of an object. Note that entries marked "bonus" have a NULL value for the arc_buf_t * field in the dmu_buf_impl_t. The bonus buffer is in the ARC, but is there as part of the dnode_phys_t for the object. The bonus DMU buffers are copies of the data from the corresponding dnode_phys_t. And the dnode_phys_t that contains the bonus buffer is also in the DMU cache (and ARC).

The "holds" value says how many things are currently using the DMU buffer. The buffer cannot be freed while the hold count is non-zero.

Let's look at the output of "::dbufs -o 7f8 -n testpool | ::dbuf".

        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00cf43d9f8      7f8 0         1  0 testpool
ffffff00ce7a31b0      7f8 1         0  2 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool

For object id 0x7f8, there are 4 dbufs. The first is for the first block (128k bytes), and the second is for the second block in the file. The third one is for a level 1 indirect block. It contains block pointers that point to the blocks described by the first two entries. The last one is for the bonus buffer for the file. The dnode_phys_t describing the file is in an "mdn" dbuf.
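To make that reading of the table concrete, here is a small sketch that parses the ::dbuf summary lines and labels each entry the way the paragraph above does. The parser and its names are mine, not part of mdb, and treating the lvl and holds columns as hexadecimal is an assumption (for the small values shown here it makes no difference).

```python
from typing import NamedTuple

# Output copied from the mdb session in this post.
SAMPLE = """\
        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00cf43d9f8      7f8 0         1  0 testpool
ffffff00ce7a31b0      7f8 1         0  2 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool
"""


class Dbuf(NamedTuple):
    addr: int
    obj: str      # hex object id, or "mdn" for the meta dnode
    level: int
    blkid: str    # hex block id, or "bonus"
    holds: int
    objset: str


def parse_dbufs(text: str) -> list:
    """Turn ::dbuf summary lines into Dbuf records, skipping the header."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] == "addr":
            continue
        addr, obj, lvl, blkid, holds, objset = fields
        # Assumption: numeric columns are printed in hex.
        rows.append(Dbuf(int(addr, 16), obj, int(lvl, 16), blkid,
                         int(holds, 16), objset))
    return rows


def describe(d: Dbuf) -> str:
    """Label a dbuf as a data block, an indirect block, or a bonus buffer."""
    if d.blkid == "bonus":
        return "bonus buffer"
    if d.level > 0:
        return f"level {d.level} indirect block {d.blkid}"
    return f"data block {d.blkid}"


for row in parse_dbufs(SAMPLE):
    print(f"{row.addr:x}: {describe(row)}")
```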

At this point, we get more information about the first dbuf.

> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf db_dnode_handle
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0
    uint64_t db.db_size = 0x20000
    void *db.db_data = 0xffffff0043402000
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
struct dnode_handle *db_dnode_handle = 0xffffff00ccf520c8

The db member describes the buffer. The db.db_data field is the address of where the buffer starts in memory. Going to that address shows the first 128k of data for the words file.

> ffffff0043402000,20000/c
                1st   <-- this is the beginning of the "words" file
                2nd
                3rd
                ...

The arc_buf_t contains a pointer to the arc_buf_hdr_t for the buffer, which in turn shows that the buffer is in the ARC_mfu cache. Note that the address of the buffer in the arc_buf_t (b_data) matches the db_data field in the dmu_buf_impl_t. The b_private field in the arc_buf_t is a pointer back to the dmu_buf_impl_t.
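The pointer chain can be modelled in a few lines. This is only a toy: the three dictionaries stand in for kernel memory, and the function name is mine. The addresses and field values are the ones printed by mdb earlier in this post.

```python
# Toy "memory": address -> fields, using values from the mdb output above.
DBUFS = {
    0xffffff00cf433e80: {"db_data": 0xffffff0043402000,
                         "db_buf": 0xffffff00ccdb0ee0},
}
ARC_BUFS = {
    0xffffff00ccdb0ee0: {"b_hdr": 0xffffff00cfb0c708,
                         "b_data": 0xffffff0043402000,
                         "b_private": 0xffffff00cf433e80},
}
ARC_HDRS = {
    0xffffff00cfb0c708: {"b_buf": 0xffffff00ccdb0ee0,
                         "b_size": 0x20000,
                         "b_state": "ARC_mfu"},
}


def arc_state_of(dbuf_addr: int):
    """Follow dbuf -> arc_buf_t -> arc_buf_hdr_t; return (state, size).

    Returns None when db_buf is NULL (as it is for bonus buffers).
    """
    dbuf = DBUFS[dbuf_addr]
    if dbuf["db_buf"] == 0:
        return None
    buf = ARC_BUFS[dbuf["db_buf"]]
    # The cross-references described in the text must agree.
    assert buf["b_data"] == dbuf["db_data"]
    assert buf["b_private"] == dbuf_addr
    hdr = ARC_HDRS[buf["b_hdr"]]
    return hdr["b_state"], hdr["b_size"]


print(arc_state_of(0xffffff00cf433e80))
```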

Now let's look at the dnode_t for the file.

> ffffff00cf433e80::print -t dmu_buf_impl_t db_dnode_handle | ::print -t dnode_handle_t dnh_dnode | ::print -t dnode_t
dnode_t {
...
    list_node_t dn_link = {   <-- linked list of all dnodes on system
    ...
    }
    struct objset *dn_objset = 0xffffff00cd063040
    uint64_t dn_object = 0x7f8
    struct dmu_buf_impl *dn_dbuf = 0xffffff00ce7ae268  <-- dbuf for dnode_phys_t
    struct dnode_handle *dn_handle = 0xffffff00ccf520c8
    dnode_phys_t *dn_phys = 0xffffff00cf8e8000
    dmu_object_type_t dn_type = 0t19 (DMU_OT_PLAIN_FILE_CONTENTS)
    uint16_t dn_bonuslen = 0xa8
    uint8_t dn_bonustype = 0x2c
    ...
    uint32_t dn_dbufs_count = 0x4
    ...
    refcount_t dn_holds = {
        uint64_t rc_count = 0x4
    }
    ...
    list_t dn_dbufs = {  <-- dbufs with this dnode_t
    ...
    }
    struct dmu_buf_impl *dn_bonus = 0xffffff00ce7b8860
    ...
}

The dnode_t contains a pointer to a dmu_buf_impl_t (dn_dbuf). Let's look at this:

> ffffff00ce7ae268::dbuf
        addr object lvl blkid holds os
ffffff00ce7ae268      mdn 0        3f  1 testpool

So, the dbuf that contains the dnode_phys_t for the words file is a meta dnode object, at indirect level 0 and at block id 0x3f. Let's take a closer look.

> ffffff00ce7ae268::print -t dmu_buf_impl_t
dmu_buf_impl_t {
    dmu_buf_t db = {
        uint64_t db_object = 0
        uint64_t db_offset = 0xfc000
        uint64_t db_size = 0x4000
        void *db_data = 0xffffff00cf8e5000
    }
    struct objset *db_objset = 0xffffff00cd063040
    struct dnode_handle *db_dnode_handle = 0xffffff00cd063060
    struct dmu_buf_impl *db_parent = 0xffffff00ce795e48
    struct dmu_buf_impl *db_hash_next = 0
    uint64_t db_blkid = 0x3f
    blkptr_t *db_blkptr = 0xffffff00cf8f2f80
    ...
    dbuf_states_t db_state = 4 (DB_CACHED)
    refcount_t db_holds = {
        uint64_t rc_count = 0x1
    }
    arc_buf_t *db_buf = 0xffffff00ccdb0d30
    ...
    void *db_user_ptr = 0xffffff00ccf51e80
    ...
}

This dbuf is for a 16k (0x4000) byte block at offset 0xfc000. Note that the blkid (0x3f) times the block size (0x4000) gives the offset of 0xfc000. This is a block of dnode_phys_t structures.

> ::sizeof dnode_phys_t
sizeof (dnode_phys_t) = 0x200
> 4000%200=K  <-- block size is 0x4000, dnode_phys_t size is 0x200
                20              <-- 32 dnode_phys_t / block
> 7f8%20=K  <-- 0x7f8 is the object id for words
                3f  <-- matches the db_blkid
> 3f*20=K   <-- where does block containing 7f8 begin?
                7e0
> 7f8-7e0=K  <-- get offset from beginning of block
                18
> ffffff00cf8e5000+(18*200)=K  <-- get address of dnode_phys_t for words file
                ffffff00cf8e8000 <-- matches dn_phys in dnode_t above
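The same arithmetic, written out as a short sketch. The constants are the values mdb reported above (0x200 bytes per dnode_phys_t, 0x4000 bytes per meta dnode block); the function name is mine, and the sketch assumes every dnode occupies exactly one 0x200-byte slot, as it does in this session.

```python
DNODE_SIZE = 0x200          # ::sizeof dnode_phys_t
DNODE_BLOCK_SIZE = 0x4000   # db_size of the "mdn" dbuf
DNODES_PER_BLOCK = DNODE_BLOCK_SIZE // DNODE_SIZE   # 0x20


def locate_dnode(object_id: int, db_data: int):
    """Return (blkid, byte offset of the block, address of the dnode_phys_t).

    db_data is the in-memory address of the meta dnode block that holds
    the object, i.e. the db_data field of the "mdn" dbuf.
    """
    blkid, index = divmod(object_id, DNODES_PER_BLOCK)
    block_offset = blkid * DNODE_BLOCK_SIZE
    return blkid, block_offset, db_data + index * DNODE_SIZE


blkid, offset, addr = locate_dnode(0x7f8, 0xffffff00cf8e5000)
print(f"blkid={blkid:x} offset={offset:x} dnode_phys_t at {addr:x}")
```

The three values agree with db_blkid, db_offset, and dn_phys in the output shown earlier.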

If the ARC buffer is evicted, a callback (dbuf_do_evict()) will clean up the dmu_buf_impl_t. See the comment before dbuf_clear() in uts/common/fs/zfs/dbuf.c for some details. Here is the same ::dbufs command run as before, but after some ARC/dbuf evictions.

> ::dbufs -o 7f8 -n testpool | ::dbuf
        addr object lvl blkid holds os
ffffff00cf433e80      7f8 0         0  0 testpool
ffffff00ce7a31b0      7f8 1         0  1 testpool
ffffff00ce7b8860      7f8 0     bonus  1 testpool

So, one of the buffers (the one containing the second block of the file) is no longer cached.

An interesting question to ask may be: for a given file, how much of the file data/metadata is in the ARC?

To answer this, we'll use a file I have been intermittently looking at over time.

# ls -i /var/tmp/foo.out
      1337 -rw-r--r--   1 root     root     13328871 Jan  6 09:56 /var/tmp/foo.out
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> ::dbufs -o 0t1337 | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
6832128  <-- decimal, about 6.8MB

Maybe more interesting is how much ARC space a given dataset or volume is using. The following shows total space in the ARC used by the testpool dataset.

> ::dbufs -n testpool | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
411648
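The sed/awk stage at the end of these pipelines just strips the "uint64_t b_size = 0t" prefix and adds up the decimal values. For anyone who prefers to save the mdb output and post-process it, a rough Python equivalent follows. The sample lines are illustrative values in the format that "::print -d -t arc_buf_hdr_t b_size" emits, not output from a real system.

```python
import re

# Matches the decimal value in lines such as "uint64_t b_size = 0t131072".
B_SIZE = re.compile(r"b_size = 0t(\d+)")


def sum_b_size(lines) -> int:
    """Add up every b_size value found in the given mdb output lines."""
    total = 0
    for line in lines:
        match = B_SIZE.search(line)
        if match:
            total += int(match.group(1))
    return total


# Illustrative input only: two 128k buffers and one 16k buffer.
sample = [
    "uint64_t b_size = 0t131072",
    "uint64_t b_size = 0t131072",
    "uint64_t b_size = 0t16384",
]
print(sum_b_size(sample))
```

Lines that do not contain a b_size value are ignored, so stray prompts or blank lines in a saved transcript do no harm.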

Volumes are used in the Joyent Public Cloud (JPC) for kvm-based virtual machines. Let's get the space used by a Fedora instance. First, the list of virtual machines.

# vmadm list
UUID                                  TYPE  RAM      STATE             ALIAS
8d02acc6-a9cc-e033-f196-aa6841702872  OS    768      running           vm1
9d3ccd6c-511b-60c5-db07-d742941bb62b  KVM   1024     running           ubuntu-1
a5528066-171a-694f-85ef-cac9928c9fd3  OS    2048     running           vm1
dd6fb539-cfac-c84a-f336-d1232a6f673e  OS    2048     running           -
62293978-3947-eb5a-dcdf-a6b4728b39bf  KVM   8192     running           maxfedora
ed3f45b1-833a-438d-8214-3876a58d9371  OS    8192     running           moray1

Volumes are more difficult, as we cannot use the volume name with the "-n" option of "::dbufs". The following command line walks through the set of all dnode_t structures on the system. For each one, it gets the type of the dnode_t, looking for a type value of 0x17 (DMU_OT_ZVOL, i.e., a volume). For each dnode_t of that type, it prints out the name of the volume.

> ::walk dnode_t d | ::print dnode_t dn_phys | ::print      dnode_phys_t dn_type | ::grep ".==17" | ::eval '

This is a long one-liner. Briefly, it walks the list of dnode_t structures in memory. For each dnode_t, the walker stores the address of the dnode_t in an mdb variable ("d"). For each dnode_t, it prints the dn_phys value (dn_phys is the address of a dnode_phys_t, which is an in-memory copy of the same data structure that is on disk). The dn_type field refers to the "type" of the dnode_phys_t. If the type is 0x17, the dnode_t (and corresponding dnode_phys_t) is for a ZFS volume. The "::eval" gets the value of the "d" variable (the dnode_t with type equal to 0x17) and prints the dn_objset for that dnode. At the end, this one-liner will list the names of all of the ZFS volumes currently on the system.
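The logic of that walk is a plain filter. As a sketch only, with made-up dnode records standing in for what the walker visits (the addresses and object set names below are hypothetical, not from a real system):

```python
DMU_OT_PLAIN_FILE_CONTENTS = 19   # 0t19, as shown for the words file
DMU_OT_ZVOL = 0x17                # the type value the one-liner greps for

# Hypothetical records; on a live system these come from ::walk dnode_t.
dnodes = [
    {"addr": 0x1000, "dn_type": DMU_OT_PLAIN_FILE_CONTENTS,
     "objset": "testpool"},
    {"addr": 0x2000, "dn_type": DMU_OT_ZVOL,
     "objset": "zones/example-vm-disk0"},
    {"addr": 0x3000, "dn_type": DMU_OT_ZVOL,
     "objset": "zones/example-vm-disk1"},
]


def zvol_dnodes(records):
    """Keep only dnodes whose type marks them as ZFS volumes."""
    return [(d["addr"], d["objset"]) for d in records
            if d["dn_type"] == DMU_OT_ZVOL]


for addr, name in zvol_dnodes(dnodes):
    print(f"{addr:x} {name}")
```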

To find the amount of ARC space consumed by the "maxfedora" virtual machine (uuid = 62293978-3947-eb5a-dcdf-a6b4728b39bf-disk1), we can find all dbufs whose dnode handle takes us to the dnode_t for the volume. Using the above command, we want the first dnode_t.

> ::walk dnode_t d | ::print dnode_t dn_phys | ::print      dnode_phys_t dn_type | ::grep ".==17" | ::eval '

The dnode_t contains a list of all dbufs that are used for it. We'll walk the list of dbufs, and for each one that has a non-NULL arc buf pointer, we'll get the size from the arc buf header and add them up. To walk the list of dbufs in the dnode_t, we need to know the address of the list.

> ::offsetof dnode_t dn_dbufs
offsetof (dnode_t, dn_dbufs) = 0x248, sizeof (...->dn_dbufs) = 0x20

Adding the offset of the dn_dbufs member to the address of the dnode_t for the "62293978-3947-eb5a-dcdf-a6b4728b39bf-disk0" volume, we'll walk the list of dbufs for the volume. This volume is the system disk for the "maxfedora" image.

> ffffff0dc5845710 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
216375296  <-- number of bytes mapped in ARC for the volume

And here is the data disk (/data) in the Fedora instance.

> ffffff11c639ebe0 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
743317504

As mentioned earlier, not all of what is in the ARC is mapped by dbufs. First of all, not all dbufs refer to arc buffers (the db_buf field in the dmu_buf_impl_t can be NULL). All but 2 of these dbufs exist to give quick access to bonus buffers. The bonus buffers are in the ARC, but as part of the dnode_phys_t which contains them. There are also many arc buffers that do not have a pointer back to a dbuf (b_private in the arc_buf_t is NULL). I have looked at some of these arc buffers and have found a few different types of metadata, but also some buffers which contain data. One possible reason for this is prefetch.

Here is a way to see all of ARC that is mapped by dbufs.

> ::walk dmu_buf_impl_t d| ::print dmu_buf_impl_t db_buf |::grep ".!=0" | ::eval "

And here is a way to see space in ARC.

> ::walk arc_buf_t | ::print -t arc_buf_t b_hdr | ::print -t -d arc_buf_hdr_t b_size !sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
1236963840

Note that this number matches closely with the size shown by:

> ::arc !grep size
size                      =      1200 MB
...

Determining the cause of the difference (1181588480 for dbufs, and 1236963840 for arc buffers) is left as an exercise for the reader.
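To put a size on that gap, here is the arithmetic as a short sketch. The two byte counts are the ones quoted above; the variable and function names are mine.

```python
DBUF_MAPPED_BYTES = 1181588480   # ARC space reachable through dbufs
ARC_BUF_BYTES = 1236963840       # ARC space summed over all arc_buf_t
MIB = 1 << 20


def mib(nbytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return nbytes / MIB


unmapped = ARC_BUF_BYTES - DBUF_MAPPED_BYTES
mapped_pct = 100 * DBUF_MAPPED_BYTES / ARC_BUF_BYTES

print(f"arc buffers:     {mib(ARC_BUF_BYTES):.1f} MiB")
print(f"mapped by dbufs: {mib(DBUF_MAPPED_BYTES):.1f} MiB ({mapped_pct:.1f}%)")
print(f"not mapped:      {mib(unmapped):.1f} MiB")
```

So the dbufs account for most, but not all, of the bytes found by walking the arc buffers directly.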

All of this is very interesting, but also quite a few steps. It would be nice to have an lsarc command that lists what files are in the ARC, how much data/metadata is in the ARC for a given file/dataset/volume, a breakdown between data and metadata, and even which ARC cache the data is on (MRU or MFU). Once you understand that the dbufs provide a map of (most of) the ARC, this command becomes possible.
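As a very rough sketch of what the reporting side of such an lsarc might do, assume something has already walked the dbufs and produced one record per cached buffer. Everything here is hypothetical: the record layout, the function name, the sample sizes, and the shortcut of counting level 0 buffers of ordinary objects as data and everything else as metadata (which is wrong for directories, for instance).

```python
from collections import defaultdict


def summarize(records):
    """Aggregate cached bytes per (objset, object).

    records: iterable of (objset, object_id, level, size, state) tuples,
    a made-up layout for this sketch.
    """
    usage = defaultdict(
        lambda: {"data": 0, "metadata": 0, "by_state": defaultdict(int)})
    for objset, obj, level, size, state in records:
        entry = usage[(objset, obj)]
        # Simplification: indirect blocks and meta dnode blocks are metadata.
        kind = "metadata" if level > 0 or obj == "mdn" else "data"
        entry[kind] += size
        entry["by_state"][state] += size
    return usage


# Hypothetical input shaped like the words file example in this post.
records = [
    ("testpool", "7f8", 0, 0x20000, "ARC_mfu"),
    ("testpool", "7f8", 0, 0x20000, "ARC_mru"),
    ("testpool", "7f8", 1, 0x4000, "ARC_mfu"),
    ("testpool", "mdn", 0, 0x4000, "ARC_mfu"),
]

for (objset, obj), entry in summarize(records).items():
    print(objset, obj, entry["data"], entry["metadata"],
          dict(entry["by_state"]))
```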



Post written by Mr. Max Bruning