What's in the ARC?
This post assumes some knowledge of ZFS internals. You can obtain this knowledge by reading the source code, and/or by taking a ZFS Internals course. The post also makes extensive use of mdb(1) (http://www.illumos.org/man/1/mdb). A full description of mdb is well beyond the scope of this blog post.
When teaching the ZFS Internals course, I often give students the following lab:

"For an application that reads data from a file, find the data in the ARC."
The ARC (Adjustable Replacement Cache) is an in-memory cache of recently and/or frequently accessed data and metadata from disk. ZFS file system (and volume) data and metadata are read and written via the ARC. A good description can be found in the source code at usr/src/uts/common/fs/zfs/arc.c.

A more general (and possibly more useful) question is to identify how much of the ARC a given file, file system, or volume is using.
To do the lab, we'll set up a simple ZFS pool using a file, then we'll put some (known) data into a file in the pool. Then we'll run a program to read the data, then we'll look for the data in the ARC. For this lab, you'll need a system running SmartOS (illumos, OpenIndiana, and probably Solaris 10 and 11 variants should also work). The system should not be "busy". If there is a lot of file system activity, the data for the file may not stay cached for very long.
Here are the first steps:
# mkfile 100m /var/tmp/zfsfile           <-- create a file to be used for the pool
# zpool create testpool /var/tmp/zfsfile
# cp /usr/dict/words /testpool/words     <-- our file with known data
# zpool export testpool
# zpool import -d /var/tmp testpool
We export the pool to clear the ARC of any data left from the cp(1).
Now we'll read in the words file and find it in the ARC (or not, if the system is very busy). First we'll just go through the steps, then we'll go through some explanation.
# dd if=/testpool/words of=/dev/null bs=128k
1+1 records in
1+1 records out
#
# ls -i /testpool/words
2040 /testpool/words
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> 0t2040=K                <-- convert inumber (object id) to hex
                7f8
>
> ::dbufs -o 7f8 -n testpool | ::dbuf
        addr         object  lvl  blkid  holds  os
ffffff00cf433e80     7f8     0    0      0      testpool
ffffff00cf43d9f8     7f8     0    1      0      testpool
ffffff00ce7a31b0     7f8     1    0      2      testpool
ffffff00ce7b8860     7f8     0    bonus  1      testpool
>
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0              <-- beginning of file
    uint64_t db.db_size = 0x20000          <-- 128k
    void *db.db_data = 0xffffff0043402000  <-- location of arc data buffer
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
>
> ffffff0043402000,10/c
1st        <-- this is the beginning of the "words" file
2nd
3rd
> ffffff00ccdb0ee0::print -t arc_buf_t
arc_buf_t {
    arc_buf_hdr_t *b_hdr = 0xffffff00cfb0c708
    arc_buf_t *b_next = 0
    kmutex_t b_evict_lock = {
        void *[1] _opaque = [ 0 ]
    }
    void *b_data = 0xffffff0043402000
    arc_evict_func_t *b_efunc = dbuf_do_evict
    void *b_private = 0xffffff00cf433e80
}
> ffffff00cfb0c708::print -t arc_buf_hdr_t
arc_buf_hdr_t {
    dva_t b_dva = {
        uint64_t [2] dva_word = [ 0x100, 0x24400 ]
    }
    uint64_t b_birth = 0x1e596
    uint64_t b_cksum0 = 0x2f6c9bcce37c
    ...                                    <-- output omitted
    arc_buf_hdr_t *b_hash_next = 0xffffff00c82e8a90
    arc_buf_t *b_buf = 0xffffff00ccdb0ee0
    ...
    uint64_t b_size = 0x20000
    uint64_t b_spa = 0x1fc28bd029207b7b
    arc_state_t *b_state = ARC_mfu         <-- buffer is in MFU list
    ...
}
>
The data structures used to maintain the ARC are arc_buf_hdr_t and arc_buf_t. These data structures are used to determine if a buffer is in the ARC, and, if so, where (mru, mfu, mru ghost, mfu ghost, l2arc). (The ghost lists are used to determine when the mru or mfu cache is too small.) But they do not identify what object the data/metadata belongs to. For this, the dmu_buf_impl_t structure (hereafter referred to as a "dbuf" structure) can be used. Note that not everything in the ARC is mapped by dbufs.
The following describes the data structures used by the DMU to manage data and metadata in the ARC.
[Diagram omitted. In summary: DBUF_HASH(objset, objid, level, blkid) indexes into dbuf_hash_table.hash_table (of size hash_table_mask+1), whose entries are hash chains of dmu_buf_impl_t structures. Each dmu_buf_impl_t points to its dnode_handle_t and dnode_t (and through the dnode to the dnode_phys_t in metadata), to the data/metadata buffer (or bonus buffer), and to an arc_buf_t (NULL for bonus buffers). On the ARC side, buf_hash(spa, dva, birth) indexes into buf_hash_table.ht_table (of size ht_mask+1), whose entries are hash chains of arc_buf_hdr_t structures.]
The following describes the mdb commands that were used to find the data.
> ::dbufs -o 7f8 -n testpool | ::dbuf

The ::dbufs dcmd walks the cache of allocated dmu_buf_impl_t structures. The "-o 7f8" only displays entries with object id 0x7f8, the "inumber" of the words file, and the "-n testpool" only shows those entries in the testpool object set. The ::dbuf dcmd displays a summary of the dmu_buf_impl_t.
The output of the above command shows the address of the dmu_buf_impl_t, the object id, the level of indirection, the block id, the number of holds on the object, and the object set name. ZFS can use up to 6 levels of indirect blocks.
The object id will either be a number (for instance, 0x7f8), or "mdn" (meta dnode), which is used for objset_phys_t structures that are in memory. The objset_phys_t data structure contains information about the meta object set (the MOS), which describes the root of a pool, child datasets, clones, snapshots, the dedup table, volumes, and the space map for a pool, among other things. There are also objset_phys_t structures for each dataset, clone, volume, child dataset, and snapshot, which locate the objects (files, directories) within the object set.
The block id identifies which block in the object is referenced by the dmu_buf_impl_t, or the block id contains the string "bonus". The bonus buffer (a field in the dnode_phys_t) contains attributes (ownership, timestamps, permissions, etc.) of an object. Note that entries marked "bonus" have a NULL value for the arc_buf_t * field in the dmu_buf_impl_t. The bonus buffer is in the ARC, but it is there as part of the dnode_phys_t for the object. The bonus DMU buffers are copies of the data from the corresponding dnode_phys_t. And the dnode_phys_t that contains the bonus buffer is also in the DMU cache (and ARC).
The "holds" value says how many things are currently using the DMU buffer. The buffer cannot be freed while the hold count is non-zero.
Let's look at the output of "::dbufs -o 7f8 -n testpool | ::dbuf".

        addr         object  lvl  blkid  holds  os
ffffff00cf433e80     7f8     0    0      0      testpool
ffffff00cf43d9f8     7f8     0    1      0      testpool
ffffff00ce7a31b0     7f8     1    0      2      testpool
ffffff00ce7b8860     7f8     0    bonus  1      testpool
For object id 0x7f8, there are 4 dbufs. The first is for the first block (128k bytes), and the second is for the second block in the file. The third one is for a level 1 indirect block; it contains the block pointers for the blocks described by the first two entries. The last one is for the bonus buffer for the file. The dnode_phys_t describing the file is in an "mdn" dbuf.
At this point, we get more information about the first dbuf.
> ffffff00cf433e80::print -t dmu_buf_impl_t db db_buf db_dnode_handle
dmu_buf_t db = {
    uint64_t db.db_object = 0x7f8
    uint64_t db.db_offset = 0
    uint64_t db.db_size = 0x20000
    void *db.db_data = 0xffffff0043402000
}
arc_buf_t *db_buf = 0xffffff00ccdb0ee0
struct dnode_handle *db_dnode_handle = 0xffffff00ccf520c8
>
The db member describes the buffer. The db.db_data field is the address of where the buffer starts in memory. Going to that address shows the first 128k of data for the words file.
> ffffff0043402000,20000/c
1st        <-- this is the beginning of the "words" file
2nd
3rd
...
>
The arc_buf_t contains a pointer to the arc_buf_hdr_t for the buffer, which in turn shows that the buffer is in the ARC_mfu cache. Note that the address of the buffer in the arc_buf_t (b_data) matches the db_data field in the dmu_buf_impl_t. The b_private field in the arc_buf_t is a pointer back to the dmu_buf_impl_t.
Now let's look at the dnode_t for the file.
> ffffff00cf433e80::print -t dmu_buf_impl_t db_dnode_handle | ::print -t dnode_handle_t dnh_dnode | ::print -t dnode_t
dnode_t {
    ...
    list_node_t dn_link = {        <-- linked list of all dnodes on system
        ...
    }
    struct objset *dn_objset = 0xffffff00cd063040
    uint64_t dn_object = 0x7f8
    struct dmu_buf_impl *dn_dbuf = 0xffffff00ce7ae268    <-- dbuf for dnode_phys_t
    struct dnode_handle *dn_handle = 0xffffff00ccf520c8
    dnode_phys_t *dn_phys = 0xffffff00cf8e8000
    dmu_object_type_t dn_type = 0t19 (DMU_OT_PLAIN_FILE_CONTENTS)
    uint16_t dn_bonuslen = 0xa8
    uint8_t dn_bonustype = 0x2c
    ...
    uint32_t dn_dbufs_count = 0x4
    ...
    refcount_t dn_holds = {
        uint64_t rc_count = 0x4
    }
    ...
    list_t dn_dbufs = {            <-- dbufs with this dnode_t
        ...
    }
    struct dmu_buf_impl *dn_bonus = 0xffffff00ce7b8860
    ...
}
The dnode_t contains a pointer to a dmu_buf_impl_t. Let's look at this:
> ffffff00ce7ae268::dbuf
        addr         object  lvl  blkid  holds  os
ffffff00ce7ae268     mdn     0    3f     1      testpool
>
So, the dbuf that contains the dnode_phys_t for the words file is a meta dnode object, at indirect level 0 and block id 0x3f. Let's take a closer look.
> ffffff00ce7ae268::print -t dmu_buf_impl_t
dmu_buf_impl_t {
    dmu_buf_t db = {
        uint64_t db_object = 0
        uint64_t db_offset = 0xfc000
        uint64_t db_size = 0x4000
        void *db_data = 0xffffff00cf8e5000
    }
    struct objset *db_objset = 0xffffff00cd063040
    struct dnode_handle *db_dnode_handle = 0xffffff00cd063060
    struct dmu_buf_impl *db_parent = 0xffffff00ce795e48
    struct dmu_buf_impl *db_hash_next = 0
    uint64_t db_blkid = 0x3f
    blkptr_t *db_blkptr = 0xffffff00cf8f2f80
    ...
    dbuf_states_t db_state = 4 (DB_CACHED)
    refcount_t db_holds = {
        uint64_t rc_count = 0x1
    }
    arc_buf_t *db_buf = 0xffffff00ccdb0d30
    ...
    void *db_user_ptr = 0xffffff00ccf51e80
    ...
}
This dbuf is for a 16k (0x4000) byte block at offset 0xfc000. Note that the blkid (0x3f) times the block size (0x4000) gives the offset of 0xfc000. This is a block of dnode_phys_t structures.
> ::sizeof dnode_phys_t
sizeof (dnode_phys_t) = 0x200
>
> 4000%200=K        <-- block size is 0x4000, dnode_phys_t size is 0x200
                20  <-- 32 dnode_phys_t per block
>
> 7f8%20=K          <-- 0x7f8 is the object id for words
                3f  <-- matches the db_blkid
> 3f*20=K           <-- where does the block containing 7f8 begin?
                7e0
> 7f8-7e0=K         <-- get offset from beginning of block
                18
> ffffff00cf8e5000+(18*200)=K    <-- get address of dnode_phys_t for words file
                ffffff00cf8e8000 <-- matches dn_phys in dnode_t above
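The same arithmetic can be checked outside of mdb. This is a minimal sketch in plain shell, using the values from the session above (object id 0x7f8, 0x4000-byte metadata block, 0x200-byte dnode_phys_t); it only reproduces the calculation, not the kernel lookup itself.

```shell
#!/bin/sh
# Locate the dnode_phys_t for object 0x7f8 inside a 16k metadata
# block made up of 512-byte dnode_phys_t entries.

objid=$((0x7f8))    # object id ("inumber") of the words file
blksz=$((0x4000))   # metadata block size (16k)
dnsz=$((0x200))     # sizeof (dnode_phys_t)

per_block=$((blksz / dnsz))    # dnode_phys_t entries per block (0x20 = 32)
blkid=$((objid / per_block))   # which mdn block holds object 0x7f8
slot=$((objid % per_block))    # index of the entry within that block

printf 'per_block=0x%x blkid=0x%x slot=0x%x byte_offset=0x%x\n' \
    "$per_block" "$blkid" "$slot" $((slot * dnsz))
```

This prints blkid=0x3f and a byte offset of 0x3000, matching the session: 0xffffff00cf8e5000 + 0x3000 = 0xffffff00cf8e8000, the dn_phys value seen earlier.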
If the ARC buffer is evicted, a callback (dbuf_do_evict()) will clean up the dmu_buf_impl_t. See the comment before dbuf_clear() in uts/common/fs/zfs/dbuf.c for some details. Here is the same ::dbufs command run as before, but after some ARC/dbuf evictions.
> ::dbufs -o 7f8 -n testpool | ::dbuf
        addr         object  lvl  blkid  holds  os
ffffff00cf433e80     7f8     0    0      0      testpool
ffffff00ce7a31b0     7f8     1    0      1      testpool
ffffff00ce7b8860     7f8     0    bonus  1      testpool
So, one of the buffers (the one containing the second block of the file) is no longer cached.
An interesting question to ask may be: for a given file, how much of the file data/metadata is in the ARC?

To answer this, we'll use a file I have been intermittently looking at over time.
# ls -i /var/tmp/foo.out
1337 -rw-r--r--   1 root     root     13328871 Jan  6 09:56 /var/tmp/foo.out
#
# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm sppp nfs ]
> ::dbufs -o 0t1337 | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
6832128        <-- decimal, about 6.8MB
>
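Everything after the "!" in that pipeline runs in an ordinary shell, so the sed/awk tail can be tried on its own. The b_size lines below are made-up sample values for illustration, not output from a real system:

```shell
#!/bin/sh
# Sum a stream of "uint64_t b_size = 0t<decimal>" lines, as produced by
# "::print -d -t arc_buf_hdr_t b_size". The three sizes are sample values.
printf '%s\n' \
    'uint64_t b_size = 0t131072' \
    'uint64_t b_size = 0t131072' \
    'uint64_t b_size = 0t16384' |
sed -e 's/uint64_t b_size = 0t//' |    # strip everything but the decimal size
awk '{sum += $1} END {print sum}'      # add the sizes and print the total
```

The sed stage removes the "uint64_t b_size = 0t" prefix (mdb prints decimal values with a "0t" radix prefix), leaving one number per line for awk to total.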
Maybe more interesting is how much ARC space a given dataset or volume is using. The following shows the total space in the ARC used by the testpool dataset.
> ::dbufs -n testpool | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
411648
>
Volumes are used in the Joyent Public Cloud (JPC) for kvm-based virtual machines. Let's get the space used by a Fedora instance. First, the list of virtual machines.

# vmadm list
UUID                                  TYPE  RAM    STATE    ALIAS
8d02acc6-a9cc-e033-f196-aa6841702872  OS    768    running  vm1
9d3ccd6c-511b-60c5-db07-d742941bb62b  KVM   1024   running  ubuntu-1
a5528066-171a-694f-85ef-cac9928c9fd3  OS    2048   running  vm1
dd6fb539-cfac-c84a-f336-d1232a6f673e  OS    2048   running  -
62293978-3947-eb5a-dcdf-a6b4728b39bf  KVM   8192   running  maxfedora
ed3f45b1-833a-438d-8214-3876a58d9371  OS    8192   running  moray1
Volumes are more difficult, as we cannot use the volume name with the "-n" option to ::dbufs. The following command line walks the set of all dnode_t structures on the system. For each one, it gets the type of the dnode_t, looking for a type value of 0x17 (DMU_OT_ZVOL, i.e., a volume). For each dnode_t of that type, it prints out the name of the volume.
> ::walk dnode_t d | ::print dnode_t dn_phys | ::print dnode_phys_t dn_type | ::grep ".==17" | ::eval '
This is a long one-liner. Briefly, it walks the list of dnode_t structures in memory. For each dnode_t, the walker stores the address of the dnode_t in an mdb variable ("d"). For each dnode_t, it prints the dn_phys value (dn_phys is the address of a dnode_phys_t, which is an in-memory copy of the same data structure that is on disk). The dn_type field refers to the "type" of the dnode_phys_t. If the type is 0x17, the dnode_t (and corresponding dnode_phys_t) is for a ZFS volume. The "::eval" gets the value of the "d" variable (the dnode_t with type equal to 0x17) and prints the dn_objset for that dnode. At the end, this one-liner will list the names of all of the ZFS volumes currently on the system.
To find the amount of ARC space consumed by the "maxfedora" virtual machine (uuid = 62293978-3947-eb5a-dcdf-a6b4728b39bf-disk1), we can find all dbufs whose dnode handle takes us to the dnode_t for the volume. Using the above command, we want the first dnode_t.
> ::walk dnode_t d | ::print dnode_t dn_phys | ::print dnode_phys_t dn_type | ::grep ".==17" | ::eval '
The dnode_t contains a list of all the dbufs that are used for it. We'll walk that list, and for each dbuf that has a non-NULL arc buf pointer, we'll get the size from the arc buf header and add them up. To walk the list of dbufs in the dnode_t, we need to know the address of the list.
> ::offsetof dnode_t dn_dbufs
offsetof (dnode_t, dn_dbufs) = 0x248, sizeof (...->dn_dbufs) = 0x20
>
Adding the offset of the dn_dbufs member to the address of the dnode_t for the "62293978-3947-eb5a-dcdf-a6b4728b39bf-disk0" volume, we'll walk the list of dbufs for the volume. This volume is the system disk for the "maxfedora" image.
> ffffff0dc5845710 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
216375296        <-- number of bytes mapped in ARC for the volume
>
And here is the data disk (/data) in the Fedora instance.
> ffffff11c639ebe0 +248::walk list | ::print -t dmu_buf_impl_t db_buf | ::grep ".!=0" | ::print -t arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
743317504
>
As mentioned earlier, not all of what is in the ARC is mapped by dbufs. First of all, not all dbufs refer to arc buffers (the db_buf field in the dmu_buf_impl_t can be NULL). All but 2 of these instances of dbufs exist to give quick access to bonus buffers. The bonus buffers are in the ARC, but as part of the dnode_phys_t which contains them. There are also many arc buffers that do not have a pointer back to a dbuf (b_private in the arc_buf_t is NULL). I have looked at some of these arc buffers and have found a few different types of metadata, but also some buffers which contain data. One possible reason for this is prefetch.
Here is a way to see all of ARC that is mapped by dbufs.
> ::walk dmu_buf_impl_t d| ::print dmu_buf_impl_t db_buf |::grep ".!=0" | ::eval "
And here is a way to see space in ARC.
> ::walk arc_buf_t | ::print -t arc_buf_t b_hdr | ::print -t -d arc_buf_hdr_t b_size ! sed -e 's/uint64_t b_size = 0t//' | awk '{sum+=$1} END{print sum}'
1236963840
>
Note that this number matches closely with the size shown by:
> ::arc ! grep size
size = 1200 MB
...
>
Determining the cause of the difference (1181588480 for dbufs, and1236963840 for arc buffers) is left as an exercise for the reader.
All of this is very interesting, but also quite a few steps. It would be nice to have an lsarc command that lists what files are in the arc, how much data/metadata is in the arc for a given file/dataset/volume, a breakdown between data and metadata, and even which arc cache the data is on (MRU or MFU). Once you understand that the dbufs provide a map of (most of) the arc, this command becomes possible.
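As a very rough sketch of how such a tool might start, the shell function below wraps the per-dataset pipeline used earlier. The function name (lsarc) and its output format are my own invention; it assumes an illumos system where "mdb -k" can be run as root, and it only covers the "space per dataset" piece, not the MRU/MFU or data/metadata breakdown.

```shell
#!/bin/sh
# lsarc: print the dbuf-mapped ARC bytes for one dataset (hypothetical tool).
# Usage: lsarc <dataset>    e.g.  lsarc testpool
lsarc() {
    ds=$1
    # Feed mdb the same dcmd pipeline used interactively above, then
    # strip the "uint64_t b_size = 0t" prefix and total the sizes.
    echo "::dbufs -n $ds | ::print -t dmu_buf_impl_t db_buf | ::grep \".!=0\" | ::print arc_buf_t b_hdr | ::print -d -t arc_buf_hdr_t b_size" |
        mdb -k |
        sed -e 's/uint64_t b_size = 0t//' |
        awk -v ds="$ds" '{sum += $1} END {printf "%s: %d bytes in ARC\n", ds, sum}'
}
```

Running "lsarc testpool" as root would then print one line with the total dbuf-mapped ARC bytes for testpool.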
Post written by Mr. Max Bruning