ZFS Forensics - Recovering Files From a Destroyed Zpool

Back in 2008, I wrote a post about recovering a removed file on a zfs disk. This post links to a paper here, (see page 36), and a set of slides here.

Over time, I have received email from various people asking for help recovering files, pools, or datasets, or for the tools I talk about in the blog post and at the OpenSolaris Developers Conference in Prague in 2008. These tools were a modified mdb(1) and a modified zdb(1M). It is time to revisit that work.

In this post, I'll create a ZFS pool, add a file to the pool, destroy the pool, and then recover the file. To do this, I'll use a modified mdb, and a tool I wrote to uncompress ZFS compressed data/metadata (zuncompress). Since zdb does not seem to work with destroyed zpools (in fact, much of zdb does not work with pools that do not import), I will not be using it. The code for what I am using is available at mdbzfs. Please read the README file for instructions on how to set things up.

For those of you who are running ZFS on Linux, at the end of this blog post, I have a suggestion on how you might try this on your ZFS on Linux file system.

Before you try this on your own, please back up the disk(s) in question. Use the technique I am showing at your own risk. (Note that nothing I am doing should change any data in the zpool.) If you are using a file the way I do here, there is of course no need to make a backup.

First, we'll create a zfs pool using a file, then add a file to the pool, then destroy the pool.

# mkfile 100m /var/tmp/zfsfile
# zpool create testpool /var/tmp/zfsfile
# touch /testpool/foo
# cp /usr/dict/words /testpool/words
# sync
# zpool destroy testpool
#

Note that the first time I tried this, I did not do the sync. I created the pool, added the file, and destroyed the pool before zfs got around to committing the transactions to disk, resulting in the file not showing up.

The steps we'll take to get the words file back from the destroyed pool will start at the uberblock, and walk the (compressed) metadata structures on disk until we get to the file. If I (or someone else) ever get around to adding a "zfs on disk" target to mdb, this will be much simpler.

# mdb /var/tmp/zfsfile
> ::walk uberblock u | ::print zfs`uberblock_t ub_txg ! sort -r
ub_txg = 0xe
ub_txg = 0xd
ub_txg = 0xc
ub_txg = 0xb
ub_txg = 0xa
ub_txg = 0x9
ub_txg = 0x6
ub_txg = 0x5
ub_txg = 0x4
ub_txg = 0x14
ub_txg = 0x11
ub_txg = 0
ub_txg = 0
...
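Note that sort -r compares the lines as strings, which is why 0x14 and 0x11 end up below 0x4 in the listing above. As a small illustration (not part of mdbzfs), the following Python sketch picks the highest transaction group id numerically from output in the format shown; highest_txg is a hypothetical helper name.

```python
import re

def highest_txg(mdb_output: str) -> int:
    """Return the largest ub_txg found in '::print ... ub_txg' output."""
    # mdb prints 'ub_txg = 0x14' for nonzero values and 'ub_txg = 0' for zero.
    found = re.findall(r"ub_txg = (?:0x)?([0-9a-fA-F]+)", mdb_output)
    if not found:
        raise ValueError("no ub_txg lines found")
    return max(int(value, 16) for value in found)

sample = """ub_txg = 0xe
ub_txg = 0x4
ub_txg = 0x14
ub_txg = 0x11
ub_txg = 0
"""
print(hex(highest_txg(sample)))  # prints 0x14
```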

The uberblock walker is in the rawzfs.so dmod (see the source on github). And I have added the following lines to ~/.mdbrc:

::loadctf                  <-- gets kernel CTF info
::load /root/rawzfs.so     <-- or wherever you put the rawzfs.so file
::load /root/zfs.so        <-- or where you put zfs.so

The zfs.so and rawzfs.so files are built when you build mdb from my github repo. If you gmake world, you may not need to do the two loads. So, in this case, the highest transaction group id is 0x14. Note that I am making an assumption that this is the last active uberblock_t. If it doesn't work, try the next lowest id. Let's print out the uberblock_t for that transaction group id.

> ::walk uberblock u | ::print zfs`uberblock_t ub_txg | ::grep ".==14" | ::eval "

The rootbp blkptr_t in the above takes us to an objset_phys_t for the meta object set (MOS) for the pool. Let's look at that blkptr_t.

> 25028::blkptr
DVA[0]=<0:84800:200>
DVA[1]=<0:1284800:200>
DVA[2]=<0:2484800:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique triple
size=800L/200P birth=20L/20P fill=39
cksum=126da42f4f:6be7bf74635:145b828e81ab7:2a37bf50847b59
> $q
#

So, there are 3 copies of the objset_phys_t specified by the blkptr, at 0x84800, 0x1284800, and at 0x2484800 bytes into the first (and only) vdev (the leading 0 in 0:84800:200). The three copies are compressed via lzjb compression. On disk, each is 0x200 bytes large. Decompressed, the objset_phys_t is 0x800 bytes. Currently, mdb has no way to decompress the data. We'll use the new tool zuncompress to uncompress the data into a file.
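For readers who want to see what the decompression step involves, here is a minimal Python sketch of LZJB decompression. It is not the zuncompress source; it is my own transcription of the well-known LZJB scheme (a control byte whose bits select either a literal byte or a 2-byte back-reference with a 6-bit length and 10-bit offset), and the function name lzjb_decompress is hypothetical.

```python
NBBY = 8                                   # bits per byte
MATCH_BITS = 6                             # bits used for the match length
MATCH_MIN = 3                              # shortest match that gets encoded
OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1 # low 10 bits hold the back offset

def lzjb_decompress(src: bytes, d_len: int) -> bytes:
    """Decompress LZJB data until d_len bytes have been produced."""
    dst = bytearray()
    pos = 0
    copymap = 0
    copymask = 1 << (NBBY - 1)
    while len(dst) < d_len:
        copymask <<= 1
        if copymask == (1 << NBBY):        # control byte used up, fetch the next
            copymask = 1
            copymap = src[pos]
            pos += 1
        if copymap & copymask:             # back-reference into earlier output
            mlen = (src[pos] >> (NBBY - MATCH_BITS)) + MATCH_MIN
            offset = ((src[pos] << NBBY) | src[pos + 1]) & OFFSET_MASK
            pos += 2
            start = len(dst) - offset
            if offset == 0 or start < 0:
                raise ValueError("corrupt LZJB stream")
            for i in range(mlen):          # byte by byte: copies may overlap
                if len(dst) >= d_len:
                    break
                dst.append(dst[start + i])
        else:                              # literal byte
            dst.append(src[pos])
            pos += 1
    return bytes(dst)

# Hand-built stream: three literals, then a 3-byte match at offset 3.
print(lzjb_decompress(bytes([0x08, 0x61, 0x62, 0x63, 0x00, 0x03]), 6))  # prints b'abcabc'
```

To apply something like this to the pool file, one would read the physical size (0x200 here) at the DVA offset plus the 0x400000 bytes reserved at the front of the vdev, and pass the logical size (0x800) as d_len.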

# ./zuncompress -p 200 -l 800 -o 84800 /var/tmp/zfsfile > /tmp/mos_objset
#

The decompressed objset_phys_t is now in /tmp/mos_objset. Now we'll run mdb on the file to look at the objset_phys_t.

# mdb /tmp/mos_objset
> 0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa
        1 uint8_t dn_indblkshift = 0xe
        2 uint8_t dn_nlevels = 0x1
        3 uint8_t dn_nblkptr = 0x3
    ...
        40 blkptr_t [1] dn_blkptr = [
            40 blkptr_t {
                40 dva_t [3] blk_dva = [
                    40 dva_t {
                        40 uint64_t [2] dva_word = [ 0x5, 0x41f ]
                    },
                    50 dva_t {
                        50 uint64_t [2] dva_word = [ 0x5, 0x941f ]
                    },
                    60 dva_t {
                        60 uint64_t [2] dva_word = [ 0x5, 0x1241f ]
                    },
                ]
                70 uint64_t blk_prop = 0x800a07030004001f
                78 uint64_t [2] blk_pad = [ 0, 0 ]
                88 uint64_t blk_phys_birth = 0
                90 uint64_t blk_birth = 0x14
                98 uint64_t blk_fill = 0x1f
                a0 zio_cksum_t blk_cksum = {
                    a0 uint64_t [4] zc_word = [ 0xbc335cdf82, 0xee3d3a7c1fc4, 0xc355cf13639994, 0x78d0d2289454a408 ]
                }
            },
        ]
        c0 uint8_t [192] dn_bonus = [ 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x4, 0, 0, 0, 0, 0, 0, 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x94, 0, 0, 0, 0, 0, 0, ... ]
...

Let's get the blkptr_t in the objset_phys_t. This will point to either a block containing the dnode_phys_t array for the meta object set (MOS) of the pool, or an indirect block containing blkptr_ts, which may in turn point to the dnode_phys_t blocks or to more indirect blocks.

> ::status
debugging file '/tmp/objset' (object file)
> 40::blkptr
DVA[0]=<0:83e00:a00>
DVA[1]=<0:1283e00:a00>
DVA[2]=<0:2483e00:a00>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique triple
size=4000L/a00P birth=20L/20P fill=31
cksum=bc335cdf82:ee3d3a7c1fc4:c355cf13639994:78d0d2289454a408
> $q
#
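The ::blkptr dcmd is simply decoding the raw dva_word and blk_prop values printed earlier. The Python sketch below reproduces that decoding for the first DVA and the blk_prop of this block pointer. The function names are hypothetical, and the bit layout is my reading of the on-disk format used by this pool: sizes and offsets are stored in 512-byte sectors, and the low 7 bits of the compression byte are used because later pool versions repurpose the top bit of that byte.

```python
SECTOR_SHIFT = 9          # DVA offsets and sizes are counted in 512-byte sectors
LABEL_AREA = 0x400000     # labels and boot area at the front of every vdev

def decode_dva(word0: int, word1: int):
    """Return (vdev, byte offset, allocated byte size) for one dva_t."""
    vdev = word0 >> 32
    asize = (word0 & 0xFFFFFF) << SECTOR_SHIFT
    offset = (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT   # bit 63 is the gang flag
    return vdev, offset, asize

def decode_blk_prop(prop: int):
    """Split blk_prop into the fields that ::blkptr prints."""
    return {
        "lsize": ((prop & 0xFFFF) + 1) << SECTOR_SHIFT,          # logical size
        "psize": (((prop >> 16) & 0xFFFF) + 1) << SECTOR_SHIFT,  # physical size
        "compress": (prop >> 32) & 0x7F,   # 3 is LZJB
        "checksum": (prop >> 40) & 0xFF,   # 7 is FLETCHER_4
        "type": (prop >> 48) & 0xFF,       # 0xa is DMU_OT_DNODE
        "level": (prop >> 56) & 0x1F,      # 0 means no indirection
        "little_endian": bool(prop >> 63),
    }

vdev, offset, asize = decode_dva(0x5, 0x41f)
print(f"<{vdev}:{offset:x}:{asize:x}>")        # prints <0:83e00:a00>
print(hex(LABEL_AREA + offset))                # byte position in /var/tmp/zfsfile
print(decode_blk_prop(0x800a07030004001f))
```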

In this case, the blkptr is for a block containing the MOS (an array of dnode_phys_t). The L0 DNODE in the above output shows that there are 0 levels of indirection. A case where there are multiple levels of indirection from a blkptr_t will be shown below. We'll decompress the block.

# ./zuncompress -p a00 -l 4000 -o 83e00 /var/tmp/zfsfile > /tmp/mos
#

As mentioned earlier, the MOS is an array of dnode_phys_t. The decompressed block is 0x4000 bytes large.

# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200
> 4000%200=K
                20
>

There are 32 (0x20) entries in the array. Let's dump them.

> 0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0  <-- the first entry (id = 0) is not used
    ...
}
200 dnode_phys_t {
    200 uint8_t dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY
    201 uint8_t dn_indblkshift = 0xe
    202 uint8_t dn_nlevels = 0x1
    203 uint8_t dn_nblkptr = 0x3
    204 uint8_t dn_bonustype = 0
    205 uint8_t dn_checksum = 0
    206 uint8_t dn_compress = 0
    207 uint8_t dn_flags = 0x1
    208 uint16_t dn_datablkszsec = 0x2
    20a uint16_t dn_bonuslen = 0
    20c uint8_t [4] dn_pad2 = [ 0, 0, 0, 0 ]
    210 uint64_t dn_maxblkid = 0
    218 uint64_t dn_used = 0x600
    220 uint64_t [4] dn_pad3 = [ 0, 0, 0, 0 ]
    240 blkptr_t [1] dn_blkptr = [
        240 blkptr_t {
            240 dva_t [3] blk_dva = [
                240 dva_t {
                    240 uint64_t [2] dva_word = [ 0x1, 0x20 ]
                },
                250 dva_t {
                    250 uint64_t [2] dva_word = [ 0x1, 0x9020 ]
                },
                260 dva_t {
                    260 uint64_t [2] dva_word = [ 0x1, 0x12000 ]
                },
            ]
            270 uint64_t blk_prop = 0x8001070300000001
            278 uint64_t [2] blk_pad = [ 0, 0 ]
            288 uint64_t blk_phys_birth = 0
            290 uint64_t blk_birth = 0x4
            298 uint64_t blk_fill = 0x1
            2a0 zio_cksum_t blk_cksum = {
                2a0 uint64_t [4] zc_word = [ 0xf38ae7fee, 0x6064734c9bd, 0x13a8cd3126a75, 0x2bfdd306beb1a2 ]
            }
        },
    ]
    ...
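Since the block is a flat array of fixed-size dnode_phys_t structures, finding the entry for a given object id is plain array indexing. A tiny Python sketch of that arithmetic, assuming the 0x200-byte dnode size reported by ::sizeof above (dnode_offset is a hypothetical helper):

```python
DNODE_SIZE = 0x200   # sizeof (dnode_phys_t) as reported by mdb above

def dnode_offset(object_id: int) -> int:
    """Byte offset of an object's dnode_phys_t inside the decompressed array."""
    return object_id * DNODE_SIZE

print(0x4000 // DNODE_SIZE)          # entries in one 0x4000-byte block: 32
for object_id in (1, 2, 0x15):
    print(hex(object_id), "->", hex(dnode_offset(object_id)))
```

Object id 1 lands at offset 0x200, which is the DMU_OT_OBJECT_DIRECTORY entry in the dump above.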

An "object directory" (DMU_OT_OBJECT_DIRECTORY) is a "ZAP" object containing information about the meta objects. Meta objects in the MOS include the root of the pool, snapshots, clones, the space map, and other information. The ZAP object is contained in the data specified by the blkptr_t at location 0x240 in the above output.

> 240::blkptr
DVA[0]=<0:4000:200>
DVA[1]=<0:1204000:200>
DVA[2]=<0:2400000:200>
[L0 OBJECT_DIRECTORY] FLETCHER_4 LZJB LE contiguous unique triple
size=400L/200P birth=4L/4P fill=1
cksum=f38ae7fee:6064734c9bd:13a8cd3126a75:2bfdd306beb1a2
> $q
#

Let's decompress and look at the ZAP.

# ./zuncompress -p 200 -l 400 -o 4000 /var/tmp/zfsfile > /tmp/objdir
# mdb /tmp/objdir
> 0/K
0:              8000000000000003

The 8000000000000003 tells us this is a microzap (as opposed to a "fat ZAP"). Fat ZAPs are used when the amount of data in the ZAP exceeds 1 block (and hence needs indirect blocks).

> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16c04723
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0x2
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "root_dataset" ]
        },
    ]
}
> $q
#

There are more entries, but this is the entry we want (the "root_dataset"). The value of 2 for mze_value is an object id: basically, an index into the MOS array of dnode_phys_ts where the root dataset is described.
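The microzap layout printed above is regular enough to parse with a few lines of code: a 0x40-byte header followed by 0x40-byte entries, each holding an 8-byte value, a 4-byte cd, 2 bytes of padding, and a 50-byte name. Here is a Python sketch under those assumptions, for a little-endian pool; parse_microzap is a hypothetical helper, and the demo block is synthetic rather than read from the pool.

```python
import struct

ZBT_MICRO = 0x8000000000000003   # mz_block_type of a microzap
MZAP_ENT_LEN = 0x40              # size of the header and of each mzap_ent_phys_t

def parse_microzap(block: bytes) -> dict:
    """Return {name: value} for every used entry in a decompressed microzap block."""
    block_type, _salt, _normflags = struct.unpack_from("<QQQ", block, 0)
    if block_type != ZBT_MICRO:
        raise ValueError("not a microzap block")
    entries = {}
    for off in range(MZAP_ENT_LEN, len(block) - MZAP_ENT_LEN + 1, MZAP_ENT_LEN):
        value, _cd, _pad, raw_name = struct.unpack_from("<QIH50s", block, off)
        name = raw_name.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:                      # unused slots have an empty name
            entries[name] = value
    return entries

# Synthetic 0x400-byte block mimicking the first entry shown above.
header = struct.pack("<QQQ40x", ZBT_MICRO, 0x16c04723, 0)
entry = struct.pack("<QIH50s", 2, 0, 0, b"root_dataset")
demo = (header + entry).ljust(0x400, b"\0")
print(parse_microzap(demo))   # prints {'root_dataset': 2}
```

Run against a file such as /tmp/objdir, a parser like this would list every entry at once instead of stepping through them one ::print at a time.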

# mdb /tmp/mos
> 2*200::print -a -t zfs`dnode_phys_t  <-- get the entry at index 2
400 dnode_phys_t {    <-- each dnode_phys_t is 0x200 bytes
    400 uint8_t dn_type = 0xc
    401 uint8_t dn_indblkshift = 0xe
    402 uint8_t dn_nlevels = 0x1
    403 uint8_t dn_nblkptr = 0x1
    404 uint8_t dn_bonustype = 0xc
    405 uint8_t dn_checksum = 0
    406 uint8_t dn_compress = 0
    407 uint8_t dn_flags = 0
    408 uint16_t dn_datablkszsec = 0x1
    40a uint16_t dn_bonuslen = 0x100
    40c uint8_t [4] dn_pad2 = [ 0, 0, 0, 0 ]
    410 uint64_t dn_maxblkid = 0
    418 uint64_t dn_used = 0
    420 uint64_t [4] dn_pad3 = [ 0, 0, 0, 0 ]
    440 blkptr_t [1] dn_blkptr = [
        440 blkptr_t {
            440 dva_t [3] blk_dva = [
                440 dva_t {
                    440 uint64_t [2] dva_word = [ 0, 0 ]
                },
                450 dva_t {
                    450 uint64_t [2] dva_word = [ 0, 0 ]
                },
                460 dva_t {
                    460 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
            470 uint64_t blk_prop = 0
            478 uint64_t [2] blk_pad = [ 0, 0 ]
            488 uint64_t blk_phys_birth = 0
            490 uint64_t blk_birth = 0
            498 uint64_t blk_fill = 0
            4a0 zio_cksum_t blk_cksum = {
                4a0 uint64_t [4] zc_word = [ 0, 0, 0, 0 ]
            }
        },
    ]
    4c0 uint8_t [192] dn_bonus = [ 0x9a, 0xbf, 0x3, 0x52, 0, 0, 0, 0, 0x15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x12, 0, 0, 0, 0, 0, 0, 0, ... ]
...

Here, the blkptr_t is not used. Instead, the information we need is in the "bonus buffer" (dn_bonus at offset 0x4c0).

> 4c0::print -a -t zfs`dsl_dir_phys_t
4c0 dsl_dir_phys_t {
    4c0 uint64_t dd_creation_time = 0x5203bf9a
    4c8 uint64_t dd_head_dataset_obj = 0x15
    4d0 uint64_t dd_parent_obj = 0
    4d8 uint64_t dd_origin_obj = 0x12
    4e0 uint64_t dd_child_dir_zapobj = 0x4
    4e8 uint64_t dd_used_bytes = 0x5d400
    4f0 uint64_t dd_compressed_bytes = 0x4b200
    4f8 uint64_t dd_uncompressed_bytes = 0x4b200
    500 uint64_t dd_quota = 0
    508 uint64_t dd_reserved = 0
    510 uint64_t dd_props_zapobj = 0x3
    518 uint64_t dd_deleg_zapobj = 0
    520 uint64_t dd_flags = 0x1
    528 uint64_t [5] dd_used_breakdown = [ 0x48400, 0, 0x15000, 0, 0 ]
    550 uint64_t dd_clones = 0
    558 uint64_t dd_filesystem_count = 0
    560 uint64_t dd_snapshot_count = 0
    568 uint64_t [11] dd_pad = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
}
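The bonus buffer is just the structure laid out as little-endian 64-bit words, so the first fields above can be cross-checked against the raw dn_bonus bytes shown in the dnode dump. A short Python sketch, using only the first 32 bytes that the dump displays:

```python
import struct
from datetime import datetime, timezone

# First 32 bytes of dn_bonus, copied from the dnode_phys_t dump at offset 0x4c0.
bonus = bytes([0x9a, 0xbf, 0x03, 0x52, 0, 0, 0, 0,
               0x15, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0,
               0x12, 0, 0, 0, 0, 0, 0, 0])

# The first four uint64_t fields of dsl_dir_phys_t, in declaration order.
creation_time, head_dataset_obj, parent_obj, origin_obj = struct.unpack("<4Q", bonus)

print(hex(creation_time))       # 0x5203bf9a, matching dd_creation_time
print(hex(head_dataset_obj))    # 0x15, the object id we follow next
print(datetime.fromtimestamp(creation_time, tz=timezone.utc).isoformat())
```

The creation time is an ordinary Unix timestamp, so it decodes to a date in August 2013.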

From here, we'll go to the dd_head_dataset_obj, 0x15.

> 15*200::print -a -t zfs`dnode_phys_t
2a00 dnode_phys_t {
    2a00 uint8_t dn_type = 0x10  <-- DMU_OT_DSL_DATASET
    ...
    2a40 blkptr_t [1] dn_blkptr = [
        2a40 blkptr_t {
            2a40 dva_t [3] blk_dva = [
                2a40 dva_t {
                    2a40 uint64_t [2] dva_word = [ 0, 0 ]
                },
    ...
    2ac0 uint8_t [192] dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0x12, 0, 0, 0, 0, 0, 0, 0, 0x1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
...

The data for the DMU_OT_DSL_DATASET is in the bonus buffer. Let's dump that out.

> 2ac0::print -a -t zfs`dsl_dataset_phys_t
2ac0 dsl_dataset_phys_t {
    2ac0 uint64_t ds_dir_obj = 0x2
    ...
    2b40 blkptr_t ds_bp = {
        2b40 dva_t [3] blk_dva = [
            2b40 dva_t {
                2b40 uint64_t [2] dva_word = [ 0x1, 0x2d6 ]
            },
            2b50 dva_t {
                2b50 uint64_t [2] dva_word = [ 0x1, 0x90d6 ]
            },
            2b60 dva_t {
                2b60 uint64_t [2] dva_word = [ 0, 0 ]
            },
        ]
     ...

And look at the blkptr_t.

> 2b40::blkptr
DVA[0]=<0:5ac00:200>
DVA[1]=<0:121ac00:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique double
size=800L/200P birth=11L/11P fill=9
cksum=15955ae455:7d0aed4c6f5:17b63dc48793f:3202bc4dfa3b58
> $q
#

This is another objset_phys_t, this time for the root dataset instead of the MOS. We'll decompress and take a look.

# ./zuncompress -p 200 -l 800 -o 5ac00 /var/tmp/zfsfile > /tmp/ds_objset
# mdb /tmp/ds_objset
> 0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa  <-- DMU_OT_DNODE
        ...
        40 blkptr_t [1] dn_blkptr = [
            40 blkptr_t {
                40 dva_t [3] blk_dva = [
                    40 dva_t {
                        40 uint64_t [2] dva_word = [ 0x2, 0x2d1 ]
                    },
                    50 dva_t {
                        50 uint64_t [2] dva_word = [ 0x2, 0x90d1 ]
                    },
                    60 dva_t {
                        60 uint64_t [2] dva_word = [ 0, 0 ]
                    },
                ]
         ...

Grabbing the blkptr_t as was the case for the MOS objset.

> 40::blkptr
DVA[0]=<0:5a200:400>
DVA[1]=<0:121a200:400>
[L6 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a33c7bab6:3e5fa32d9ea0:16d3626ce1ceee:5d94da91be37c8d
> $q
#

For the dataset object set, there are 2 copies of the metadata (unlike the three copies for the MOS). The "L6" says there are 6 levels of indirection. Indirect blocks are blocks containing blkptr_ts that point to blocks containing more blkptr_ts, and so on, until the last level points to blocks containing data. In this case, the chain is 6 levels deep. We'll look at the first blkptr_t in each of these. Note that if this were a large file system with lots of data, we would probably still need the beginning (root of the file system) to get started. In this particular case, the only blkptr_t being used in all of the indirect blocks is the first one. The rest are "holes" (placeholders for when/if the file system has more objects). Given an object id, the arithmetic needed to find the correct path through the indirect blocks for that object id is covered in the papers mentioned at the beginning of this post.
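As a rough sketch of that arithmetic (my own summary, not taken from the papers), the path can be computed from the object id alone. The sketch assumes 0x200-byte dnodes, 0x80-byte blkptr_ts, and the 0x4000-byte data and indirect block sizes seen in this pool; dnode_path is a hypothetical helper.

```python
DNODE_SHIFT = 9     # dnode_phys_t is 0x200 bytes
BLKPTR_SHIFT = 7    # blkptr_t is 0x80 bytes

def dnode_path(object_id, nlevels, datablkshift=14, indblkshift=14):
    """Return ([(level, blkptr byte offset), ...], dnode byte offset in the L0 block).

    Levels are listed from the top indirect level down to level 1.
    """
    dnodes_shift = datablkshift - DNODE_SHIFT     # 32 dnodes per 0x4000 block
    epbs = indblkshift - BLKPTR_SHIFT             # 128 blkptrs per indirect block
    blkid = object_id >> dnodes_shift             # which L0 block holds the dnode
    path = []
    for level in range(nlevels - 1, 0, -1):
        index = (blkid >> (epbs * (level - 1))) & ((1 << epbs) - 1)
        path.append((level, index << BLKPTR_SHIFT))
    slot = object_id & ((1 << dnodes_shift) - 1)
    return path, slot << DNODE_SHIFT

# Object 9 in a 7-level tree (L6 down to L0): always the first blkptr_t.
print(dnode_path(9, 7))
# A larger, made-up object id takes a different route through L2 and L1.
print(dnode_path(5000, 7))
```

For object id 9 every index is 0 and the dnode sits at offset 0x1200 of the level 0 block, which agrees with the walk that follows.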

At this point we'll follow a sequence of decompressing and following the block pointers until we get to level 0 (the dnode_phys_t array for the objects in the (root) dataset).

# ./zuncompress -p 400 -l 4000 -o 5a200 /var/tmp/zfsfile > /tmp/l6_dnode
# mdb /tmp/l6_dnode
> 0::blkptr
DVA[0]=<0:59e00:400>
DVA[1]=<0:1219e00:400>
[L5 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a4813c63a:3e6e4b21ab12:16d82a6aab7196:5da1fa71471b3a2
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59e00 /var/tmp/zfsfile > /tmp/l5_dnode
# mdb /tmp/l5_dnode
> 0::blkptr
DVA[0]=<0:59a00:400>
DVA[1]=<0:1219a00:400>
[L4 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a07a23ca3:3e2ff2ae47d2:16b9e360815f88:5d048405cb59ba5
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59a00 /var/tmp/zfsfile > /tmp/l4_dnode
# mdb /tmp/l4_dnode
> 0::blkptr
DVA[0]=<0:59600:400>
DVA[1]=<0:1219600:400>
[L3 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=594127027c:3d7854dc4336:1664aa2337fdfd:5b5d2ad4907d3f2
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59600 /var/tmp/zfsfile > /tmp/l3_dnode
# mdb /tmp/l3_dnode
> 0::blkptr
DVA[0]=<0:59200:400>
DVA[1]=<0:1219200:400>
[L2 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a6c9eaf90:3e918a332bce:16e93bc40842e1:5dfa7ee35affc19
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59200 /var/tmp/zfsfile > /tmp/l2_dnode
# mdb /tmp/l2_dnode
> 0::blkptr
DVA[0]=<0:58e00:400>
DVA[1]=<0:1218e00:400>
[L1 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=573ebf43bc:3c03ae3ccfbe:15cc559f914849:58921efeca0c341
> $q
#
# ./zuncompress -p 400 -l 4000 -o 58e00 /var/tmp/zfsfile > /tmp/l1_dnode
# mdb /tmp/l1_dnode
> 0::blkptr
DVA[0]=<0:58800:600>
DVA[1]=<0:1218800:600>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/600P birth=11L/11P fill=9
cksum=87a454f048:6092818ca2a1:2d0688f6b70082:104b023c565fb938
> $q
#
# ./zuncompress -p 600 -l 4000 -o 58800 /var/tmp/zfsfile > /tmp/dnodes
#

Now we're at level 0. This is an array of dnode_phys_t for files and directories in the root of the ZFS file system. Let's dump the array.

# mdb /tmp/dnodes
> 0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0  <-- first entry (obj id = 0) is unused
...
200 dnode_phys_t {
    200 uint8_t dn_type = 0x15  <-- DMU_OT_MASTER_NODE
    ...
    240 blkptr_t [1] dn_blkptr = [
        240 blkptr_t {
            240 dva_t [3] blk_dva = [
                240 dva_t {
                    240 uint64_t [2] dva_word = [ 0x1, 0 ]
                },
                250 dva_t {
                    250 uint64_t [2] dva_word = [ 0x1, 0x9000 ]
                },
                260 dva_t {
                    260 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
            ...

The second entry is the "master node" for the file system. Let's look at its blkptr_t.

> 240::blkptr
DVA[0]=<0:0:200>
DVA[1]=<0:1200000:200>
[L0 MASTER_NODE] FLETCHER_4 LZJB LE contiguous unique double
size=400L/200P birth=4L/4P fill=1
cksum=a23da9de2:4526f62c71b:f0b3b5fb1f03:239ad9c427b988
> $q
#

This is another ZAP block. We'll decompress and take a look.

# ./zuncompress -p 200 -l 400 -o 0 /var/tmp/zfsfile > /tmp/master
# mdb /tmp/master
> 0/K
0:              8000000000000003
> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16d68b53
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "normalization" ]
        },
    ]
}
>

Let's look at additional entries in the ZAP object. We want an entry for "ROOT".

> .::print -a -t zfs`mzap_ent_phys_t  <-- "." says continue where we left off
80 mzap_ent_phys_t {
    80 uint64_t mze_value = 0
    88 uint32_t mze_cd = 0
    8c uint16_t mze_pad = 0
    8e char [50] mze_name = [ "utf8only" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
c0 mzap_ent_phys_t {
    c0 uint64_t mze_value = 0
    c8 uint32_t mze_cd = 0
    cc uint16_t mze_pad = 0
    ce char [50] mze_name = [ "casesensitivity" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
100 mzap_ent_phys_t {
    100 uint64_t mze_value = 0x5
    108 uint32_t mze_cd = 0
    10c uint16_t mze_pad = 0
    10e char [50] mze_name = [ "VERSION" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
140 mzap_ent_phys_t {
    140 uint64_t mze_value = 0x2
    148 uint32_t mze_cd = 0
    14c uint16_t mze_pad = 0
    14e char [50] mze_name = [ "SA_ATTRS" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
180 mzap_ent_phys_t {
    180 uint64_t mze_value = 0x3
    188 uint32_t mze_cd = 0
    18c uint16_t mze_pad = 0
    18e char [50] mze_name = [ "DELETE_QUEUE" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
1c0 mzap_ent_phys_t {
    1c0 uint64_t mze_value = 0x4
    1c8 uint32_t mze_cd = 0
    1cc uint16_t mze_pad = 0
    1ce char [50] mze_name = [ "ROOT" ]
}
> $q
#

The root directory for the file system is object id 4 (mze_value from above). This is the 5th entry (counting starts at 0) in the array of dnode_phys_t for the file system. Let's take a look.

> ::status
debugging file '/tmp/dnodes' (object file)
> 4*200::print -a -t zfs`dnode_phys_t
800 dnode_phys_t {
    800 uint8_t dn_type = 0x14  <-- DMU_OT_DIRECTORY_CONTENTS
    ...
    840 blkptr_t [1] dn_blkptr = [
        840 blkptr_t {
            840 dva_t [3] blk_dva = [
                840 dva_t {
                    840 uint64_t [2] dva_word = [ 0x1, 0x2c3 ]
                },
                850 dva_t {
                    850 uint64_t [2] dva_word = [ 0x1, 0x90c3 ]
                },
                860 dva_t {
                    860 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
...

Directories are ZAP objects. We'll dump the blkptr_t, decompress if necessary, and find the words file that we copied into the file system at the beginning of this post.

> 840::blkptr
DVA[0]=<0:58600:200>
DVA[1]=<0:1218600:200>
[L0 DIRECTORY_CONTENTS] FLETCHER_4 OFF LE contiguous unique double
size=200L/200P birth=11L/11P fill=1
cksum=27626ee8e:109d3b9097a:395d35f5c237:8703c96b7bd4c
> $q
#

Notice that compression is turned off ("OFF"), and there are no indirect blocks ("L0"). Since nothing needs decompressing, we can read the block directly from the pool file. DVA offsets are relative to the end of the 4MB (0x400000) label and boot block area at the start of the vdev, which is why 400000 is added to the offset when addressing the file directly.

# mdb /var/tmp/zfsfile
> 400000+58600::print -a -t zfs`mzap_phys_t
458600 mzap_phys_t {
    458600 uint64_t mz_block_type = 0x8000000000000003
    458608 uint64_t mz_salt = 0x16d68999
    458610 uint64_t mz_normflags = 0
    458618 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    458640 mzap_ent_phys_t [1] mz_chunk = [
        458640 mzap_ent_phys_t {
            458640 uint64_t mze_value = 0x8000000000000008
            458648 uint32_t mze_cd = 0
            45864c uint16_t mze_pad = 0
            45864e char [50] mze_name = [ "foo" ]
        },
    ]
}
> .::print -a -t zfs`mzap_ent_phys_t
458680 mzap_ent_phys_t {
    458680 uint64_t mze_value = 0x8000000000000009
    458688 uint32_t mze_cd = 0
    45868c uint16_t mze_pad = 0
    45868e char [50] mze_name = [ "words" ]
}
> $q
#
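In a directory ZAP, mze_value packs two things into one 64-bit word: the top 4 bits hold the directory entry type (8 is a regular file), and the low 48 bits hold the object id. A small Python sketch of that split (decode_dirent is a hypothetical helper name):

```python
def decode_dirent(mze_value: int):
    """Split a directory ZAP value into (entry type, object id)."""
    dirent_type = mze_value >> 60              # top 4 bits, 8 = regular file
    object_id = mze_value & ((1 << 48) - 1)    # low 48 bits
    return dirent_type, object_id

print(decode_dirent(0x8000000000000008))   # "foo":   prints (8, 8)
print(decode_dirent(0x8000000000000009))   # "words": prints (8, 9)
```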

The "words" file is at object id 9. Let's look at that dnode_phys_t.

# mdb /tmp/dnodes
> 9*200::print -a -t zfs`dnode_phys_t
1200 dnode_phys_t {
    1200 uint8_t dn_type = 0x13  <-- DMU_OT_PLAIN_FILE_CONTENTS
    ...
    1240 blkptr_t [1] dn_blkptr = [
        1240 blkptr_t {
            1240 dva_t [3] blk_dva = [
                1240 dva_t {
                    1240 uint64_t [2] dva_word = [ 0x2, 0x27f ]
                },
                1250 dva_t {
                    1250 uint64_t [2] dva_word = [ 0x2, 0x907f ]
                },
                1260 dva_t {
                    1260 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
     ...

Let's look at the blkptr_t.

> 1240::blkptr
DVA[0]=<0:4fe00:400>
DVA[1]=<0:120fe00:400>
[L1 PLAIN_FILE_CONTENTS] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=9L/9P fill=2
cksum=5d1e925d95:3ed351070323:16995992c8e96c:5b9701a2a4ef414
> $q
#

This is a single indirect block (L1 in the above output). This makes sense, as the size of the words file is ~256K, which is more than fits in a single 128K (0x20000) data block. We'll decompress and look at the resulting blkptr_ts.

# ./zuncompress -p 400 -l 4000 -o 4fe00 /var/tmp/zfsfile > /tmp/l1_file
# mdb /tmp/l1_file
> 0::blkptr
DVA[0]=<0:fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=2f6c9bcce37c:bd82a253b632bb1:acb0037ee619745c:5e7c6fc8adcedccd
> 80::blkptr
DVA[0]=<0:2fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=1bae53745c3f:9dc2421d31452d3:658d66823cf4fb0:11c158edbbfcc0f3
> 100::blkptr
> $q
#

Now we'll go to the location specified by these block pointers to get our data.

# mdb /var/tmp/zfsfile
> 400000+fe00,20000/c
0x40fe00:       10th
                1st
                2nd
                3rd
                4th
                5th
                6th
                7th
                8th
                9th
                a
                AAA
                AAAS
                Aarhus
                Aaron
                AAU
                ABA
                Ababa
                aback
            ...

And there are the contents of the first 128KB of the file. The remainder of the file is in the block specified by the second blkptr_t, the one at offset 80 in the indirect block (the 80::blkptr output above).

If this were binary data, it would be simple enough to use dd(1M): seek to the correct location on the device and dump from there. For instance,

> (400000+fe00)%200=E
                8319
> 20000%200=E
                256
# dd if=/var/tmp/zfsfile iseek=8319 bs=512 count=256
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
...
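The same extraction can be scripted for all of a file's level 0 blocks at once. Below is a minimal Python sketch, assuming the blocks are uncompressed (as the "OFF" in the ::blkptr output shows for this file) and live on a single vdev backed by one file; extract_blocks and the output path are hypothetical names.

```python
LABEL_AREA = 0x400000   # labels and boot area that precede allocatable space

def extract_blocks(image_path, dvas, out_path):
    """Copy uncompressed blocks, given as (DVA offset, size) pairs, into out_path.

    Returns the number of bytes written.
    """
    total = 0
    with open(image_path, "rb") as image, open(out_path, "wb") as out:
        for offset, size in dvas:
            image.seek(LABEL_AREA + offset)
            data = image.read(size)
            if len(data) != size:
                raise IOError(f"short read at DVA offset {offset:#x}")
            out.write(data)
            total += size
    return total

# For the words file above, the call would look like this:
# extract_blocks("/var/tmp/zfsfile",
#                [(0xfe00, 0x20000), (0x2fe00, 0x20000)],
#                "/tmp/words.recovered")
```

Note that this copies whole blocks. The file's true length is kept in the dnode's bonus buffer, which is not decoded in this post, so the tail of the last block may contain padding past the end of the file.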

That's a lot of work. Is there a way to just "see" all of the information? Yes, it's called zdb(1M). But zdb is not interactive, and it does not work with destroyed pools (or pools that won't import). Also, I find that using mdb this way forces you to understand the on-disk format. For me, that is much preferable to having it all done for me.

Note that this technique only works on illumos-based systems, i.e., systems with mdb. I cannot include Solaris 11 or newer because there is no way to build mdb without source code. But what if you are using ZFS on Linux?

You could upload your devices (or files) as files to manta, along with the modified mdb, the zfs.so and rawzfs.so modules, and the zuncompress program. Then you use mlogin to log into the manta instance and try from there. I've included built copies of mdb, the modules, and zuncompress in the github repo. Note that I have not yet tried this, but it will likely be in a blog post in the next week or so.

Have fun!



Post written by Mr. Max Bruning