Bruning Questions: ZFS Forensics - Recovering Files From a Destroyed Zpool

August 12, 2013 - by Mr. Max Bruning

Back in 2008, I wrote a post about recovering a removed file on a ZFS disk. That post links to a paper here (see page 36) and a set of slides here.

Over time, I have received email from various people asking for help recovering files, pools, or datasets, or asking for the tools I talk about in the blog post and at the OpenSolaris Developers Conference in Prague in 2008. These tools were a modified mdb(1) and a modified zdb(1M). It is time to revisit that work.

In this post, I'll create a ZFS pool, add a file to the pool, destroy the pool, and then recover the file. To do this, I'll use a modified mdb, and a tool I wrote to uncompress ZFS compressed data/metadata (zuncompress). Since zdb does not seem to work with destroyed zpools (in fact, much of zdb does not work with pools that do not import), I will not be using it. The code for what I am using is available at mdbzfs. Please read the README file for instructions on how to set things up.

For those of you who are running ZFS on Linux, at the end of this blog post, I have a suggestion on how you might try this on your ZFS on Linux file system.

Before you try this on your own, please back up the disk(s) in question. Use the technique I am showing at your own risk. (Note that nothing I am doing should change any data in the zpool.) If you are using a file the way I do here, there is of course no need to make a backup.

First, we'll create a ZFS pool using a file, then add a file to the pool, then destroy the pool:

# mkfile 100m /var/tmp/zfsfile
# zpool create testpool /var/tmp/zfsfile
# touch /testpool/foo
# cp /usr/dict/words /testpool/words
# sync
# zpool destroy testpool
#

Note that the first time I tried this, I did not do the sync. I created the pool, added the file, and destroyed the pool before ZFS got around to committing the transactions to disk, resulting in the file not showing up.

The steps we'll take to get the words file back from the destroyed pool will start at the uberblock, and walk the (compressed) metadata structures on disk until we get to the file. If I (or someone else) ever get around to adding a "zfs on disk" target to mdb, this will be much simpler.

# mdb /var/tmp/zfsfile
> ::walk uberblock u | ::print zfs`uberblock_t ub_txg ! sort -r
ub_txg = 0xe
ub_txg = 0xd
ub_txg = 0xc
ub_txg = 0xb
ub_txg = 0xa
ub_txg = 0x9
ub_txg = 0x6
ub_txg = 0x5
ub_txg = 0x4
ub_txg = 0x14
ub_txg = 0x11
ub_txg = 0
ub_txg = 0
...
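
For readers who do not have the modified mdb at hand, the same scan can be approximated in a few lines of Python. This is only a sketch, not the code of the uberblock walker in rawzfs.so. It assumes a little-endian pool, an ashift=9 vdev (so each uberblock slot is 1KB), and that only the first label is examined (each vdev carries four labels). The function name and constants other than the magic number are made up for illustration.

```python
import struct

UB_MAGIC = 0x00bab10c          # ub_magic, as printed by mdb
UB_RING_OFFSET = 128 * 1024    # the uberblock array starts 128KB into a label
UB_RING_SIZE = 128 * 1024      # and fills the rest of the 256KB label
UB_SLOT_SIZE = 1024            # slot size; assumes an ashift=9 vdev

def scan_uberblocks(image, label_offset=0):
    """Return (txg, timestamp, byte_offset) for each valid slot, newest first."""
    found = []
    for slot in range(UB_RING_SIZE // UB_SLOT_SIZE):
        off = label_offset + UB_RING_OFFSET + slot * UB_SLOT_SIZE
        # uberblock_t begins with ub_magic, ub_version, ub_txg,
        # ub_guid_sum and ub_timestamp, all 64-bit.
        magic, _version, txg, _guid_sum, timestamp = struct.unpack_from(
            "<5Q", image, off)
        if magic == UB_MAGIC:
            found.append((txg, timestamp, off))
    # Sort numerically so that 0x14 really comes before 0x4.
    return sorted(found, reverse=True)
```

To try it, read the first 256KB of the pool file (for example, open("/var/tmp/zfsfile", "rb").read(256 * 1024)) and pass the bytes to scan_uberblocks(). Unlike the sort -r in the mdb pipeline, the result is ordered by the numeric value of ub_txg.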

The uberblock walker is in the rawzfs.so dmod (see the source on github). And I have added the following lines to ~/.mdbrc:

::loadctf    <-- gets kernel CTF info
::load /root/rawzfs.so   <-- or wherever you put the rawzfs.so file
::load /root/zfs.so  <-- or where you put zfs.so

The zfs.so and rawzfs.so files are built when you build mdb from my github repo. If you gmake world, you may not need to do the two loads. So, in this case, the highest transaction group id is 0x14. (sort -r compares the lines as text, which is why 0x14 and 0x11 are listed below 0x4.) Note that I am making an assumption that this is the last active uberblock_t. If it doesn't work, try the next lower id. Let's print out the uberblock_t for that transaction group id.

> ::walk uberblock u | ::print zfs`uberblock_t ub_txg | ::grep ".==14" | ::eval "<u::print -a zfs`uberblock_t"
25000 {
    25000 ub_magic = 0xbab10c
    25008 ub_version = 0x1388
    25010 ub_txg = 0x14
    25018 ub_guid_sum = 0x22807e13b4464086
    25020 ub_timestamp = 0x5203bfc9
    25028 ub_rootbp = {
        25028 blk_dva = [
            25028 {
                25028 dva_word = [ 0x1, 0x424 ]
            },
            25038 {
                25038 dva_word = [ 0x1, 0x9424 ]
            },
            25048 {
                25048 dva_word = [ 0x1, 0x12424 ]
            },
        ]
        25058 blk_prop = 0x800b070300000003
        25060 blk_pad = [ 0, 0 ]
        25070 blk_phys_birth = 0
        25078 blk_birth = 0x14
        25080 blk_fill = 0x27
        25088 blk_cksum = {
            25088 zc_word = [ 0x126da42f4f, 0x6be7bf74635, 0x145b828e81ab7, 
0x2a37bf50847b59 ]
        }
    }
    250a8 ub_software_version = 0x1388
}

The rootbp blkptr_t in the above takes us to an objset_phys_t for the meta object set (MOS) for the pool. Let's look at that blkptr_t.

> 25028::blkptr
DVA[0]=<0:84800:200>
DVA[1]=<0:1284800:200>
DVA[2]=<0:2484800:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique triple
size=800L/200P birth=20L/20P fill=39
cksum=126da42f4f:6be7bf74635:145b828e81ab7:2a37bf50847b59
> $q
#

So, there are 3 copies of the objset_phys_t specified by the blkptr, at 0x84800, 0x1284800, and at 0x2484800 bytes into the first (and only) vdev (the leading 0 in 0:84800:200). The three copies are compressed via lzjb compression. On disk, each is 0x200 bytes large. Decompressed, the objset_phys_t is 0x800 bytes. Currently, mdb has no way to decompress the data. We'll use the new tool zuncompress to uncompress the data into a file.

# ./zuncompress -p 200 -l 800 -o 84800 /var/tmp/zfsfile > /tmp/mos_objset
#
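
The numbers that ::blkptr printed can also be worked out by hand from the raw dva_word and blk_prop values in the uberblock_t dump. The Python sketch below shows the arithmetic. The bit positions are the ones used by the blkptr_t macros for the pool version in this post; the function names are made up for illustration and are not part of mdb or zuncompress.

```python
SECTOR_SHIFT = 9   # sizes and offsets are stored in 512-byte units

def decode_dva(word0, word1):
    """Split a dva_word pair into (vdev, byte offset, allocated bytes)."""
    vdev = word0 >> 32
    asize = (word0 & 0xFFFFFF) << SECTOR_SHIFT
    # The top bit of word1 is the gang flag; the rest is the offset.
    offset = (word1 & ((1 << 63) - 1)) << SECTOR_SHIFT
    return vdev, offset, asize

def decode_prop(blk_prop):
    """Pull the fields that ::blkptr reports out of blk_prop."""
    return {
        # Sizes are stored as (number of sectors - 1).
        "lsize": ((blk_prop & 0xFFFF) + 1) << SECTOR_SHIFT,
        "psize": (((blk_prop >> 16) & 0xFFFF) + 1) << SECTOR_SHIFT,
        "compress": (blk_prop >> 32) & 0xFF,    # 3 is LZJB, 2 is off
        "checksum": (blk_prop >> 40) & 0xFF,    # 7 is FLETCHER_4
        "type": (blk_prop >> 48) & 0xFF,        # 0xb is OBJSET
        "level": (blk_prop >> 56) & 0x1F,       # 0 means no indirection
        "little_endian": bool(blk_prop >> 63),
    }
```

For the rootbp above, decode_dva(0x1, 0x424) gives vdev 0, offset 0x84800 and size 0x200, and decode_prop(0x800b070300000003) gives a logical size of 0x800 and a physical size of 0x200, which is where the -o, -p and -l values passed to zuncompress come from.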

The decompressed objset_phys_t is now in /tmp/mos_objset. Now we'll run mdb on the file to look at the objset_phys_t.

# mdb /tmp/mos_objset
>0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa
        1 uint8_t dn_indblkshift = 0xe
        2 uint8_t dn_nlevels = 0x1
        3 uint8_t dn_nblkptr = 0x3
    ...
        40 blkptr_t [1] dn_blkptr = [
            40 blkptr_t {
                40 dva_t [3] blk_dva = [
                    40 dva_t {
                        40 uint64_t [2] dva_word = [ 0x5, 0x41f ]
                    },
                    50 dva_t {
                        50 uint64_t [2] dva_word = [ 0x5, 0x941f ]
                    },
                    60 dva_t {
                        60 uint64_t [2] dva_word = [ 0x5, 0x1241f ]
                    },
                ]
                70 uint64_t blk_prop = 0x800a07030004001f
                78 uint64_t [2] blk_pad = [ 0, 0 ]
                88 uint64_t blk_phys_birth = 0
                90 uint64_t blk_birth = 0x14
                98 uint64_t blk_fill = 0x1f
                a0 zio_cksum_t blk_cksum = {
                    a0 uint64_t [4] zc_word = [ 0xbc335cdf82, 0xee3d3a7c1fc4, 
0xc355cf13639994, 0x78d0d2289454a408 ]
                }
            },
        ]
        c0 uint8_t [192] dn_bonus = [ 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x4, 0, 0,
 0, 0, 0, 0, 0x3, 0, 0, 0, 0, 0, 0, 0, 0x2b, 0x94, 0, 0, 0, 0, 0, 0, ... ]
...

Let's get the blkptr_t in the objset_phys_t. This will be either a block containing the dnode_phys_t for the meta object set (MOS) for the pool, or an indirect block containing blkptr_ts which may point to the dnode_phys_t or to more indirect blocks.

> ::status
debugging file '/tmp/mos_objset' (object file)
> 40::blkptr
DVA[0]=<0:83e00:a00>
DVA[1]=<0:1283e00:a00>
DVA[2]=<0:2483e00:a00>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique triple
size=4000L/a00P birth=20L/20P fill=31
cksum=bc335cdf82:ee3d3a7c1fc4:c355cf13639994:78d0d2289454a408
> $q
#

In this case, the blkptr is for a block containing the MOS (an array of dnode_phys_t). The L0 DNODE in the above output shows that there are 0 levels of indirection. A case where there are multiple levels of indirection from a blkptr_t will be shown below. We'll decompress the block.

# ./zuncompress -p a00 -l 4000 -o 83e00 /var/tmp/zfsfile > /tmp/mos
#
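
To give an idea of what the decompression step involves, here is a Python sketch of LZJB decoding. It is not the source of zuncompress; it follows the decoding loop of the LZJB implementation in illumos (lzjb.c). The read_block helper is hypothetical, and its 0x400000 default is the space reserved at the front of the vdev for the labels and boot block, which has to be added to a DVA offset to get a file offset.

```python
NBBY = 8
MATCH_BITS = 6
MATCH_MIN = 3
OFFSET_MASK = (1 << (16 - MATCH_BITS)) - 1

def lzjb_decompress(src, lsize):
    """Expand an LZJB-compressed buffer to at most lsize bytes."""
    dst = bytearray()
    pos = 0
    copymap = 0
    copymask = 1 << (NBBY - 1)
    while len(dst) < lsize:
        copymask <<= 1
        if copymask == 1 << NBBY:
            # A new map byte describes the next 8 items.
            copymask = 1
            copymap = src[pos]
            pos += 1
        if copymap & copymask:
            # Back-reference: 6 bits of length, 10 bits of distance.
            length = (src[pos] >> (NBBY - MATCH_BITS)) + MATCH_MIN
            distance = ((src[pos] << NBBY) | src[pos + 1]) & OFFSET_MASK
            pos += 2
            start = len(dst) - distance
            if start < 0:
                raise ValueError("corrupt LZJB stream")
            for i in range(length):
                if len(dst) == lsize:
                    break
                dst.append(dst[start + i])
        else:
            # Literal byte, copied through unchanged.
            dst.append(src[pos])
            pos += 1
    return bytes(dst)

def read_block(path, dva_offset, psize, lsize, data_start=0x400000):
    """Hypothetical helper: read psize bytes at a DVA offset and expand them."""
    with open(path, "rb") as f:
        f.seek(data_start + dva_offset)
        return lzjb_decompress(f.read(psize), lsize)
```

Under these assumptions, read_block("/var/tmp/zfsfile", 0x83e00, 0xa00, 0x4000) would return the same 0x4000 bytes that the zuncompress command above writes to /tmp/mos.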

As mentioned earlier, the MOS is an array of dnode_phys_t. The decompressed block is 0x4000 bytes large.

# mdb /tmp/mos
> ::sizeof zfs`dnode_phys_t
sizeof (zfs`dnode_phys_t) = 0x200
> 4000%200=K
                20
>

There are 32 (0x20) entries in the array. Let's dump them.

> 0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0  <-- the first entry (id = 0) is not used
    ...
}
200 dnode_phys_t {
    200 uint8_t dn_type = 0x1 <-- DMU_OT_OBJECT_DIRECTORY
    201 uint8_t dn_indblkshift = 0xe
    202 uint8_t dn_nlevels = 0x1
    203 uint8_t dn_nblkptr = 0x3
    204 uint8_t dn_bonustype = 0
    205 uint8_t dn_checksum = 0
    206 uint8_t dn_compress = 0
    207 uint8_t dn_flags = 0x1
    208 uint16_t dn_datablkszsec = 0x2
    20a uint16_t dn_bonuslen = 0
    20c uint8_t [4] dn_pad2 = [ 0, 0, 0, 0 ]
    210 uint64_t dn_maxblkid = 0
    218 uint64_t dn_used = 0x600
    220 uint64_t [4] dn_pad3 = [ 0, 0, 0, 0 ]
    240 blkptr_t [1] dn_blkptr = [
        240 blkptr_t {
            240 dva_t [3] blk_dva = [
                240 dva_t {
                    240 uint64_t [2] dva_word = [ 0x1, 0x20 ]
                },
                250 dva_t {
                    250 uint64_t [2] dva_word = [ 0x1, 0x9020 ]
                },
                260 dva_t {
                    260 uint64_t [2] dva_word = [ 0x1, 0x12000 ]
                },
            ]
            270 uint64_t blk_prop = 0x8001070300000001
            278 uint64_t [2] blk_pad = [ 0, 0 ]
            288 uint64_t blk_phys_birth = 0
            290 uint64_t blk_birth = 0x4
            298 uint64_t blk_fill = 0x1
            2a0 zio_cksum_t blk_cksum = {
                2a0 uint64_t [4] zc_word = [ 0xf38ae7fee, 0x6064734c9bd, 
0x13a8cd3126a75, 0x2bfdd306beb1a2 ]
            }
        },
    ]
    ...

An "object directory" (DMU_OT_OBJECT_DIRECTORY) is a "ZAP" object containing information about the meta objects. Meta objects in the MOS include the root of the pool, snapshots, clones, the space map, and other information. The ZAP object is contained in the data specified by the blkptr_t at location 0x240 in the above output.

> 240::blkptr
DVA[0]=<0:4000:200>
DVA[1]=<0:1204000:200>
DVA[2]=<0:2400000:200>
[L0 OBJECT_DIRECTORY] FLETCHER_4 LZJB LE contiguous unique triple
size=400L/200P birth=4L/4P fill=1
cksum=f38ae7fee:6064734c9bd:13a8cd3126a75:2bfdd306beb1a2
> $q
#

Let's decompress and look at the ZAP.

# ./zuncompress -p 200 -l 400 -o 4000 /var/tmp/zfsfile > /tmp/objdir
# mdb /tmp/objdir
> 0/K

0: 8000000000000003

The 8000000000000003 tells us this is a microzap (as opposed to a "fat ZAP"). Fat ZAPs are used when the amount of data in the ZAP exceeds 1 block (hence needs indirect blocks).

> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16c04723
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0x2
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "root_dataset" ]
        },
    ]
}
> $q
#

There are more entries, but this is the entry we want (the "root_dataset"). The value of 2 for mze_value is an object id: basically, an index into the MOS array of dnode_phys_ts, where the root dataset is described.
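
Since a microzap is just a 64-byte header followed by 64-byte entries, listing every entry at once is easy to sketch in Python. The layout below is taken from the mzap_phys_t and mzap_ent_phys_t output shown above, and a little-endian pool is assumed. The function name is made up for illustration.

```python
import struct

ZBT_MICRO = 0x8000000000000003   # mz_block_type of a microzap
MZAP_HEADER_SIZE = 64            # mz_block_type, mz_salt, mz_normflags, mz_pad[5]
MZAP_ENT_SIZE = 64               # mze_value, mze_cd, mze_pad, mze_name[50]

def parse_microzap(block):
    """Return {name: value} for every used entry in a microzap block."""
    (block_type,) = struct.unpack_from("<Q", block, 0)
    if block_type != ZBT_MICRO:
        raise ValueError("not a microzap block")
    entries = {}
    last = len(block) - MZAP_ENT_SIZE
    for off in range(MZAP_HEADER_SIZE, last + 1, MZAP_ENT_SIZE):
        value, _cd, _pad, raw = struct.unpack_from("<QIH50s", block, off)
        name = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        if name:   # an empty name marks an unused slot
            entries[name] = value
    return entries
```

Run against the contents of /tmp/objdir, this should include "root_dataset" with the value 2 among the entries it returns.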

# mdb /tmp/mos
> 2*200::print -a -t zfs`dnode_phys_t  <-- get the entry at index 2
400 dnode_phys_t {    <-- each dnode_phys_t is 0x200 bytes
    400 uint8_t dn_type = 0xc
    401 uint8_t dn_indblkshift = 0xe
    402 uint8_t dn_nlevels = 0x1
    403 uint8_t dn_nblkptr = 0x1
    404 uint8_t dn_bonustype = 0xc
    405 uint8_t dn_checksum = 0
    406 uint8_t dn_compress = 0
    407 uint8_t dn_flags = 0
    408 uint16_t dn_datablkszsec = 0x1
    40a uint16_t dn_bonuslen = 0x100
    40c uint8_t [4] dn_pad2 = [ 0, 0, 0, 0 ]
    410 uint64_t dn_maxblkid = 0
    418 uint64_t dn_used = 0
    420 uint64_t [4] dn_pad3 = [ 0, 0, 0, 0 ]
    440 blkptr_t [1] dn_blkptr = [
        440 blkptr_t {
            440 dva_t [3] blk_dva = [
                440 dva_t {
                    440 uint64_t [2] dva_word = [ 0, 0 ]
                },
                450 dva_t {
                    450 uint64_t [2] dva_word = [ 0, 0 ]
                },
                460 dva_t {
                    460 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
            470 uint64_t blk_prop = 0
            478 uint64_t [2] blk_pad = [ 0, 0 ]
            488 uint64_t blk_phys_birth = 0
            490 uint64_t blk_birth = 0
            498 uint64_t blk_fill = 0
            4a0 zio_cksum_t blk_cksum = {
                4a0 uint64_t [4] zc_word = [ 0, 0, 0, 0 ]
            }
        },
    ]
    4c0 uint8_t [192] dn_bonus = [ 0x9a, 0xbf, 0x3, 0x52, 0, 0, 0, 0, 0x15, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x12, 0, 0, 0, 0, 0, 0, 0, ... ]
...

Here, the blkptr_t is not used. Instead, the information we need is in the "bonus buffer" (dn_bonus at offset 0x4c0).

> 4c0::print -a -t zfs`dsl_dir_phys_t
4c0 dsl_dir_phys_t {
    4c0 uint64_t dd_creation_time = 0x5203bf9a
    4c8 uint64_t dd_head_dataset_obj = 0x15
    4d0 uint64_t dd_parent_obj = 0
    4d8 uint64_t dd_origin_obj = 0x12
    4e0 uint64_t dd_child_dir_zapobj = 0x4
    4e8 uint64_t dd_used_bytes = 0x5d400
    4f0 uint64_t dd_compressed_bytes = 0x4b200
    4f8 uint64_t dd_uncompressed_bytes = 0x4b200
    500 uint64_t dd_quota = 0
    508 uint64_t dd_reserved = 0
    510 uint64_t dd_props_zapobj = 0x3
    518 uint64_t dd_deleg_zapobj = 0
    520 uint64_t dd_flags = 0x1
    528 uint64_t [5] dd_used_breakdown = [ 0x48400, 0, 0x15000, 0, 0 ]
    550 uint64_t dd_clones = 0
    558 uint64_t dd_filesystem_count = 0
    560 uint64_t dd_snapshot_count = 0
    568 uint64_t [11] dd_pad = [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
}

From here, we'll go to the dd_head_dataset_obj, 0x15.

> 15*200::print -a -t zfs`dnode_phys_t
2a00 dnode_phys_t {
    2a00 uint8_t dn_type = 0x10  <-- DMU_OT_DSL_DATASET
    ...
    2a40 blkptr_t [1] dn_blkptr = [
        2a40 blkptr_t {
            2a40 dva_t [3] blk_dva = [
                2a40 dva_t {
                    2a40 uint64_t [2] dva_word = [ 0, 0 ]
                },
    ...
    2ac0 uint8_t [192] dn_bonus = [ 0x2, 0, 0, 0, 0, 0, 0, 0, 0x12, 0, 0, 0, 0, 
0, 0, 0, 0x1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]
...

The data for the DMU_OT_DSL_DATASET is in the bonus buffer. Let's dump that out.

> 2ac0::print -a -t zfs`dsl_dataset_phys_t
2ac0 dsl_dataset_phys_t {
    2ac0 uint64_t ds_dir_obj = 0x2
    ...
    2b40 blkptr_t ds_bp = {
        2b40 dva_t [3] blk_dva = [
            2b40 dva_t {
                2b40 uint64_t [2] dva_word = [ 0x1, 0x2d6 ]
            },
            2b50 dva_t {
                2b50 uint64_t [2] dva_word = [ 0x1, 0x90d6 ]
            },
            2b60 dva_t {
                2b60 uint64_t [2] dva_word = [ 0, 0 ]
            },
        ]
     ...

And look at the blkptr_t.

> 2b40::blkptr
DVA[0]=<0:5ac00:200>
DVA[1]=<0:121ac00:200>
[L0 OBJSET] FLETCHER_4 LZJB LE contiguous unique double
size=800L/200P birth=11L/11P fill=9
cksum=15955ae455:7d0aed4c6f5:17b63dc48793f:3202bc4dfa3b58
> $q
#

This is another objset_phys_t, this time for the root dataset instead of the MOS. We'll decompress and take a look.

# ./zuncompress -p 200 -l 800 -o 5ac00 /var/tmp/zfsfile > /tmp/ds_objset
# mdb /tmp/ds_objset
> 0::print -a -t zfs`objset_phys_t
0 objset_phys_t {
    0 dnode_phys_t os_meta_dnode = {
        0 uint8_t dn_type = 0xa  <-- DMU_OT_DNODE
        ...
        40 blkptr_t [1] dn_blkptr = [
            40 blkptr_t {
                40 dva_t [3] blk_dva = [
                    40 dva_t {
                        40 uint64_t [2] dva_word = [ 0x2, 0x2d1 ]
                    },
                    50 dva_t {
                        50 uint64_t [2] dva_word = [ 0x2, 0x90d1 ]
                    },
                    60 dva_t {
                        60 uint64_t [2] dva_word = [ 0, 0 ]
                    },
                ]
         ...

Grabbing the blkptr_t as was the case for the MOS objset.

> 40::blkptr
DVA[0]=<0:5a200:400>
DVA[1]=<0:121a200:400>
[L6 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a33c7bab6:3e5fa32d9ea0:16d3626ce1ceee:5d94da91be37c8d
> $q
#

For the dataset object set, there are 2 copies of the metadata (unlike the three copies for the MOS). The "L6" says this is a level 6 block, so there are 6 levels of indirection. Indirect blocks are blocks containing blkptr_ts of blocks containing block pointers... of blocks containing data; in this case, 6 levels deep. We'll look at the first blkptr_t in each of these. Note that if this were a large file system with lots of data, we would probably still need the beginning (root of the file system) to get started. In this particular case, the only blkptr_t being used in all of the indirect blocks is the first one. The rest are "holes" (placeholders for when/if the file system has more objects). Given an object id, the arithmetic needed to find the correct path through the indirect blocks for that object id is covered in the paper and slides mentioned at the beginning of this post.
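
As a rough illustration of that arithmetic, the Python sketch below computes, for a given object id, which blkptr_t to follow inside each indirect block and where the dnode_phys_t sits in the final level 0 block. It assumes the sizes seen in this pool: 0x4000-byte dnode blocks, 0x200-byte dnode_phys_ts, and 0x80-byte blkptr_ts in indirect blocks of 1 << 0xe bytes. The function name is made up for illustration.

```python
DNODE_SHIFT = 9     # each dnode_phys_t is 0x200 bytes
BLKPTR_SHIFT = 7    # each blkptr_t is 0x80 bytes

def dnode_path(object_id, top_level, data_block_size=0x4000, indblkshift=0xe):
    """Return ([(level, byte offset of the blkptr_t to follow)], dnode offset)."""
    dnodes_per_block = data_block_size >> DNODE_SHIFT   # 0x20 in this pool
    ptr_bits = indblkshift - BLKPTR_SHIFT               # 7, so 128 pointers per block
    blkid = object_id // dnodes_per_block               # which level 0 block
    path = []
    for level in range(top_level, 0, -1):
        # Inside a level N block, pick the pointer to the level N-1 block.
        index = (blkid >> (ptr_bits * (level - 1))) & ((1 << ptr_bits) - 1)
        path.append((level, index << BLKPTR_SHIFT))
    # Offset of the dnode_phys_t within the level 0 block.
    return path, (object_id % dnodes_per_block) << DNODE_SHIFT
```

For a small object id such as 9, dnode_path(9, 6) says to follow the blkptr_t at offset 0 in every indirect block from level 6 down to level 1, and then to look 0x1200 bytes into the level 0 block.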

At this point we'll follow a sequence of decompressing and following the block pointers until we get to level 0 (the dnode_phys_t array for the objects in the (root) dataset).

# ./zuncompress -p 400 -l 4000 -o 5a200 /var/tmp/zfsfile > /tmp/l6_dnode
# mdb /tmp/l6_dnode
> 0::blkptr
DVA[0]=<0:59e00:400>
DVA[1]=<0:1219e00:400>
[L5 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a4813c63a:3e6e4b21ab12:16d82a6aab7196:5da1fa71471b3a2
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59e00 /var/tmp/zfsfile > /tmp/l5_dnode
# mdb /tmp/l5_dnode
> 0::blkptr
DVA[0]=<0:59a00:400>
DVA[1]=<0:1219a00:400>
[L4 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a07a23ca3:3e2ff2ae47d2:16b9e360815f88:5d048405cb59ba5
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59a00 /var/tmp/zfsfile > /tmp/l4_dnode
# mdb /tmp/l4_dnode
> 0::blkptr
DVA[0]=<0:59600:400>
DVA[1]=<0:1219600:400>
[L3 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=594127027c:3d7854dc4336:1664aa2337fdfd:5b5d2ad4907d3f2
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59600 /var/tmp/zfsfile > /tmp/l3_dnode
# mdb /tmp/l3_dnode
> 0::blkptr
DVA[0]=<0:59200:400>
DVA[1]=<0:1219200:400>
[L2 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=5a6c9eaf90:3e918a332bce:16e93bc40842e1:5dfa7ee35affc19
> $q
#
# ./zuncompress -p 400 -l 4000 -o 59200 /var/tmp/zfsfile > /tmp/l2_dnode
# mdb /tmp/l2_dnode
> 0::blkptr
DVA[0]=<0:58e00:400>
DVA[1]=<0:1218e00:400>
[L1 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=11L/11P fill=9
cksum=573ebf43bc:3c03ae3ccfbe:15cc559f914849:58921efeca0c341
> $q
#
# ./zuncompress -p 400 -l 4000 -o 58e00 /var/tmp/zfsfile > /tmp/l1_dnode
# mdb /tmp/l1_dnode
> 0::blkptr
DVA[0]=<0:58800:600>
DVA[1]=<0:1218800:600>
[L0 DNODE] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/600P birth=11L/11P fill=9
cksum=87a454f048:6092818ca2a1:2d0688f6b70082:104b023c565fb938
> $q
#
# ./zuncompress -p 600 -l 4000 -o 58800 /var/tmp/zfsfile > /tmp/dnodes
#

Now we're at level 0. This is an array of dnode_phys_t for files and directories in the root of the ZFS file system. Let's dump the array.

# mdb /tmp/dnodes
>0,20::print -a -t zfs`dnode_phys_t
0 dnode_phys_t {
    0 uint8_t dn_type = 0  <-- first entry (obj id = 0) is unused
...
200 dnode_phys_t {
    200 uint8_t dn_type = 0x15  <-- DMU_OT_MASTER_NODE
    ...
    240 blkptr_t [1] dn_blkptr = [
        240 blkptr_t {
            240 dva_t [3] blk_dva = [
                240 dva_t {
                    240 uint64_t [2] dva_word = [ 0x1, 0 ]
                },
                250 dva_t {
                    250 uint64_t [2] dva_word = [ 0x1, 0x9000 ]
                },
                260 dva_t {
                    260 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
            ...

The second entry is the "master node" for the file system. Let's look at the blkptr_t.

> 240::blkptr
DVA[0]=<0:0:200>
DVA[1]=<0:1200000:200>
[L0 MASTER_NODE] FLETCHER_4 LZJB LE contiguous unique double
size=400L/200P birth=4L/4P fill=1
cksum=a23da9de2:4526f62c71b:f0b3b5fb1f03:239ad9c427b988
> $q
#

This is another ZAP block. We'll decompress and take a look.

# ./zuncompress -p 200 -l 400 -o 0 /var/tmp/zfsfile > /tmp/master
# mdb /tmp/master
> 0/K
0:              8000000000000003 
> 0::print -a -t zfs`mzap_phys_t
0 mzap_phys_t {
    0 uint64_t mz_block_type = 0x8000000000000003
    8 uint64_t mz_salt = 0x16d68b53
    10 uint64_t mz_normflags = 0
    18 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    40 mzap_ent_phys_t [1] mz_chunk = [
        40 mzap_ent_phys_t {
            40 uint64_t mze_value = 0
            48 uint32_t mze_cd = 0
            4c uint16_t mze_pad = 0
            4e char [50] mze_name = [ "normalization" ]
        },
    ]
}
>

Let's look at additional entries in the ZAP object. We want an entry for "ROOT".

> .::print -a -t zfs`mzap_ent_phys_t  <-- "." says continue where we left off
80 mzap_ent_phys_t {
    80 uint64_t mze_value = 0
    88 uint32_t mze_cd = 0
    8c uint16_t mze_pad = 0
    8e char [50] mze_name = [ "utf8only" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
c0 mzap_ent_phys_t {
    c0 uint64_t mze_value = 0
    c8 uint32_t mze_cd = 0
    cc uint16_t mze_pad = 0
    ce char [50] mze_name = [ "casesensitivity" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
100 mzap_ent_phys_t {
    100 uint64_t mze_value = 0x5
    108 uint32_t mze_cd = 0
    10c uint16_t mze_pad = 0
    10e char [50] mze_name = [ "VERSION" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
140 mzap_ent_phys_t {
    140 uint64_t mze_value = 0x2
    148 uint32_t mze_cd = 0
    14c uint16_t mze_pad = 0
    14e char [50] mze_name = [ "SA_ATTRS" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
180 mzap_ent_phys_t {
    180 uint64_t mze_value = 0x3
    188 uint32_t mze_cd = 0
    18c uint16_t mze_pad = 0
    18e char [50] mze_name = [ "DELETE_QUEUE" ]
}
> .::print -a -t zfs`mzap_ent_phys_t
1c0 mzap_ent_phys_t {
    1c0 uint64_t mze_value = 0x4
    1c8 uint32_t mze_cd = 0
    1cc uint16_t mze_pad = 0
    1ce char [50] mze_name = [ "ROOT" ]
}
> $q
#

The root directory for the file system is object id 4 (mze_value from above). This is the 5th entry (starts at 0) in the array of dnode_phys_t for the file system. Let's take a look.

# mdb /tmp/dnodes
> ::status
debugging file '/tmp/dnodes' (object file)
> 4*200::print -a -t zfs`dnode_phys_t
800 dnode_phys_t {
    800 uint8_t dn_type = 0x14  <-- DMU_OT_DIRECTORY_CONTENTS
    ...
    840 blkptr_t [1] dn_blkptr = [
        840 blkptr_t {
            840 dva_t [3] blk_dva = [
                840 dva_t {
                    840 uint64_t [2] dva_word = [ 0x1, 0x2c3 ]
                },
                850 dva_t {
                    850 uint64_t [2] dva_word = [ 0x1, 0x90c3 ]
                },
                860 dva_t {
                    860 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
...

Directories are ZAP objects. We'll dump the blkptr_t, decompress if necessary, and find the words file that we copied into the file system at the beginning of this post.

> 840::blkptr
DVA[0]=<0:58600:200>
DVA[1]=<0:1218600:200>
[L0 DIRECTORY_CONTENTS] FLETCHER_4 OFF LE contiguous unique double
size=200L/200P birth=11L/11P fill=1
cksum=27626ee8e:109d3b9097a:395d35f5c237:8703c96b7bd4c
> $q
#

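Notice that compression is turned off, and there are no indirect blocks ("L0"). Since the block is not compressed, we can read it straight from the pool file. DVA offsets do not count the 4MB (0x400000) reserved at the front of the vdev for the labels and boot block, so we add 0x400000 to the offset.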

# mdb /var/tmp/zfsfile
> 400000+58600::print -a -t zfs`mzap_phys_t
458600 mzap_phys_t {
    458600 uint64_t mz_block_type = 0x8000000000000003
    458608 uint64_t mz_salt = 0x16d68999
    458610 uint64_t mz_normflags = 0
    458618 uint64_t [5] mz_pad = [ 0, 0, 0, 0, 0 ]
    458640 mzap_ent_phys_t [1] mz_chunk = [
        458640 mzap_ent_phys_t {
            458640 uint64_t mze_value = 0x8000000000000008
            458648 uint32_t mze_cd = 0
            45864c uint16_t mze_pad = 0
            45864e char [50] mze_name = [ "foo" ]
        },
    ]
}
> .::print -a -t zfs`mzap_ent_phys_t
458680 mzap_ent_phys_t {
    458680 uint64_t mze_value = 0x8000000000000009
    458688 uint32_t mze_cd = 0
    45868c uint16_t mze_pad = 0
    45868e char [50] mze_name = [ "words" ]
}
> $q
#

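The "words" file is at object id 9. (The 0x8 in the top 4 bits of mze_value is the directory entry type, here a regular file; the object id is in the low 48 bits.) Let's look at that dnode_phys_t.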
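
Splitting a directory's mze_value into its two parts can be sketched in Python as follows. The bit positions follow the ZFS_DIRENT_TYPE and ZFS_DIRENT_OBJ macros in the ZFS source; the function name is made up for illustration.

```python
def decode_dirent(mze_value):
    """Return (entry type, object id) from a directory ZAP value."""
    entry_type = mze_value >> 60            # top 4 bits; 8 is a regular file
    object_id = mze_value & ((1 << 48) - 1) # low 48 bits
    return entry_type, object_id
```

Here decode_dirent(0x8000000000000009) gives type 8 and object id 9, and the entry for "foo" decodes to object id 8 in the same way.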

# mdb /tmp/dnodes
> 9*200::print -a -t zfs`dnode_phys_t
1200 dnode_phys_t {
    1200 uint8_t dn_type = 0x13  <-- DMU_OT_PLAIN_FILE_CONTENTS
    ...
    1240 blkptr_t [1] dn_blkptr = [
        1240 blkptr_t {
            1240 dva_t [3] blk_dva = [
                1240 dva_t {
                    1240 uint64_t [2] dva_word = [ 0x2, 0x27f ]
                },
                1250 dva_t {
                    1250 uint64_t [2] dva_word = [ 0x2, 0x907f ]
                },
                1260 dva_t {
                    1260 uint64_t [2] dva_word = [ 0, 0 ]
                },
            ]
     ...

Let's look at the blkptr_t.

> 1240::blkptr
DVA[0]=<0:4fe00:400>
DVA[1]=<0:120fe00:400>
[L1 PLAIN_FILE_CONTENTS] FLETCHER_4 LZJB LE contiguous unique double
size=4000L/400P birth=9L/9P fill=2
cksum=5d1e925d95:3ed351070323:16995992c8e96c:5b9701a2a4ef414
> $q
#

This is a single indirect block (L1 in the above output). This makes sense, as the words file is too big to fit in a single 128KB block. We'll decompress and look at the resulting blkptr_ts.

# ./zuncompress -p 400 -l 4000 -o 4fe00 /var/tmp/zfsfile > /tmp/l1_file
# mdb /tmp/l1_file
> 0::blkptr
DVA[0]=<0:fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=2f6c9bcce37c:bd82a253b632bb1:acb0037ee619745c:5e7c6fc8adcedccd
> 80::blkptr
DVA[0]=<0:2fe00:20000>
[L0 PLAIN_FILE_CONTENTS] FLETCHER_4 OFF LE contiguous unique single
size=20000L/20000P birth=9L/9P fill=1
cksum=1bae53745c3f:9dc2421d31452d3:658d66823cf4fb0:11c158edbbfcc0f3
> 100::blkptr
<hole>
> $q
#

Now we'll go to the location specified by these block pointers to get our data.

# mdb /var/tmp/zfsfile
> 400000+fe00,20000/c
0x40fe00:       10th
                1st
                2nd
                3rd
                4th
                5th
                6th
                7th
                8th
                9th
                a
                AAA
                AAAS
                Aarhus
                Aaron
                AAU
                ABA
                Ababa
                aback
            ...

And there are the contents of the first 128KB of the file. The remainder of the file is in the block specified by the blkptr_t at offset 0x80 in the indirect block (/tmp/l1_file).

If the data were binary, it would be simple enough to use dd(1M) to seek to the correct location on the device and dump from there. For instance:

> (400000+fe00)%200=E
                8319            
> 20000%200=E
                256             
# dd if=/var/tmp/zfsfile iseek=8319 bs=512 count=256
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
...
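
To put the whole file back together, the level 0 blocks have to be read in order and concatenated. The Python sketch below does that for uncompressed blocks only. It assumes the 0x400000 label area at the front of the vdev, and the function name and its file_size parameter are made up for illustration. The real file length is kept in the file's attributes, which this post does not decode, so without it the last block comes back padded to the full block size.

```python
DATA_START = 0x400000   # labels and boot block at the front of the vdev

def read_blocks(image_path, blocks, file_size=None):
    """Concatenate uncompressed blocks given as (DVA offset, size) pairs."""
    out = bytearray()
    with open(image_path, "rb") as f:
        for dva_offset, size in blocks:
            f.seek(DATA_START + dva_offset)
            out += f.read(size)
    if file_size is not None:
        # Drop the padding after the end of the file, if the size is known.
        del out[file_size:]
    return bytes(out)
```

For the words file, the call would be read_blocks("/var/tmp/zfsfile", [(0xfe00, 0x20000), (0x2fe00, 0x20000)]), using the two DVAs from the indirect block above.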

That's a lot of work. Is there a way to just "see" all of the information? Yes, it's called zdb(1M). But zdb is not interactive, and it does not work with destroyed pools (or pools that won't import). Also, I find that using mdb this way forces you to understand the on-disk format. For me, that is much preferable to having it all done for me.

Everything in this post depends on mdb, so it will only work on illumos-based systems. I cannot include Solaris 11 or newer because there is no way to build mdb without source code. But what if you are using ZFS on Linux?

You could upload your devices (or files) as files to manta, along with the modified mdb, the zfs.so and rawzfs.so modules, and the zuncompress program. Then you use mlogin to log into the manta instance and try from there. I've included built copies of mdb, the modules, and zuncompress in the github repo. Note that I have not yet tried this, but it will likely be in a blog post in the next week or so.

Have fun!
