Bruning Questions: ZFS Record Size

March 29, 2013 - by rachelbalik


Marsell Kukuljevic of Joyent wrote me to say (paraphrasing):

"I thought ZFS record size is variable: by default it's 128K, but write 2KB of data (assuming nothing else writes), then only 2KB writes to disk (excluding metadata). What does record size actually enforce?

I assume this is affected by transaction groups, so if I write 6 2K files, it'll write a 12K record, but if I write 6 32K files, it'll write two records: 128K and 64K. That causes a problem with read and write amplification in future writes though, so I'm not sure if such behaviour makes sense. Maybe recordsize only affects writes within a file?

I'm asking this in context of one of the recommendations in the evil tuning guide, to use a recordsize of 8K to match with Postgres' buffer size. Fair enough, I presume this means that records written to disk are then always at most 8KB (ignoring any headers and footers), but how does compression factor into this?

I've noticed that Postgres compresses quite well. With LZJB it still gets ~3x. Assuming a recordsize of 8K, then it'd be about 3KB written to disk for that record (again, excluding all the metadata), right?"

The recordsize parameter sets the size of the largest block written to a ZFS file system or volume. There is an excellent blog post about ZFS recordsize by Roch Bourbonnais here. Note that ZFS does not always read/write recordsize bytes at a time. For instance, a 2KB write to a file will typically result in at least one 2KB write to disk (and possibly more for metadata); the recordsize is simply the largest block that ZFS will read or write. The interested reader can verify this with DTrace on bdev_strategy(); I'll leave that as an exercise. Also note that because of the way ZFS maintains information about allocated and free space on disk (i.e., space maps), a smaller recordsize should not result in more space or time being spent maintaining that information.

Instead of repeating the blog post, let's do some experimenting.

To make things easy (i.e., we don't want to sift through tens of thousands of lines of zdb(1M) output), we'll create a small pool and work with that. I'm assuming you are on a system that supports ZFS and has zdb. SmartOS would be an excellent choice...

#
# mkfile 100m /var/tmp/poolfile
# zpool create testpool /var/tmp/poolfile
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      default
testpool  compression  off       default
#

An alternative to using a file (/var/tmp/poolfile) is to create a child dataset with the zfs command and run zdb against that dataset. This also cuts down on the amount of data zdb displays.

We'll start with the simplest case:

# dd if=/dev/zero of=/testpool/foo bs=128k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        21    1    16K   128K   128K   128K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 03:50:24 2013
        mtime   Thu Mar 21 03:50:24 2013
        ctime   Thu Mar 21 03:50:24 2013
        crtime  Thu Mar 21 03:50:24 2013
        gen     2462
        mode    100644
        size    131072
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:1b4800:20000 20000L/20000P F=1 B=2462/2462

             segment [0000000000000000, 0000000000020000) size  128K
...
#

From the above output, we can see that the "foo" file has one block. It is on vdev 0 (the only vdev in the pool), at offset 0x1b4800 (relative to the 4MB of labels at the beginning of every disk in the pool), and its size is 0x20000 (=128KB). If you are following along and don't see the "/foo" file in your output, run sync or wait a few seconds; it can take up to 5 seconds before the data reaches disk. This also tells us that zdb reads from the disk, bypassing the ARC (which is what you want in a file system debugger).
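Each entry under "Indirect blocks" carries a DVA (data virtual address) of the form vdev:offset:size, with all three fields in hex. Here is a small sketch of decoding one (the helper name is mine, not a ZFS API):

```python
# Decode a zdb DVA string "vdev:offset:asize" (all fields hex).
# zdb reports offsets relative to the end of the 4MB reserved
# for labels and boot block at the front of each disk, so the
# raw device offset adds 0x400000.

LABEL_RESERVE = 0x400000  # 4MB

def parse_dva(dva):
    vdev, offset, size = (int(field, 16) for field in dva.split(":"))
    return {"vdev": vdev, "offset": offset, "size": size,
            "raw_offset": offset + LABEL_RESERVE}

print(parse_dva("0:1b4800:20000"))
```

For the /foo block above, this yields a size of 0x20000 (131072 bytes, i.e., one full 128KB record).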

Now let's do the same for a 2KB file.

# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=2k count=1
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        22    1    16K     2K     2K     2K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:21:25 2013
        mtime   Thu Mar 21 04:21:25 2013
        ctime   Thu Mar 21 04:21:25 2013
        crtime  Thu Mar 21 04:21:25 2013
        gen     2839
        mode    100644
        size    2048
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L0 0:180000:800 800L/800P F=1 B=2839/2839

           segment [0000000000000000, 0000000000000800) size    2K

...
#

So, as Marsell notes, block size is variable. Here, the foo file is at offset 0x180000 and its size is 0x800 (=2KB). What happens if we give dd a block size larger than 128KB?

# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=256k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
        23    2    16K   128K   258K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /foo
        uid     0
        gid     0
        atime   Thu Mar 21 04:23:32 2013
        mtime   Thu Mar 21 04:23:32 2013
        ctime   Thu Mar 21 04:23:32 2013
        crtime  Thu Mar 21 04:23:32 2013
        gen     2868
        mode    100644
        size    262144
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:1f0c00:400 0:12b5a00:400 4000L/400P F=2 B=2868/2868
               0  L0 0:1b3800:20000 20000L/20000P F=1 B=2868/2868
           20000  L0 0:180000:20000 20000L/20000P F=1 B=2868/2868

         segment [0000000000000000, 0000000000040000) size  256K

...
#

This time, the file has two blocks, each 128KB. Because the data does not fit into one block, there is one indirect block (a block containing block pointers) at 0x1f0c00, which is 0x400 bytes (1KB) on disk. The indirect block is compressed: "4000L/400P" gives the logical size (4000L) and the physical size (400P), so decompressed it is 0x4000 bytes (=16KB). Logical is the size after decompression; physical is the size as stored on disk. Note that compression is off for this pool, yet the indirect block is compressed anyway: the compression property applies only to data blocks. Metadata, including indirect blocks, is always compressed (with lzjb, as far as I know).
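The L/P size pairs can be decoded the same way; here is a quick sketch (the helper name is mine) that turns a pair such as 4000L/400P into logical and physical byte counts:

```python
def parse_sizes(pair):
    """Split a zdb size pair like '4000L/400P' into
    (logical, physical) byte counts (both hex in zdb output)."""
    logical, physical = pair.split("/")
    return int(logical.rstrip("L"), 16), int(physical.rstrip("P"), 16)

logical, physical = parse_sizes("4000L/400P")
print(logical, physical, logical // physical)  # 16384 1024 16
```

So this 16KB indirect block compressed 16x, down to a single 1KB block on disk.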

Now we'll try creating six 2KB files and see what that gives us. (Note that most of the zdb output has been omitted.)

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; done
...
# sync
# zdb -dddddddd testpool
...
       path    /f1
Indirect blocks:
               0 L0 0:80000:800 800L/800P F=1 B=4484/4484

       path    /f2
Indirect blocks:
               0 L0 0:81200:800 800L/800P F=1 B=4484/4484

       path    /f3
Indirect blocks:
               0 L0 0:81a00:800 800L/800P F=1 B=4484/4484

       path    /f4
Indirect blocks:
               0 L0 0:82200:800 800L/800P F=1 B=4484/4484

       path    /f5
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4484/4484

       path    /f6
Indirect blocks:
               0 L0 0:87200:800 800L/800P F=1 B=4484/4484
...

So, they all fit in the same 128KB region (between 0x80000 and 0xa0000), and they are all in the same transaction group (4484). There is a gap between the space used for f1 and f2, but f2 through f6 are contiguous. Does this result in one write to the disk? Hard to say: it is difficult to correlate writes to ZFS files with writes to the disk, and here the "disk" is actually a file. It should be possible to find out with DTrace, using a child dataset in a pool built on real disks. Would we get the same behavior if the writes were in separate transaction groups? Let's find out.

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; sleep 6; done
...
# zdb -dddddddd testpool
...
       path    /f1
Indirect blocks:
               0 L0 0:ef400:800 800L/800P F=1 B=4827/4827

       path    /f2
Indirect blocks:
               0 L0 0:fd800:800 800L/800P F=1 B=4828/4828

       path    /f3
Indirect blocks:
               0 L0 0:82a00:800 800L/800P F=1 B=4829/4829

       path    /f4
Indirect blocks:
               0 L0 0:88000:800 800L/800P F=1 B=4831/4831

       path    /f5
Indirect blocks:
               0 L0 0:8b200:800 800L/800P F=1 B=4832/4832

       path    /f6
Indirect blocks:
               0 L0 0:8ca00:800 800L/800P F=1 B=4833/4833
...

Each write is in a different transaction group, and the blocks are no longer contiguous; some land in entirely different 128KB regions of the disk.

So, back to Marsell's questions. Marsell suggests that writing six 2KB files results in one 12KB write. That is not clear from the above output: it may be six 2KB writes, one 128KB write, or even one 12KB write. I ran the first 6x2KB experiment a second time, and all of the files were again contiguous on disk. It is also possible for the blocks to end up contiguous even when the writes are in different transaction groups.

Let's write 6 32KB files.

# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=32k count=1; done
# zdb -dddddddd testpool
...
       path    /f1
Indirect blocks:
               0 L0 0:8da00:8000 8000L/8000P F=1 B=5108/5108

       path    /f2
Indirect blocks:
               0 L0 0:a8e00:8000 8000L/8000P F=1 B=5108/5108

       path    /f3
Indirect blocks:
               0 L0 0:b0e00:8000 8000L/8000P F=1 B=5108/5108

       path    /f4
Indirect blocks:
               0 L0 0:b8e00:8000 8000L/8000P F=1 B=5108/5108

       path    /f5
Indirect blocks:
               0 L0 0:efc00:8000 8000L/8000P F=1 B=5108/5108

       path    /f6
Indirect blocks:
               0 L0 0:da400:8000 8000L/8000P F=1 B=5108/5108
...

The writes are all in the same transaction group, but not all in the same 128KB region. In fact, a single write may be spread across transaction groups. This implies that data can be lost: not all data written in one write call ends up on disk if there is a power failure at the wrong moment. ZFS guarantees consistency of the file system, i.e., each transaction is all or nothing, but if a write spans multiple transactions, some of them may not make it to disk. Applications concerned about this should either use synchronous writes or have some other recovery mechanism. Note that synchronous writes use the ZFS intent log (ZIL), so performance need not be compromised.

Here is a single write of 4MB.

# dd if=/dev/zero of=/testpool/big bs=4096k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
      path    /big
Indirect blocks:
               0 L1  0:620000:400 0:1300000:400 4000L/400P F=32 B=5410/5410
               0  L0 0:200000:20000 20000L/20000P F=1 B=5409/5409
           20000  L0 0:220000:20000 20000L/20000P F=1 B=5409/5409
           40000  L0 0:240000:20000 20000L/20000P F=1 B=5409/5409
           60000  L0 0:260000:20000 20000L/20000P F=1 B=5409/5409
           80000  L0 0:280000:20000 20000L/20000P F=1 B=5409/5409
           a0000  L0 0:2a0000:20000 20000L/20000P F=1 B=5409/5409
           c0000  L0 0:2c0000:20000 20000L/20000P F=1 B=5409/5409
           e0000  L0 0:2e0000:20000 20000L/20000P F=1 B=5409/5409
          100000  L0 0:300000:20000 20000L/20000P F=1 B=5409/5409
          120000  L0 0:320000:20000 20000L/20000P F=1 B=5409/5409
          140000  L0 0:340000:20000 20000L/20000P F=1 B=5409/5409
          160000  L0 0:360000:20000 20000L/20000P F=1 B=5409/5409
          180000  L0 0:380000:20000 20000L/20000P F=1 B=5409/5409
          1a0000  L0 0:3a0000:20000 20000L/20000P F=1 B=5409/5409
          1c0000  L0 0:3c0000:20000 20000L/20000P F=1 B=5409/5409
          1e0000  L0 0:3e0000:20000 20000L/20000P F=1 B=5409/5409
          200000  L0 0:400000:20000 20000L/20000P F=1 B=5409/5409
          220000  L0 0:420000:20000 20000L/20000P F=1 B=5409/5409
          240000  L0 0:440000:20000 20000L/20000P F=1 B=5409/5409
          260000  L0 0:460000:20000 20000L/20000P F=1 B=5409/5409
          280000  L0 0:485e00:20000 20000L/20000P F=1 B=5410/5410
          2a0000  L0 0:4a5e00:20000 20000L/20000P F=1 B=5410/5410
          2c0000  L0 0:4c5e00:20000 20000L/20000P F=1 B=5410/5410
          2e0000  L0 0:500000:20000 20000L/20000P F=1 B=5410/5410
          300000  L0 0:520000:20000 20000L/20000P F=1 B=5410/5410
          320000  L0 0:540000:20000 20000L/20000P F=1 B=5410/5410
          340000  L0 0:560000:20000 20000L/20000P F=1 B=5410/5410
          360000  L0 0:580000:20000 20000L/20000P F=1 B=5410/5410
          380000  L0 0:5a0000:20000 20000L/20000P F=1 B=5410/5410
          3a0000  L0 0:5c0000:20000 20000L/20000P F=1 B=5410/5410
          3c0000  L0 0:5e0000:20000 20000L/20000P F=1 B=5410/5410
          3e0000  L0 0:600000:20000 20000L/20000P F=1 B=5410/5410

The write is spread across two transaction groups. Examining the code in zfs_write(), you can see that each write is broken into recordsize chunks, each of which results in a separate transaction (see the calls to dmu_tx_create() in that code). Those transactions can be assigned to different transaction groups (see the calls to dmu_tx_assign()).
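So the number of transactions a write generates is just the write size divided by the recordsize, rounded up. A sketch of the arithmetic (not actual ZFS code):

```python
def records_for_write(nbytes, recordsize=128 * 1024):
    """How many recordsize-sized chunks (and thus transactions)
    a single write of nbytes generates, rounding up."""
    return -(-nbytes // recordsize)  # ceiling division

print(records_for_write(4 * 1024 * 1024))       # 32: matches F=32 on the L1 above
print(records_for_write(128 * 1024, 8 * 1024))  # 16: a 128KB write at 8K recordsize
```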

If the recordsize is set to 8K, the maximum size of a data block will be 8KB. Let's give that a try and look at the results. Note that blocks that are already allocated are not affected.

# zfs set recordsize=8192 testpool
# zfs get recordsize testpool
NAME      PROPERTY    VALUE    SOURCE
testpool  recordsize  8K       local
# dd if=/dev/zero of=/testpool/smallblock bs=128k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
      path    /smallblock
Indirect blocks:
               0 L1  0:64e800:400 0:1312c00:400 4000L/400P F=16 B=5653/5653
               0  L0 0:624800:2000 2000L/2000P F=1 B=5653/5653
            2000  L0 0:627c00:2000 2000L/2000P F=1 B=5653/5653
            4000  L0 0:632800:2000 2000L/2000P F=1 B=5653/5653
            6000  L0 0:634800:2000 2000L/2000P F=1 B=5653/5653
            8000  L0 0:636800:2000 2000L/2000P F=1 B=5653/5653
            a000  L0 0:638800:2000 2000L/2000P F=1 B=5653/5653
            c000  L0 0:63a800:2000 2000L/2000P F=1 B=5653/5653
            e000  L0 0:63c800:2000 2000L/2000P F=1 B=5653/5653
           10000  L0 0:63e800:2000 2000L/2000P F=1 B=5653/5653
           12000  L0 0:640800:2000 2000L/2000P F=1 B=5653/5653
           14000  L0 0:642800:2000 2000L/2000P F=1 B=5653/5653
           16000  L0 0:644800:2000 2000L/2000P F=1 B=5653/5653
           18000  L0 0:646800:2000 2000L/2000P F=1 B=5653/5653
           1a000  L0 0:648800:2000 2000L/2000P F=1 B=5653/5653
           1c000  L0 0:64a800:2000 2000L/2000P F=1 B=5653/5653
           1e000  L0 0:64c800:2000 2000L/2000P F=1 B=5653/5653

Basically, the behavior is the same as with the default 128KB recordsize, except that the maximum size of a data block is now 8KB. Note that the indirect block is still 16KB logical ("4000L"): recordsize governs file data blocks, while metadata blocks are sized independently. Any data modified from now on (via copy-on-write) will use the smaller block size. As for performance implications, I'll leave those to the Roch Bourbonnais blog referenced at the beginning.

For compression, nothing fundamental changes: the recordsize caps the logical (uncompressed) size of a block, and the physical size on disk can be smaller. So in Marsell's Postgres example, an 8KB record compressing 3x would indeed occupy roughly 3KB on disk, excluding metadata. We'll reset the recordsize to the default and turn on lzjb compression.

# zfs set recordsize=128k testpool
# zfs set compression=lzjb testpool
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE     SOURCE
testpool  recordsize   128K      local
testpool  compression  lzjb      local
#

And write 256KB...

# dd if=/dev/zero of=/testpool/zero bs=128k count=2
2+0 records in
2+0 records out
# zdb -dddddddd testpool
...
       path    /zero
Indirect blocks:

Good: compression of all zeros resulted in no blocks being allocated at all. Now let's write some real data.

# dd if=/usr/dict/words of=/testpool/foo.compressed bs=128k count=2
1+1 records in
1+1 records out
# zdb -dddddddd testpool
...
       path    /foo.compressed
Indirect blocks:
               0 L1  0:6b2200:400 0:1390800:400 4000L/400P F=2 B=5830/5830
               0  L0 0:690600:15200 20000L/15200P F=1 B=5830/5830
           20000  L0 0:6a5800:ca00 20000L/ca00P F=1 B=5830/5830

So: one indirect block and two blocks of compressed data. The first data block is 0x15200 bytes on disk, the second 0xca00, while each is 0x20000 (128KB) logical. The two blocks are contiguous, so it is possible they were written in a single write to the disk.
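Using the logical and physical sizes from that output, we can estimate the compression ratio this file achieved (the numbers are copied from the zdb output above):

```python
# (logical, physical) sizes of the two data blocks of
# /testpool/foo.compressed, from the zdb output.
blocks = [(0x20000, 0x15200), (0x20000, 0xca00)]

logical = sum(l for l, p in blocks)
physical = sum(p for l, p in blocks)
print(f"{logical} -> {physical} bytes, ratio {logical / physical:.2f}x")
```

About 1.9x for /usr/dict/words. Marsell's ~3x for Postgres data is plausible nonetheless; the ratio depends entirely on the data.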

To conclude: recordsize is handled at the block level. It is the maximum size of a block that ZFS will write. Existing data and metadata are not rewritten when the recordsize is changed or compression is enabled; the new settings apply only to newly written blocks. As for performance tuning, I would be careful about putting too much faith in the ZFS evil tuning guide: it is dated, some of its descriptions are inaccurate, and some topics are missing.

I'll have another ZFS-related blog post soon; I'm currently waiting for a bug to be fixed in zdb.

You can also submit your DTrace, MDB, or SmartOS questions to be answered in a future column by emailing MrBruning@joyent.com, ask at #BruningQuestions, or attend one of my upcoming courses.

