March 29, 2013 - by rachelbalik
Marsell Kukuljevic of Joyent wrote me to say (paraphrasing):
"I thought ZFS record size is variable: by default it's 128K, but write 2KB of data (assuming nothing else writes), then only 2KB writes to disk (excluding metadata). What does record size actually enforce?
I assume this is affected by transaction groups, so if I write 6 2K files, it'll write a 12K record, but if I write 6 32K files, it'll write two records: 128K and 64K. That causes a problem with read and write amplification on future writes though, so I'm not sure such behaviour makes sense. Maybe recordsize only affects writes within a file?
I'm asking this in the context of one of the recommendations in the evil tuning guide, to use a recordsize of 8K to match Postgres' buffer size. Fair enough; I presume this means that records written to disk are then always at most 8KB (ignoring any headers and footers), but how does compression factor into this?
I've noticed that Postgres compresses quite well. With LZJB it still gets ~3x. Assuming a recordsize of 8K, then it'd be about 3KB written to disk for that record (again, excluding all the metadata), right?"
The recordsize property enforces the size of the largest block written to a ZFS file system or volume. There is an excellent blog post about ZFS recordsize by Roch Bourbonnais. Note that ZFS does not always read/write recordsize bytes: a 2KB write to a file, for instance, typically results in at least one 2KB write to disk (and possibly more than one for metadata). The recordsize is simply the largest block that ZFS will read or write. The interested reader can verify this by using DTrace on bdev_strategy(); a sketch follows, and the rest is left as an exercise. Also note that, because of the way ZFS maintains information about allocated/free space on disk (i.e., spacemaps), a smaller recordsize should not result in more space or time being used to maintain that information.
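For instance, here is one way to do that (a sketch; it assumes an illumos/SmartOS system with DTrace, and it watches all block-device I/O on the box, not just this pool):

# dtrace -n 'fbt::bdev_strategy:entry { @sizes = quantize(args[0]->b_bcount); }'

Kick off a dd in another window, then interrupt dtrace with Ctrl-C to get a power-of-two histogram of the I/O sizes handed to the block device layer.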
Instead of repeating the blog post, let's do some experimenting.
To make things easy (i.e., we don't want to sift through tens of thousands of lines of zdb(1M) output), we'll create a small pool and work with that. I'm assuming you are on a system that supports ZFS and has zdb. SmartOS would be an excellent choice...
#
# mkfile 100m /var/tmp/poolfile
# zpool create testpool /var/tmp/poolfile
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE  SOURCE
testpool  recordsize   128K   default
testpool  compression  off    default
#
An alternative to using a file-backed pool (/var/tmp/poolfile) is to create a child dataset with the zfs command and run zdb on that child dataset. This also cuts down on the amount of data zdb displays.
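For example (the dataset name here is arbitrary):

# zfs create testpool/test
# zdb -dddddddd testpool/test

zdb accepts a dataset name as well as a pool name, and then only dumps objects from that dataset.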
We'll start with the simplest case:
# dd if=/dev/zero of=/testpool/foo bs=128k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
Object  lvl  iblk  dblk  dsize  lsize   %full  type
    21    1   16K  128K   128K   128K  100.00  ZFS plain file (K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path /foo
uid 0
gid 0
atime Thu Mar 21 03:50:24 2013
mtime Thu Mar 21 03:50:24 2013
ctime Thu Mar 21 03:50:24 2013
crtime Thu Mar 21 03:50:24 2013
gen 2462
mode 100644
size 131072
parent 4
links 1
pflags 40800000004
Indirect blocks:
0 L0 0:1b4800:20000 20000L/20000P F=1 B=2462/2462
segment [0000000000000000, 0000000000020000) size 128K
...
#
From the above output, we can see that the "foo" file has one block. It is on vdev 0 (the only vdev in the pool), at offset 0x1b4800 (offsets are relative to the start of the vdev's allocatable space, which begins 4MB into the device, after the front labels and boot block), and its size is 0x20000 (=128K). Note that if you're following along and don't see the "/foo" file in your output, run sync, or wait a few seconds; it generally takes up to 5 seconds (one transaction group) before the data is on disk. This also implies that zdb reads from disk, bypassing the ARC, which is what you want in a file system debugger.
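If you want to see the raw block, zdb can read it straight off the pool given the vdev:offset:size triple from the block pointer. A sketch, using the values from the output above (check zdb's usage message for the exact -R syntax on your build):

# zdb -R testpool 0:1b4800:20000

In this case the block is all zeros (we wrote from /dev/zero), so the dump is not very exciting.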
Now let's do the same for a 2KB file.
# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=2k count=1
# zdb -dddddddd testpool
...
Object  lvl  iblk  dblk  dsize  lsize   %full  type
    22    1   16K    2K     2K     2K  100.00  ZFS plain file (K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 0
path /foo
uid 0
gid 0
atime Thu Mar 21 04:21:25 2013
mtime Thu Mar 21 04:21:25 2013
ctime Thu Mar 21 04:21:25 2013
crtime Thu Mar 21 04:21:25 2013
gen 2839
mode 100644
size 2048
parent 4
links 1
pflags 40800000004
Indirect blocks:
0 L0 0:180000:800 800L/800P F=1 B=2839/2839
segment [0000000000000000, 0000000000000800) size 2K
...
#
So, as Marsell notes, the block size is variable. Here, the foo file's single block is at offset 0x180000 and its size is 0x800 (=2K). What happens if we give dd a block size larger than 128KB?
# rm /testpool/foo
# dd if=/dev/zero of=/testpool/foo bs=256k count=1
1+0 records in
1+0 records out
# zdb -dddddddd testpool
...
Object  lvl  iblk  dblk  dsize  lsize   %full  type
    23    2   16K  128K   258K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 1
path /foo
uid 0
gid 0
atime Thu Mar 21 04:23:32 2013
mtime Thu Mar 21 04:23:32 2013
ctime Thu Mar 21 04:23:32 2013
crtime Thu Mar 21 04:23:32 2013
gen 2868
mode 100644
size 262144
parent 4
links 1
pflags 40800000004
Indirect blocks:
0 L1 0:1f0c00:400 0:12b5a00:400 4000L/400P F=2 B=2868/2868
0 L0 0:1b3800:20000 20000L/20000P F=1 B=2868/2868
20000 L0 0:180000:20000 20000L/20000P F=1 B=2868/2868
segment [0000000000000000, 0000000000040000) size 256K
...
#
This time, the file has 2 blocks, each 128KB. Because the data does not fit into one block, there is one indirect block (a block containing block pointers) at 0x1f0c00, and it is 0x400 (1KB) on disk. (Notice that the indirect block line shows two DVAs, 0:1f0c00:400 and 0:12b5a00:400; ZFS keeps redundant copies of metadata.) The indirect block is compressed: decompressed, it is 0x4000 bytes (=16KB). The "4000L/400P" notation refers to the logical size (4000L) and the physical size (400P); logical is the size after decompression, physical is the size as stored on disk. Note that the compression property (which is off here) does not affect indirect blocks; metadata, including indirect blocks, is always compressed (always with lzjb??).
Now we'll try creating six 2KB files and see what that gives us. (Most of the zdb output is omitted below.)
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; done
...
# sync
# zdb -dddddddd testpool
...
path /f1
Indirect blocks:
0 L0 0:80000:800 800L/800P F=1 B=4484/4484
path /f2
Indirect blocks:
0 L0 0:81200:800 800L/800P F=1 B=4484/4484
path /f3
Indirect blocks:
0 L0 0:81a00:800 800L/800P F=1 B=4484/4484
path /f4
Indirect blocks:
0 L0 0:82200:800 800L/800P F=1 B=4484/4484
path /f5
Indirect blocks:
0 L0 0:82a00:800 800L/800P F=1 B=4484/4484
path /f6
Indirect blocks:
0 L0 0:87200:800 800L/800P F=1 B=4484/4484
...
So, all six blocks were allocated from the same 128KB region of the disk (between 0x80000 and 0xa0000), and they are all in the same transaction group (4484). There is a gap between the space used for f1 and f2, but f2 through f6 are contiguous. Does this result in one write to the disk? Hard to say: it is difficult to correlate writes to disk with writes to ZFS files, and here the "disk" is actually a file. It should be possible to determine whether it is one write or several by using DTrace and a child dataset in a pool with real disks. Would we get the same behavior if the writes were in separate transaction groups? Let's try to find out.
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=2k count=1; sleep 6; done
...
# zdb -dddddddd testpool
...
path /f1
Indirect blocks:
0 L0 0:ef400:800 800L/800P F=1 B=4827/4827
path /f2
Indirect blocks:
0 L0 0:fd800:800 800L/800P F=1 B=4828/4828
path /f3
Indirect blocks:
0 L0 0:82a00:800 800L/800P F=1 B=4829/4829
path /f4
Indirect blocks:
0 L0 0:88000:800 800L/800P F=1 B=4831/4831
path /f5
Indirect blocks:
0 L0 0:8b200:800 800L/800P F=1 B=4832/4832
path /f6
Indirect blocks:
0 L0 0:8ca00:800 800L/800P F=1 B=4833/4833
...
Each file is written in a different transaction group, and the blocks are no longer contiguous; some land in different 128KB regions of the disk.
So, back to Marsell's questions. Marsell says that if he writes six 2KB files, it will be one 12KB write. That is not clear from the above output: it might be six 2KB writes, one 128KB write, or even one 12KB write. I ran the first six-2KB-file experiment a second time, and all of the files were contiguous on disk. It is also possible that the blocks end up contiguous even when the writes land in different transaction groups.
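Earlier I said this could be determined with DTrace on a pool backed by real disks. Here is roughly what I have in mind (a sketch I have not run for this post; "sd0" is a placeholder for whatever your pool's disk is called):

# dtrace -n 'io:::start /!(args[0]->b_flags & B_READ) && args[1]->dev_statname == "sd0"/ { @[args[0]->b_bcount] = count(); }'

Run the file-creation loop in another window, then interrupt dtrace with Ctrl-C. If the six 2KB files were coalesced, you would expect to see one large write rather than six small ones, plus whatever metadata writes happen in the same transaction group.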
Let's write 6 32KB files.
# for i in {1..6}; do dd if=/dev/zero of=/testpool/f$i bs=32k count=1; done
# zdb -dddddddd testpool
...
path /f1
Indirect blocks:
0 L0 0:8da00:8000 8000L/8000P F=1 B=5108/5108
path /f2
Indirect blocks:
0 L0 0:a8e00:8000 8000L/8000P F=1 B=5108/5108
path /f3
Indirect blocks:
0 L0 0:b0e00:8000 8000L/8000P F=1 B=5108/5108
path /f4
Indirect blocks:
0 L0 0:b8e00:8000 8000L/8000P F=1 B=5108/5108
path /f5
Indirect blocks:
0 L0 0:efc00:8000 8000L/8000P F=1 B=5108/5108
path /f6
Indirect blocks:
0 L0 0:da400:8000 8000L/8000P F=1 B=5108/5108
...
The writes are all in the same transaction group, but not all in the same 128KB region of the disk. In fact, a single write may be spread across transaction groups (we'll see this next). Note that this implies there can be data loss: if there is a power failure, not all of the data written by a single write call necessarily ends up on disk. ZFS guarantees consistency of the file system, i.e., each transaction is all or nothing, but if a write spans multiple transactions, some of them may not make it to disk. Applications that care about this should either use synchronous writes or have some other recovery mechanism. Note that synchronous writes go through the ZFS intent log (ZIL), so performance need not be badly compromised.
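As an aside (a sketch, and not something the experiments here depend on): if changing the application is not an option, the sync dataset property can force all writes through the ZIL; check zfs(1M) on your system to confirm the property is available.

# zfs set sync=always testpool

(And zfs set sync=standard testpool puts it back to the default.)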
Here is a single write of 4MB.
# dd if=/dev/zero of=/testpool/big bs=4096k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
path /big
Indirect blocks:
0 L1 0:620000:400 0:1300000:400 4000L/400P F=32 B=5410/5410
0 L0 0:200000:20000 20000L/20000P F=1 B=5409/5409
20000 L0 0:220000:20000 20000L/20000P F=1 B=5409/5409
40000 L0 0:240000:20000 20000L/20000P F=1 B=5409/5409
60000 L0 0:260000:20000 20000L/20000P F=1 B=5409/5409
80000 L0 0:280000:20000 20000L/20000P F=1 B=5409/5409
a0000 L0 0:2a0000:20000 20000L/20000P F=1 B=5409/5409
c0000 L0 0:2c0000:20000 20000L/20000P F=1 B=5409/5409
e0000 L0 0:2e0000:20000 20000L/20000P F=1 B=5409/5409
100000 L0 0:300000:20000 20000L/20000P F=1 B=5409/5409
120000 L0 0:320000:20000 20000L/20000P F=1 B=5409/5409
140000 L0 0:340000:20000 20000L/20000P F=1 B=5409/5409
160000 L0 0:360000:20000 20000L/20000P F=1 B=5409/5409
180000 L0 0:380000:20000 20000L/20000P F=1 B=5409/5409
1a0000 L0 0:3a0000:20000 20000L/20000P F=1 B=5409/5409
1c0000 L0 0:3c0000:20000 20000L/20000P F=1 B=5409/5409
1e0000 L0 0:3e0000:20000 20000L/20000P F=1 B=5409/5409
200000 L0 0:400000:20000 20000L/20000P F=1 B=5409/5409
220000 L0 0:420000:20000 20000L/20000P F=1 B=5409/5409
240000 L0 0:440000:20000 20000L/20000P F=1 B=5409/5409
260000 L0 0:460000:20000 20000L/20000P F=1 B=5409/5409
280000 L0 0:485e00:20000 20000L/20000P F=1 B=5410/5410
2a0000 L0 0:4a5e00:20000 20000L/20000P F=1 B=5410/5410
2c0000 L0 0:4c5e00:20000 20000L/20000P F=1 B=5410/5410
2e0000 L0 0:500000:20000 20000L/20000P F=1 B=5410/5410
300000 L0 0:520000:20000 20000L/20000P F=1 B=5410/5410
320000 L0 0:540000:20000 20000L/20000P F=1 B=5410/5410
340000 L0 0:560000:20000 20000L/20000P F=1 B=5410/5410
360000 L0 0:580000:20000 20000L/20000P F=1 B=5410/5410
380000 L0 0:5a0000:20000 20000L/20000P F=1 B=5410/5410
3a0000 L0 0:5c0000:20000 20000L/20000P F=1 B=5410/5410
3c0000 L0 0:5e0000:20000 20000L/20000P F=1 B=5410/5410
3e0000 L0 0:600000:20000 20000L/20000P F=1 B=5410/5410
The write is spread across 2 transaction groups. Examining the code in zfs_write(), you can see that each write is broken into recordsize-sized chunks, and each chunk results in a separate transaction (see the calls to dmu_tx_create() in that code). Those transactions can be spread across multiple transaction groups (see the calls to dmu_tx_assign()).
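A rough way to see this without reading the code (a sketch; the count will also include any other transactions created on the system while dd runs):

# dtrace -n 'fbt::dmu_tx_create:entry { @ = count(); }' -c "dd if=/dev/zero of=/testpool/big2 bs=4096k count=1"

With a 128KB recordsize I would expect on the order of 32 transactions for the 4MB of data, plus a few more for creating the file and updating metadata. (The file name big2 is just to avoid overwriting the file above.)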
If the recordsize is set to 8K, the maximum size of a block will be 8KB. Let's give that a try and look at the results. Note that blocks that are already allocated are not affected.
# zfs set recordsize=8192 testpool
# zfs get recordsize testpool
NAME      PROPERTY    VALUE  SOURCE
testpool  recordsize  8K     local
# dd if=/dev/zero of=/testpool/smallblock bs=128k count=1
1+0 records in
1+0 records out
# sync
# zdb -dddddddd testpool
...
path /smallblock
Indirect blocks:
0 L1 0:64e800:400 0:1312c00:400 4000L/400P F=16 B=5653/5653
0 L0 0:624800:2000 2000L/2000P F=1 B=5653/5653
2000 L0 0:627c00:2000 2000L/2000P F=1 B=5653/5653
4000 L0 0:632800:2000 2000L/2000P F=1 B=5653/5653
6000 L0 0:634800:2000 2000L/2000P F=1 B=5653/5653
8000 L0 0:636800:2000 2000L/2000P F=1 B=5653/5653
a000 L0 0:638800:2000 2000L/2000P F=1 B=5653/5653
c000 L0 0:63a800:2000 2000L/2000P F=1 B=5653/5653
e000 L0 0:63c800:2000 2000L/2000P F=1 B=5653/5653
10000 L0 0:63e800:2000 2000L/2000P F=1 B=5653/5653
12000 L0 0:640800:2000 2000L/2000P F=1 B=5653/5653
14000 L0 0:642800:2000 2000L/2000P F=1 B=5653/5653
16000 L0 0:644800:2000 2000L/2000P F=1 B=5653/5653
18000 L0 0:646800:2000 2000L/2000P F=1 B=5653/5653
1a000 L0 0:648800:2000 2000L/2000P F=1 B=5653/5653
1c000 L0 0:64a800:2000 2000L/2000P F=1 B=5653/5653
1e000 L0 0:64c800:2000 2000L/2000P F=1 B=5653/5653
Basically, the behavior is the same as with the default 128KB recordsize, except that the maximum size of a data block is now 8KB. Note that the indirect block is still 16KB logically (the 4000L in the L1 line above); recordsize limits the data blocks of files and volumes, not metadata. As for the performance implications, I'll leave that to the Roch Bourbonnais blog post referenced at the beginning.
Compression does not change any of this: the recordsize is still the maximum logical size of a block; compression only reduces the physical size written to disk. We'll reset the recordsize to the default and turn on lzjb compression.
# zfs set recordsize=128k testpool
# zfs set compression=lzjb testpool
# zfs get recordsize,compression testpool
NAME      PROPERTY     VALUE  SOURCE
testpool  recordsize   128K   local
testpool  compression  lzjb   local
#
And write 256KB...
# dd if=/dev/zero of=/testpool/zero bs=128k count=2
2+0 records in
2+0 records out
# zdb -dddddddd testpool
...
path /zero
Indirect blocks:
Good: a file full of zeros compressed away to no blocks at all (with compression enabled, all-zero blocks are simply stored as holes). Let's write some real data.
# dd if=/usr/dict/words of=/testpool/foo.compressed bs=128k count=2
1+1 records in
1+1 records out
# zdb -dddddddd testpool
...
path /foo.compressed
Indirect blocks:
0 L1 0:6b2200:400 0:1390800:400 4000L/400P F=2 B=5830/5830
0 L0 0:690600:15200 20000L/15200P F=1 B=5830/5830
20000 L0 0:6a5800:ca00 20000L/ca00P F=1 B=5830/5830
So, one indirect block and two blocks of compressed data. The first data block is 0x15200 bytes (~84.5KB) on disk for 0x20000 (128KB) of logical data, roughly a 1.5x compression ratio; the second is 0xca00. The two blocks are contiguous on disk, so it is possible they were written in a single write.
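Coming back to Marsell's compression question: with an 8K recordsize and data that really does compress about 3x under lzjb, each (at most) 8KB logical block should indeed end up as roughly 3KB on disk, just as these 128KB blocks ended up as 0x15200 and 0xca00 bytes. For an aggregate view, the compressratio property reports the overall ratio for a dataset:

# zfs get compressratio testpool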
To conclude: recordsize is handled at the block level. It is the maximum size of a block that ZFS will write. Existing data and metadata are not rewritten when the recordsize is changed or compression is enabled. As for performance tuning, I would be careful about putting too much faith in the ZFS evil tuning guide; it is dated, some of the descriptions are not accurate, and there are things missing.
I'll have another ZFS-related post soon. Currently I'm waiting for a bug to be fixed in zdb.