Bruning Questions: ZFS RaidZ Striping

April 12, 2013 - by Mr. Max Bruning

Recently on the ZFS mailing list (see http://wiki.illumos.org/display/illumos/illumos+Mailing+Lists), there was some discussion about how ZFS distributes data across disks. I thought I might show some work I've done to better understand this.

The disk blocks for a raidz/raidz2/raidz3 vdev are striped across all of the disks in that vdev. So, for instance, the data for block offset 0xc00000 with size 0x20000 (as reported by zdb(1M)) could be spread at different locations and in various sizes on the individual disks within the raidz volume. In other words, the offsets and sizes reported are relative to the volume, not to the individual disks.

The work of mapping a raidz pool offset and size to individual disks within the pool is done by vdev_raidz_map_alloc(). (Note that this routine has been changed in SmartOS to allow system crash dumps to be written to raidz volumes, a change that will eventually be pushed upstream to illumos.)

Let's go through an example. First, we'll set up a raidz pool and put some data into it.

# mkfile 100m /var/tmp/f0 /var/tmp/f1 /var/tmp/f2 /var/tmp/f3 /var/tmp/f4
# zpool create rzpool raidz /var/tmp/f0 /var/tmp/f1 /var/tmp/f2 /var/tmp/f3 /var/tmp/f4
# cp /usr/dict/words /rzpool/words
#

And now let's see the blocks assigned to the /rzpool/words file.

# zdb -dddddddd rzpool
...
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         8    2    16K   128K   259K   256K  100.00  ZFS plain file (K=inherit) (Z=inherit)
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 1
        path    /words
        uid     0
        gid     0
        atime   Thu Apr 11 11:46:09 2013
        mtime   Thu Apr 11 11:46:09 2013
        ctime   Thu Apr 11 11:46:09 2013
        crtime  Thu Apr 11 11:46:09 2013
        gen     7
        mode    100444
        size    206674
        parent  4
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:64c00:800 0:5c14c00:800 4000L/400P F=2 B=7/7
               0  L0 0:14c00:28000 20000L/20000P F=1 B=7/7
           20000  L0 0:3cc00:28000 20000L/20000P F=1 B=7/7

        segment [0000000000000000, 0000000000040000) size  256K

So, there are two blocks, one at offset 0x14c00 and the other at offset 0x3cc00, both of them 0x28000 bytes. Of the 0x28000 bytes, 0x8000 is parity. The real data size is 0x20000. The question is, where does this data physically reside, i.e., which disk(s), and where on those disks?
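As a quick sanity check on that parity overhead (my arithmetic, not zdb's): with five disks and single parity, each stripe has four data columns and one parity column, so

    0x20000 data / 4 data columns = 0x8000 per column
    0x8000 per column * 1 parity column = 0x8000 parity
    0x20000 data + 0x8000 parity = 0x28000 on disk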

I wrote a small program that uses the code from the vdev_raidz_map_alloc() routine to tell me the mapping that is set up. Here it is:

/*
 * Given an offset, size, number of disks in the raidz pool,
 * the number of parity "disks" (1, 2, or 3 for raidz, raidz2, raidz3),
 * and the sector size (shift),
 * print a set of stripes.
 */

#include <sys/types.h>
#include <sys/sysmacros.h>
#include <inttypes.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdio.h>

/*
 * The following are taken straight from usr/src/uts/common/fs/zfs/vdev_raidz.c
 * If they change there, they need to be changed here.
 *
 * a map of columns returned for a given offset and size
 */
typedef struct raidz_col {
    uint64_t rc_devidx;     /* child device index for I/O */
    uint64_t rc_offset;     /* device offset */
    uint64_t rc_size;       /* I/O size */
    void *rc_data;          /* I/O data */
    void *rc_gdata;         /* used to store the "good" version */
    int rc_error;           /* I/O error for this device */
    uint8_t rc_tried;       /* Did we attempt this I/O column? */
    uint8_t rc_skipped;     /* Did we skip this I/O column? */
} raidz_col_t;

typedef struct raidz_map {
    uint64_t rm_cols;       /* Regular column count */
    uint64_t rm_scols;      /* Count including skipped columns */
    uint64_t rm_bigcols;        /* Number of oversized columns */
    uint64_t rm_asize;      /* Actual total I/O size */
    uint64_t rm_missingdata;    /* Count of missing data devices */
    uint64_t rm_missingparity;  /* Count of missing parity devices */
    uint64_t rm_firstdatacol;   /* First data column/parity count */
    uint64_t rm_nskip;      /* Skipped sectors for padding */
    uint64_t rm_skipstart;  /* Column index of padding start */
    void *rm_datacopy;      /* rm_asize-buffer of copied data */
    uintptr_t rm_reports;       /* # of referencing checksum reports */
    uint8_t rm_freed;       /* map no longer has referencing ZIO */
    uint8_t rm_ecksuminjected;  /* checksum error was injected */
    raidz_col_t rm_col[1];      /* Flexible array of I/O columns */
} raidz_map_t;

/*
 *  vdev_raidz_map_get() is hacked from vdev_raidz_map_alloc() in
 *  usr/src/uts/common/fs/zfs/vdev_raidz.c.  If that routine changes,
 *  this might also need changing.
 */

raidz_map_t *
vdev_raidz_map_get(uint64_t size, uint64_t offset, uint64_t unit_shift, 
           uint64_t dcols, uint64_t nparity)
{
    raidz_map_t *rm;
    uint64_t b = offset >> unit_shift;  /* starting sector within the raidz vdev */
    uint64_t s = size >> unit_shift;    /* I/O size in sectors */
    uint64_t f = b % dcols;             /* first column (disk) of the stripe */
    uint64_t o = (b / dcols) << unit_shift; /* starting byte offset on each disk */
    uint64_t q, r, c, bc, col, acols, scols, coff, devidx, asize, tot;

    q = s / (dcols - nparity);          /* whole sectors in each data column */
    r = s - q * (dcols - nparity);      /* remainder sectors */
    bc = (r == 0 ? 0 : r + nparity);    /* count of oversized ("big") columns */
    tot = s + nparity * (q + (r == 0 ? 0 : 1)); /* total sectors, data + parity */

    if (q == 0) {
        acols = bc;
        scols = MIN(dcols, roundup(bc, nparity + 1));
    } else {
        acols = dcols;
        scols = dcols;
    }

    rm = malloc(offsetof(raidz_map_t, rm_col[scols]));

    if (rm == NULL) {
        fprintf(stderr, "malloc failed\n");
        exit(1);
    }

    rm->rm_cols = acols;
    rm->rm_scols = scols;
    rm->rm_bigcols = bc;
    rm->rm_skipstart = bc;
    rm->rm_missingdata = 0;
    rm->rm_missingparity = 0;
    rm->rm_firstdatacol = nparity;
    rm->rm_datacopy = NULL;
    rm->rm_reports = 0;
    rm->rm_freed = 0;
    rm->rm_ecksuminjected = 0;

    asize = 0;

    for (c = 0; c < scols; c++) {
        col = f + c;
        coff = o;
        if (col >= dcols) {
            col -= dcols;
            coff += 1ULL << unit_shift;
        }
        rm->rm_col[c].rc_devidx = col;
        rm->rm_col[c].rc_offset = coff;
        rm->rm_col[c].rc_data = NULL;
        rm->rm_col[c].rc_gdata = NULL;
        rm->rm_col[c].rc_error = 0;
        rm->rm_col[c].rc_tried = 0;
        rm->rm_col[c].rc_skipped = 0;

        if (c >= acols)
            rm->rm_col[c].rc_size = 0;
        else if (c < bc)
            rm->rm_col[c].rc_size = (q + 1) << unit_shift;
        else
            rm->rm_col[c].rc_size = q << unit_shift;

        asize += rm->rm_col[c].rc_size;
    }

    rm->rm_asize = roundup(asize, (nparity + 1) << unit_shift);
    rm->rm_nskip = roundup(tot, nparity + 1) - tot;

    /*
     * For raidz1, toggle the parity and first data columns every 1MB so
     * that parity doesn't always land on the same disk.
     */
    if (rm->rm_firstdatacol == 1 && (offset & (1ULL << 20))) {
        devidx = rm->rm_col[0].rc_devidx;
        o = rm->rm_col[0].rc_offset;
        rm->rm_col[0].rc_devidx = rm->rm_col[1].rc_devidx;
        rm->rm_col[0].rc_offset = rm->rm_col[1].rc_offset;
        rm->rm_col[1].rc_devidx = devidx;
        rm->rm_col[1].rc_offset = o;

        if (rm->rm_skipstart == 0)
            rm->rm_skipstart = 1;
    }

    return (rm);

}

int
main(int argc, char *argv[])
{
    uint64_t offset = 0;
    uint64_t size = 0;
    uint64_t dcols = 0;
    uint64_t nparity = 1;
    uint64_t unit_shift = 9;  /* sector size shift; default 9 = 512-byte sectors */
    raidz_map_t *rzm;
    raidz_col_t *cols;
    uint64_t i;

    if (argc < 4) {
        fprintf(stderr, "Usage: %s offset size ndisks [nparity [ashift]]\n", argv[0]);
        fprintf(stderr, "  all arguments are in hex\n");
        fprintf(stderr, "  ndisks is number of disks in raid pool, including parity\n");
        fprintf(stderr, "  nparity defaults to 1 (raidz1)\n");
        fprintf(stderr, "  ashift defaults to 9 (512-byte sectors)\n");
        exit(1);
    }

    /* XXX - check return values */
    offset = strtoull(argv[1], NULL, 16);
    size = strtoull(argv[2], NULL, 16);
    dcols = strtoull(argv[3], NULL, 16);

    if (size == 0 || dcols == 0) { /* should check size multiple of ashift...*/
        fprintf(stderr, "size and/or number of columns must be > 0\n");
        exit(1);
    }

    if (argc > 4)
        nparity = strtoull(argv[4], NULL, 16);

    if (argc == 6)
        unit_shift = strtoull(argv[5], NULL, 16);

    rzm = vdev_raidz_map_get(size, offset, unit_shift, dcols, nparity);

    printf("cols = %d, firstdatacol = %d\n", rzm->rm_cols, rzm->rm_firstdatacol);
    for (i = 0, cols = &rzm->rm_col[0]; i < rzm->rm_cols; i++, cols++)
        printf("%d:%lx:%lx\n", cols->rc_devidx, cols->rc_offset, cols->rc_size);

    exit(0);
}

The program takes an offset, size, and number of disks in the pool, and optionally the number of parity columns (1 for raidz, 2 for raidz2, and 3 for raidz3) and the sector size (as a shift), and outputs the location of the blocks on the underlying disks. Let's try it.

# gcc -m64 raidzdump.c -o raidzdump
# ./raidzdump 14c00 20000 5
cols = 5, firstdatacol = 1
1:4200:8000
2:4200:8000
3:4200:8000
4:4200:8000
0:4400:8000
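
We can check this against the arithmetic in vdev_raidz_map_get() by hand (my walk-through, using the variable names from the code):

    b = 0x14c00 >> 9 = 0xa6 = 166   (sector offset within the raidz vdev)
    f = 166 % 5 = 1                 (the stripe starts at disk 1)
    o = (166 / 5) << 9 = 0x4200     (byte offset on each disk)
    s = 0x20000 >> 9 = 256          (sectors of data)
    q = 256 / (5 - 1) = 64 = 0x40   (sectors per column, i.e., 0x8000 bytes)
    r = 0                           (no oversized columns)

Columns 0 through 4 land on disks 1, 2, 3, 4, and 0; the last column wraps past disk 4, so its offset is bumped one sector to 0x4400. And since 0x14c00 & (1 << 20) is zero, the 1MB parity toggle at the bottom of the routine doesn't fire, so parity stays in the first column.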

The parity for the block is on disk 1 (/var/tmp/f1), the first 32k of data is on disk 2 (/var/tmp/f2), the second 32k on disk 3, and so on. We could use zdb(1M) to check this, except there is a bug (see Bug #3659). The following works on older versions of illumos and on Solaris 11, but currently dumps core:

# zdb -R rzpool 0.2:4200:8000:r
Found vdev: /var/tmp/f2
assertion failed for thread 0xfffffd7fff172a40, thread-id 1: vd->vdev_parent == (pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev), file ../../../uts/common/fs/zfs/zio.c, line 827
Abort (core dumped)
#

This says to go to vdev 2 (/var/tmp/f2, child of the root vdev 0), at location 0x4200, and read 0x8000 (32k) of data and display it.

Since zdb(1M) is currently broken for this, let's try a different way. The offsets reported are relative to the start of the data area on each disk, which begins after the 4MB reserved at the front of every disk for the vdev labels and boot block. So we'll add 0x400000 to the 0x4200 to get an absolute byte offset within the disk, then use dd to look at the data.

# mdb
4200+400000=E
    4211200
$q
#
# dd if=/var/tmp/f2 bs=1 iseek=4211200 count=32k
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
...

And there is the first 32k of the words file. To get the next 32k, use the device and offset from the third mapping line of raidzdump's output (3:4200:8000, i.e., /var/tmp/f3), and so on. This gets more interesting with smaller blocks, but that is left as an exercise for the reader. For instance, a write of a 512-byte file will stripe across two disks: one for the data, the other for parity. Note that I have not tested with raidz2 or raidz3. I expect firstdatacol to be 2 and 3 respectively, but the code should work. You need to specify the parity (2 or 3) to raidzdump as an argument.
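
To illustrate the 512-byte case with raidzdump (using a made-up offset of 0 here; the real offset depends on where ZFS allocates the block):

# ./raidzdump 0 200 5
cols = 2, firstdatacol = 1
0:0:200
1:0:200

One 512-byte parity sector and one 512-byte data sector, on two of the five disks; the other three are untouched by this block.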

Have fun!

