Common pitfall when benchmarking ZFS with fio

Let's say you build yourself a new ZFS pool on top of some pretty fast NVMe drives and want to benchmark it to see how well it performs. You create a zvol and fire up fio to sequentially read some data from it. But anticipating a large number of IOPS, you don't want your CPU to bottleneck the performance, so naturally you include --numjobs=8 to be sure you get the most out of your NAND flash. Fio completes and TA-DAH: IOPS through the roof. But wait a minute... Your pool consists of three NVMe drives, each capable of 3.2 GB/s of sequential reads, yet it is apparently being read at 24 GB/s! The drive vendor surely wouldn't understate its product's performance, so something must be wrong with the test.
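
For reference, a setup like the one described above could be created with something along these lines (the device names and volume size here are made up for illustration; -s simply makes the volume sparse):

# zpool create tank nvme0n1 nvme1n1 nvme2n1
# zfs create -s -V 200G tank/bucket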

# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio\
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
  (...)
    read: IOPS=24.4k, BW=23.9GiB/s (25.6GB/s)(800GiB/33516msec)
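
One clue right away: the zvol has never been written to, so ZFS has not allocated a single data block for it yet, and reads of those holes never have to touch the disks. You can confirm that from the dataset's space accounting (property names as in current OpenZFS); for an unwritten volume, referenced should stay in the kilobyte range no matter how large volsize is:

# zfs get volsize,referenced tank/bucket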

Well, reading zeros from a freshly created volume is bound to be quick, so let's write some data onto it:

# ./fio-3-28-115 --name=test --rw=write --bs=1m --direct=1 --ioengine=libaio\
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
  (...)
    write: IOPS=14.0k, BW=13.7GiB/s (14.7GB/s)(800GiB/58562msec)

That was fast! All eight jobs write to the same 100 GiB region, so at the reported 13.7 GiB/s the whole run should be over in about 8 seconds, yet it took nearly a minute. Anyway, let's check the reads now:

# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio\
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
  (...)
    read: IOPS=9320, BW=9320MiB/s (9773MB/s)(800GiB/87894msec)

That seems more reasonable, but still a bit too fast. Checking zpool iostat reveals another interesting fact:

# zpool iostat 5

                capacity     operations     bandwidth
  pool        alloc   free   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  tank         100G  20.9T  1.02K    772   102M  85.1M
  tank         100G  20.9T  10.1K     26  1.10G   153K
  tank         100G  20.9T  10.4K     28  1.13G   146K
  tank         100G  20.9T  11.4K     19  1.22G   112K

Hmm, the underlying storage is being read at a rate roughly 8 times lower than what fio reports. Why is that? Well, cache of course! Specifically, the ZFS ARC. When fio is run with multiple jobs, it starts them as separate processes, each beginning its sequential read from the same starting sector and moving upward. When ZFS serves an IO request from one of these processes, the data lands in the ARC, so when the other processes ask for the same blocks a moment later, their requests bypass the storage entirely and are served straight from RAM (there is a quick way to watch this happening, shown after the corrected results below). How to counter it? By specifying the offset for each job explicitly:

# ./fio-3-28-115 --rw=read --bs=1m --direct=1 --ioengine=libaio --size=10G\
  --group_reporting --filename=/dev/tank/bucket --name=job1 --offset=0G\
  --name=job2 --offset=10G --name=job3 --offset=20G --name=job4 --offset=30G\
  --name=job5 --offset=40G --name=job6 --offset=50G --name=job7 --offset=60G\
  --name=job8 --offset=70G
  (...)
    read: IOPS=4174, BW=4175MiB/s (4378MB/s)(80.0GiB/19622msec)
# zpool iostat 5
                capacity     operations     bandwidth
  pool        alloc   free   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  tank         100G  20.9T    912    748  86.7M  82.4M
  tank         100G  20.9T  33.5K     26  4.07G   155K
  tank         100G  20.9T  36.4K     24  4.42G   143K
  tank         100G  20.9T  34.1K     29  4.15G   150K
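
As promised above, the ARC's involvement is easy to observe. OpenZFS ships an arcstat helper (packaged as arcstat or arcstat.py depending on the distribution) that prints ARC hit and miss counters at a chosen interval, and on Linux the raw numbers are also exposed through the kstats. The exact columns vary between versions, but the idea is the same: run one of these alongside the two read tests and compare.

# arcstat 5
# awk '/^(hits|misses) / {print $1, $3}' /proc/spl/kstat/zfs/arcstats

During the overlapping-jobs run the hit counters should dominate (roughly 7 out of every 8 reads can be served from the ARC), while with explicit per-job offsets the misses should, matching what zpool iostat already showed.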

With non-overlapping offsets, the read results finally look reasonable. The same goes for writes, except in that case the ARC is not the culprit. Instead, when multiple processes write to the same region of the volume, the IOs are buffered in RAM as dirty data (not in the ZIL, which only comes into play for synchronous writes) and later flushed to the disks in transaction groups. Only the most recent write to a particular block actually has to reach the disks, yet every single IO is acknowledged to the fio processes as soon as ZFS has it in memory. This results in artificially inflated performance:

# ./fio-3-28-115 --rw=write --bs=1m --direct=1 --ioengine=libaio --size=10G\
  --group_reporting --filename=/dev/tank/bucket --name=job1 --offset=0G\
  --name=job2 --offset=10G --name=job3 --offset=20G --name=job4 --offset=30G\
  --name=job5 --offset=40G --name=job6 --offset=50G --name=job7 --offset=60G\
  --name=job8 --offset=70G
  (...)
    write: IOPS=1800, BW=1800MiB/s (1888MB/s)(80.0GiB/45505msec); 0 zone resets
# zpool iostat 5
                capacity     operations     bandwidth
  pool        alloc   free   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  tank         104G  20.9T    955    748  92.0M  82.4M
  tank         107G  20.8T     26  13.7K  1.11M  1.66G
  tank         107G  20.8T     29  15.1K  1.24M  1.81G
  tank         107G  20.8T     26  15.7K  1.11M  1.61G
  tank         107G  20.8T     34  18.7K  1.45M  2.09G
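
One closing convenience: listing eight --name/--offset pairs by hand gets tedious. Fio can derive per-job offsets on its own via offset_increment, where each cloned job starts at offset + offset_increment * job number, so the same non-overlapping layout can be requested with plain numjobs again. A minimal sketch, assuming your fio build supports the option (check fio --help if unsure):

# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio\
  --size=10G --offset_increment=10G --numjobs=8 --group_reporting\
  --filename=/dev/tank/bucket

The same applies to the write test. And if you want extra confidence that writes really reach stable storage instead of just the in-memory dirty data, forcing synchronous semantics (for example zfs set sync=always tank/bucket, or fio's --fsync=1) should bring the reported numbers much closer to what zpool iostat shows, at the cost of benchmarking a different, synchronous workload.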