Common pitfall when benchmarking ZFS with fio
Let's say you build yourself a new ZFS pool on top of some pretty fast NVMe drives and want to benchmark it to see how well it performs. You create a zvol and fire up fio to sequentially read some data from it. Anticipating a large number of IOPS, you don't want your CPU to become the bottleneck, so naturally you include --numjobs=8 to be sure you squeeze everything out of your NAND flash. Fio completes and TA-DAH: IOPS through the roof. But wait a minute... Your pool, consisting of three NVMe drives each capable of 3.2 GB/s sequential read, is being read at 24 GB/s! Surely the disk vendor wouldn't understate its product's performance, so something must be wrong with the test.
# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio \
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
(...)
  read: IOPS=24.4k, BW=23.9GiB/s (25.6GB/s)(800GiB/33516msec)
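For the record, the zvol was freshly created right before this run, with nothing more than the standard zpool/zfs commands. Something along these lines, where the device names, pool layout and volume size are assumptions for illustration rather than the actual setup:

# zpool create tank nvme0n1 nvme1n1 nvme2n1
# zfs create -V 200G tank/bucket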
Well, reading zeros from a freshly created volume is bound to be quick, so let's write some data onto it first:
# ./fio-3-28-115 --name=test --rw=write --bs=1m --direct=1 --ioengine=libaio \
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
(...)
  write: IOPS=14.0k, BW=13.7GiB/s (14.7GB/s)(800GiB/58562msec)
That was fast! At this rate 100 GiB should be written in just about 8 seconds, yet the run took several times longer -- no surprise, since each of the eight jobs wrote its own 100 GiB pass, 800 GiB of IO in total. Anyway, let's check the reads now:
# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio \
  --size=100G --group_reporting --filename=/dev/tank/bucket --numjobs=8
(...)
  read: IOPS=9320, BW=9320MiB/s (9773MB/s)(800GiB/87894msec)
That looks more reasonable, but still a bit too fast. Checking zpool iostat reveals another interesting fact:
# zpool iostat 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         100G  20.9T  1.02K    772   102M  85.1M
tank         100G  20.9T  10.1K     26  1.10G   153K
tank         100G  20.9T  10.4K     28  1.13G   146K
tank         100G  20.9T  11.4K     19  1.22G   112K
Hmm, the underlying storage is only being read at roughly one eighth of the rate fio reports. Why is that? Cache, of course! Specifically, the ZFS ARC. When fio is run with multiple jobs, it starts them as separate processes, and each of them reads the same region of the device, starting from the beginning and working upwards. When ZFS serves an IO request from one of those processes, the data lands in the ARC, so when the other processes ask for the same blocks, the requests bypass the storage and are served straight from RAM. How to counter it? By specifying the offset for each job explicitly (fio's offset_increment option achieves the same thing more compactly, as sketched at the end of this post):
# ./fio-3-28-115 --rw=read --bs=1m --direct=1 --ioengine=libaio --size=10G \
  --group_reporting --filename=/dev/tank/bucket --name=job1 --offset=0G \
  --name=job2 --offset=10G --name=job3 --offset=20G --name=job4 --offset=30G \
  --name=job5 --offset=40G --name=job6 --offset=50G --name=job7 --offset=60G \
  --name=job8 --offset=70G
(...)
  read: IOPS=4174, BW=4175MiB/s (4378MB/s)(80.0GiB/19622msec)
# zpool iostat 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         100G  20.9T    912    748  86.7M  82.4M
tank         100G  20.9T  33.5K     26  4.07G   155K
tank         100G  20.9T  36.4K     24  4.42G   143K
tank         100G  20.9T  34.1K     29  4.15G   150K
Now those are some reasonable results. The same goes for writes, except in that case the ARC is not to blame for the inflated numbers. Instead, when multiple processes write to the same region of the device, the IOs are buffered in RAM until ZFS flushes them to the disks. Naturally, only the latest write to any particular block actually has to reach the disks, yet every single IO is acknowledged to the fio processes by ZFS, which results in artificially inflated performance. With per-job offsets the write numbers come back down to earth as well:
# ./fio-3-28-115 --rw=write --bs=1m --direct=1 --ioengine=libaio --size=10G \
  --group_reporting --filename=/dev/tank/bucket --name=job1 --offset=0G \
  --name=job2 --offset=10G --name=job3 --offset=20G --name=job4 --offset=30G \
  --name=job5 --offset=40G --name=job6 --offset=50G --name=job7 --offset=60G \
  --name=job8 --offset=70G
(...)
  write: IOPS=1800, BW=1800MiB/s (1888MB/s)(80.0GiB/45505msec); 0 zone resets
# zpool iostat 5
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank         104G  20.9T    955    748  92.0M  82.4M
tank         107G  20.8T     26  13.7K  1.11M  1.66G
tank         107G  20.8T     29  15.1K  1.24M  1.81G
tank         107G  20.8T     26  15.7K  1.11M  1.61G
tank         107G  20.8T     34  18.7K  1.45M  2.09G
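As a side note, the eight explicitly named jobs can be replaced with fio's offset_increment option, which gives every sub-job an offset of offset + offset_increment * job number. A sketch of the equivalent read job (not rerun here):

# ./fio-3-28-115 --name=test --rw=read --bs=1m --direct=1 --ioengine=libaio \
  --size=10G --group_reporting --filename=/dev/tank/bucket --numjobs=8 \
  --offset=0 --offset_increment=10G

With numjobs=8 this spreads the jobs over 0G, 10G, ... 70G, the same layout as the hand-written version above.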
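And if the question is how fast the data actually reaches stable storage, rather than how fast ZFS can acknowledge writes into RAM, consider making fio flush at the end of the run, e.g. with --end_fsync=1 (or temporarily setting sync=always on the volume). A sketch, with the caveat that the exact effect of the final fsync on a zvol depends on the ZFS version and settings:

# ./fio-3-28-115 --name=test --rw=write --bs=1m --direct=1 --ioengine=libaio \
  --size=10G --group_reporting --filename=/dev/tank/bucket --numjobs=8 \
  --offset=0 --offset_increment=10G --end_fsync=1

This way the final flush happens inside the fio run instead of after it has already printed its numbers.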