November 14, 2011 - by deirdrebstraughan
Part 2 of 5 on Examining File System Latency in Production, by Brendan Gregg, Lead Performance Engineer at Joyent and author with Jim Mauro of “DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD”
I previously explained why disk I/O is difficult to associate with an application, and why it can be altered from what the application requested. Now I’ll focus more on the file system, and show why it can be important to study I/O that never even reaches disk.
What matters most to the application is the latency of its requests to the file system, which can be measured in this part of the stack (see Part 1 for the full diagram):
Measuring I/O and latency at this level is much more interesting, where it is directly affecting the application.
If we can also examine application context during the latency – to see whether it’s occurring during a sensitive code-path – then we can answer with certainty whether there is a file system issue affecting the application or not, and whether that’s worth investigating further. Being able to answer this early in the diagnosis phase can be immensely useful, so that we start down the correct path more quickly.
Apart from being more relevant to the application than disk I/O, file system I/O also includes of other phenomena that can be worth examining, including cache hits, lock latency, additional queueing, and disk-cache flush latency.
Reads and writes may be served from the file system main memory cache instead of disk (if present, enabled and eligible for the I/O type). Main memory is typically DRAM, and we call these main memory reads “cache hits”. For reads:
Since these cache hits don’t reach the disk, they are never observable using iostat(1M). They will be visible when tracing at the file system level, and, if the file system cache is performing well, you may see orders of magnitude more I/O than at the disk level.
If cache hits are “good” I/O, it may not be immediately obvious why we’d even want to see them. Here are three reasons to consider:
What if the I/O was slow due to file system lock contention, even though no disk I/O was involved?
File systems employ locks to ensure data integrity in a multi-threaded environment. The latency incurred with lock contention will be included when tracing at this level. Time spent waiting to acquire the lock could dominate I/O latency as seen by the application:
In this case, the disks paint a rosier picture than reality, as their latency could be dwarfed by lock latency. While this is unlikely, it could happen (and has in the past). When chasing down mysterious I/O latency, you don’t want to leave any stone unturned.
High lock wait (contention) could happen for a number of reasons, including extreme I/O conditions or file system bugs. Lock and other sources of file system latency won’t be visible from iostat(1M).
If you only use iostat(1M), you may be flying blind regarding lock and other file system issues.
There is one other latency source that iostat(1M) does show directly: waiting on one of the I/O queues. I’ll dig into queueing a little here, and explain why we need to return to file system latency.
I/O can be queued in the kernel before it is issued to disk.
I’ve been trying to describe file systems generically to avoid getting sidetracked into implementation and internals, but here I’ll dip into ZFS a little. On Solaris-based systems, an I/O can queue in the ZFS I/O pipeline (“ZIO pipeline”), and then ZFS vdev queues, and finally a SCSI “sd” block device driver queue. iostat(1M)’s “wsvc_t” does show queue latency for the sd driver (and the “wait” and “%w” columns relate to this queue as well), but these don’t reflect ZFS queueing.
So, iostat(1M) gets a brief reprieve – it doesn’t show just disk I/O latency, but also block device driver queue latency.
However, similarly to disk I/O latency, queue latency may not matter unless the application is waiting for that I/O to complete. To understand this from the application perspective, we are still best served by measuring latency at the file system level – which will include any queueing latency from any queue that the application I/O has synchronously waited for.
Some file systems ensure that synchronous write I/O – where the application has requested that the write not complete until it is on stable storage – really is on stable storage. (ZFS may actually be the only file system that currently does this by default.) It can work by sending SCSI cache flush commands to the disk devices, and not completing the application I/O until the cache flush command has completed. This ensures that the data really is on stable storage – and not just buffered.
The application is actually waiting for the SCSI flush command to complete, a condition not (currently) included in iostat(1M). This means that the application can be suffering write latency issues actually caused by disk latency that are invisible via iostat(1M). I’ve wrestled with this issue before, and have included scripts in the DTrace book to show the SCSI cache flush latency.
If the latency is measured at the file system interface, this latency will include cache flush commands.
Part 1 showed how application storage I/O was confusing to understand from the disk level. In this part I showed some scenarios where issues are just not visible.
This isn’t really a failing of iostat(1M), which is a great tool for system administrators to understand the usage of their resources. But applications are far, far away from the disks, with a complex file system in between. For application analysis, iostat(1M) may provide clues that disks could be causing issues, but, in order to directly associate latency with the application, you really need to measure at the file system level, and consider other file system latency issues.
In the next section I’ll measure file system latency on a running application (MySQL).
Part 1: When iostat Leads You Astray
Part 2: Invisible Issues of I/O
Part 4: Drilling Down Into the Kernel
Part 5: Presenting File System Latency