November 07, 2011 - by deirdrebstraughan
Part 1 of 5 on Examining File System Latency in Production, by Brendan Gregg, Lead Performance Engineer at Joyent and author with Jim Mauro of “DTrace: Dynamic Tracing in Oracle Solaris, Mac OS X and FreeBSD”
When examining system performance problems, we commonly point the finger of blame at the system disks by observing disk I/O. This is a bottom-up approach to performance analysis: starting with the physical devices and moving up through the software stack. Such an approach is standard for system administrators, who are responsible for these physical devices.
To understand the impact of I/O performance on applications, however, file systems can prove to be a better target for analysis than disks. Because modern file systems use more DRAM-based cache and perform more asynchronous disk I/O, the performance perceived by an application can be very different from what’s happening on disk. I’ll demonstrate this by examining the I/O performance of a MySQL database at both the disk and file system level.
We will first discuss the commonly-used approach, disk I/O analysis using iostat(1M), then turn to file system analysis using DTrace.
When trying to understand how disk I/O affects application performance, analysis has historically focused on the performance of storage-level devices: the disks themselves. This includes the use of tools such as iostat(1M), which prints various I/O statistics for disk devices. System administrators either run iostat(1M) directly at the command line, or use it via another interface. Some monitoring software (e.g., munin) will use iostat(1M) to fetch the disk statistics which they then archive and plot.
Here is an iostat(1M) screenshot from a Solaris-based system running a MySQL database, as well as other applications (some extraneous output lines trimmed):
# iostat -xnz 1 10 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 1.1 33.8 78.8 1208.1 0.0 1.0 0.0 27.8 0 4 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 175.4 0.0 22449.9 0.0 0.0 1.1 0.0 6.1 0 82 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 106.1 379.2 13576.5 22036.7 0.0 2.4 0.0 4.9 0 85 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 139.9 0.0 17912.6 0.0 0.0 1.0 0.0 6.8 0 82 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 176.1 0.0 22538.0 0.0 0.0 1.0 0.0 5.7 0 85 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 208.0 0.0 26619.9 0.0 0.0 1.9 0.0 9.2 0 99 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 208.0 0.0 26624.4 0.0 0.0 1.7 0.0 8.2 0 95 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 106.0 368.9 13566.1 26881.3 0.0 2.7 0.0 5.7 0 93 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 146.0 0.0 18691.3 0.0 0.0 0.9 0.0 6.4 0 88 c0t1d0 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 84.2 0.0 10779.2 0.0 0.0 0.5 0.0 6.1 0 42 c0t1d0
These statistics show an average I/O service time (“asvc_t”) between 4.9 and 9.2 milliseconds and a percent busy (“%b”) rate, reaching 99% busy in one interval. The MySQL database on this server is suffering slow queries (longer than one second), and, based on the iostat(1M) output, you may be able to guess why – the disks. For an application, this looks like a horrible system to be running on.
iostat(1M) shows disk-level performance.
iostat(1M) can be extremely useful, especially the “-x” form of output. However, from the application perspective it can be difficult to interpret and even misleading. Let’s look at some reasons why.
We’ll first examine a simple issue: the system shown above was running MySQL alongside other applications. So, the heavy disk I/O could be caused by – and/or affecting – some other application. What if MySQL was actually caching very well in DRAM and hardly using the disks, while a nightly backup process walked the entire file system, rattling the disks? You might see output like the above, with the slow MySQL queries caused by something else entirely.
iostat(1M)’s disk I/O shows the impact of all processes, not just the one you have in mind.
I’ve worked this issue before, creating psio* and later iosnoop and iotop* to try to identify disk I/O by process and filename. But these tools don’t always succeed in identifying the process and file responsible for particular disk I/O, especially in the ZFS file system. This shortcoming is not easy to fix, leaving us wondering: should we be looking at the disks, or at something else? It helps here to consider all the components of the I/O stack.
Typically, applications are not performing I/O to the disks directly; rather, they do so via a file system. And file systems work hard to prevent applications from suffering disk I/O latency directly, for example by using DRAM to buffer writes, and to cache and prefetch reads.
Here is an example I/O stack showing key components of a file system, based loosely on ZFS:
This diagram shows that there are sources of disk I/O other than what the application is directly (synchronously) requesting. For instance, on the write-side, the application may “dirty” buffers in the file system cache and consider the I/O completed, but the file system doesn’t perform the disk I/O until much later – seconds – by batching together dirty data and writing them in bulk. This was evident in the previous iostat(1M) output with the bursts of writes (see the “kw/s” column) – which does not reflect how the application is actually performing writes.
Apart from other sources of disk I/O adding to the confusion, there is also what happens to the direct I/O itself – particularly with the on-disk layout layer. This is a big topic I won’t go into much here, but I’ll enumerate a single example of I/O inflation to consider:
So, while the application performed a single 1 byte write, the disks performed multiple reads (128 Kbytes worth) and even more writes (over 128 Kbytes worth).
Application I/O to the file system != Disk I/O
It can be even worse than this, for example if the metadata required to reference the location had to be read in the first place, and if the file system (volume manager) employed RAID with a stripe size larger than the record size.
Having mentioned inflation, I should also mention that deflation is possible (in case it isn’t obvious). Causes can include caching in DRAM to satisfy reads, and cancellation of buffered writes (data rewritten before it has been flushed to disk).
I’ve recently been watching production ZFS file systems running with a cache hit rate of over 99.9%, meaning that only a trickle of reads are actually reaching disk.
To summarize so far: looking at how hard the disks are rattling, as we did above using iostat(1M), tells us very little about what the target application is actually experiencing. Application I/O can be inflated or deflated by the file system by the time it reaches disks, making difficult at best a direct correlation between disk and application I/O. Disk I/O also includes requests from other file system components, such as prefetch, the background flusher, on-disk layout metadata, and other users of the file system (other applications). Even if you find an issue at the disk level, it’s hard to tell how much that matters to the application in question.
iostat(1M) includes other file system I/O, which may not directly affect the performance of the target application.
Summarizing the previous issues:
iostat(1M) shows disk level performance, not file system performance.
In the next post, Invisible Issues of I/O, I’ll show how file system performance can be analyzed.
photo by razvan.orendovici
Part 1: When iostat Leads You Astray
Part 2: Invisible Issues of I/O
Part 4: Drilling Down Into the Kernel
Part 5: Presenting File System Latency