January 10, 2014 - by Brendan Gregg
"I'm giving a webinar about benchmarking the cloud on February 6th, where I'll explain the process in detail, and help you benchmark reliably and effectively! Sign up here."
Benchmarking, and benchmarking the cloud, is incredibly error prone. I provided guidance through this minefield in the benchmarking chapter of my book (Systems Performance: Enterprise and the Cloud); that chapter can be read online on the InformIT site. I also gave a lightning talk about benchmarking gone wrong at Surge last year. In this post, I'm going to cut to the chase and show you the tools I commonly use for basic cloud benchmarking.
As explained in the benchmarking chapter, I do not run these tools passively. I perform Active Benchmarking, where I use a variety of other observability tools while the benchmark is running, to confirm that it is measuring what it is supposed to. To perform reliable benchmarking, you need to be rigorous. For some suggestions of observability tools that you can use, try starting with the OS checklists (Linux, Solaris, etc.) from the USE Method.
The aim here is to benchmark the performance of cloud instances, whether for evaluations, capacity planning, or troubleshooting performance issues. My approach is to use micro-benchmarks, where a single component or activity is tested, and to test on different resource dimensions: CPUs, networking, file systems. The results can then be mapped à la carte: I may be investigating a production application workload that has high network throughput, moderate CPU usage, and negligible file system usage, and so I can weigh the importance of each accordingly. Additional goals for testing these dimensions in the cloud environment are listed in the following sections.
For CPUs, this is what I'd like to test, and why:
For single-threaded performance, I start by hacking up an assembly program to investigate instruction retire rates, and disassemble the binary to confirm what is being measured. That gives me a baseline result for how fast the CPUs really are. I'll write up that process when I get a chance.
sysbench can test single-threaded and multi-threaded performance by calculating prime numbers. This also brings memory I/O into play. You need to be running the same version of sysbench, with the same compilation options, to be able to compare results. Testing from 1 to 8 threads:
sysbench --num-threads=1 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=2 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=4 --test=cpu --cpu-max-prime=25000 run
sysbench --num-threads=8 --test=cpu --cpu-max-prime=25000 run
The value for cpu-max-prime should be chosen so that the benchmark runs for at least 10 seconds. I don't test for longer than 60 seconds, unless I'm looking for systemic perturbations like cron jobs.
I'll run the same multi-threaded sysbench invocation a number of times, to look for repeatability. This could vary based on scheduler placement, CPU affinity, and memory groups.
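The thread sweep and the repeated runs can be combined in a small wrapper. This is a minimal sketch rather than a definitive harness: the function name sweep_sysbench and the variables MAX_PRIME, THREADS, REPEATS, and DRY_RUN are illustrative names of my own, not sysbench options, and it assumes the same sysbench version and flags as the commands above.

```shell
#!/bin/sh
# Sketch: sweep sysbench thread counts, repeating each run to check repeatability.
# MAX_PRIME, THREADS, REPEATS and DRY_RUN are illustrative names, not sysbench options.
MAX_PRIME=${MAX_PRIME:-25000}
THREADS=${THREADS:-"1 2 4 8"}
REPEATS=${REPEATS:-3}

sweep_sysbench() {
    for n in $THREADS; do
        i=1
        while [ "$i" -le "$REPEATS" ]; do
            cmd="sysbench --num-threads=$n --test=cpu --cpu-max-prime=$MAX_PRIME run"
            # Label each run so that results can be compared afterwards.
            echo "# threads=$n run=$i: $cmd"
            # With DRY_RUN=1, only print the command instead of executing it.
            [ "${DRY_RUN:-0}" = 1 ] || $cmd
            i=$((i + 1))
        done
    done
}
```

Setting DRY_RUN=1 before calling sweep_sysbench prints the twelve commands without running them, which is a cheap way to check the sweep before spending time on it. Comparing the repeated runs for each thread count then shows how much the results move between invocations.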
The single-threaded results are important for single-threaded (or effectively single-threaded) applications, like node.js. The multi-threaded results are important for applications like MySQL server.
While sysbench is running, you'll want to analyze CPU usage. For example, on Linux, I'd use mpstat, sar, pidstat, and perf. On SmartOS, I'd use mpstat, prstat, and DTrace profiling.
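One way to make sure an observability tool is always running alongside the benchmark is to start it in the background and stop it when the benchmark exits. The following is a sketch under assumptions: run_observed and its argument layout are hypothetical, and the observer command is whatever tool suits your OS.

```shell
#!/bin/sh
# Sketch: run an observability tool in the background for the duration of a benchmark.
# run_observed and its arguments are hypothetical names for illustration.
# usage: run_observed LOGFILE "OBSERVER COMMAND" BENCHMARK [ARGS...]
run_observed() {
    log=$1
    observer=$2
    shift 2
    # Start the observer, sending its output to the log file.
    # exec replaces the wrapper shell, so the later kill reaches the tool itself.
    sh -c "exec $observer" > "$log" 2>&1 &
    obs_pid=$!
    # Run the benchmark in the foreground and remember its exit status.
    "$@"
    status=$?
    # Stop the observer once the benchmark has finished.
    kill "$obs_pid" 2>/dev/null
    wait "$obs_pid" 2>/dev/null
    return "$status"
}
```

For example, on Linux with sysstat installed, `run_observed mpstat.log "mpstat -P ALL 1" sysbench --num-threads=8 --test=cpu --cpu-max-prime=25000 run` would leave per-CPU utilization for the whole run in mpstat.log. Capturing to a file is only a convenience; watching the tools live in another terminal serves the same purpose.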
For networking, this is what I'd like to test, and why:
iperf works well for this. Example commands:
# server
iperf -s -l 128k
# client, 1 thread
iperf -c server_IP -l 128k -i 1 -t 30
# client, 2 threads
iperf -c server_IP -P 2 -l 128k -i 1 -t 30
Here I've included -i 1 to print per-second summaries, so I can watch for variance.
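Watching the per-second lines by eye works, but a quick summary makes variance easier to compare between instances. This sketch assumes you have already extracted the per-second throughput values into one number per line; summarize_samples is a hypothetical helper, and it deliberately does not parse iperf output, since that format differs between versions.

```shell
#!/bin/sh
# Sketch: summarize per-interval throughput samples, one number per line on stdin.
# summarize_samples is a hypothetical helper; extracting the numbers from
# the benchmark output is left to the reader.
summarize_samples() {
    awk '
        NF == 0 { next }                       # skip blank lines
        {
            x = $1 + 0
            if (n == 0 || x < min) min = x
            if (n == 0 || x > max) max = x
            sum += x; sumsq += x * x; n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            mean = sum / n
            var = sumsq / n - mean * mean      # population variance
            if (var < 0) var = 0               # guard against rounding error
            cov = (mean != 0) ? 100 * sqrt(var) / mean : 0
            printf "n=%d min=%.2f max=%.2f mean=%.2f cov=%.1f%%\n", n, min, max, mean, cov
        }'
}
```

For instance, `printf '9\n10\n11\n' | summarize_samples` prints `n=3 min=9.00 max=11.00 mean=10.00 cov=8.2%`. The coefficient of variation (standard deviation as a percentage of the mean) is one reasonable single number for run-to-run variance; the minimum and maximum show whether a few outlier seconds are responsible.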
While iperf is running, you'll want to analyze network and CPU usage. On Linux, I'd use nicstat, sar, and pidstat. On SmartOS, I'd use nicstat, mpstat, and prstat.
For file systems, this is what I'd like to test, and why:
By "medium", I mean a working set size somewhat larger than the instance memory size. E.g., for a 1 Gbyte instance, I'd create a total file set of 10 Gbytes, with a non-uniform access distribution so that it has a cache hit ratio in the 90%s. These characteristics are chosen to match what I've seen to be typical of the cloud. If you know what your total file size, working set size, and access distribution will be, then by all means test that instead.
I've been impressed by fio by Jens Axboe. Here's how I'd use it:
# throw-away: 5 min warm-up
fio --runtime=300 --time_based --clocksource=clock_gettime --name=randread \
    --numjobs=8 --rw=randread --random_distribution=pareto:0.9 --bs=8k \
    --size=10g --filename=fio.tmp
# file system random I/O, 10 Gbytes, 1 thread
fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread \
    --numjobs=1 --rw=randread --random_distribution=pareto:0.9 --bs=8k \
    --size=10g --filename=fio.tmp
# file system random I/O, 10 Gbytes, 8 threads
fio --runtime=60 --time_based --clocksource=clock_gettime --name=randread \
    --numjobs=8 --rw=randread --random_distribution=pareto:0.9 --bs=8k \
    --size=10g --filename=fio.tmp
This is all about finding the "Goldilocks" working set size. People often test too small or too big:
These tests are useful if and only if you can explain why the results are interesting: how they map to your production environment.
While fio is running, you'll want to analyze file system, disk, and CPU usage. On Linux, I'd use sar, iostat, pidstat, and a profiler (perf, ...). On SmartOS, I'd use vfsstat, iostat, prstat, and DTrace.
There are many more benchmarking tools for different targets. In my experience, it's best to assume that they are all broken or misleading until proven otherwise, by use of analysis and sanity checking.