Active Benchmarking in the Cloud

July 31, 2014 - by Max Bruning

Recently, Joyent published a series of benchmarks using YCSB (the Yahoo! Cloud Serving Benchmark) comparing the performance of MongoDB, PostgreSQL, and Elasticsearch between machines running in the Joyent Public Cloud and similarly configured machines running on Amazon Web Services.

When I looked at the results, my first impression was that the numbers looked good, but how could I tell what was really being benchmarked? To answer that, I needed to do "active benchmarking", which basically means analyzing the performance of the system while the benchmark is running. I decided to look specifically at the YCSB benchmark that used MongoDB.

Rather than using shards, a router, and the full setup used for the published results, I used a single machine in the Joyent cloud running the YCSB client against another machine in the Joyent cloud running MongoDB. The machine running YCSB is a standard 4GB instance, and the machine running MongoDB is a standard 7.5GB instance.

In this post, we'll first simply run the benchmark and watch the results. Then we'll run it again and do active benchmarking while it runs, to try to determine the limiting factors (i.e., the bottleneck for the test results). We'll also take a look at scaling issues.

First, we'll run the benchmark to load 6 million records, each record consisting of 10 fields of 100 bytes each, for about 1KB of data per record.

  # time ./bin/ycsb load mongodb -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
  YCSB Client 0.1
  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -load
  Loading workload...
  Starting test.
   0 sec: 0 operations;
  mongo connection created with 199.192.241.168:27017/ycsb
   10 sec: 17563 operations; 1753.49 current ops/sec; [INSERT AverageLatency(us)=302209.69]
   20 sec: 73255 operations; 5525.55 current ops/sec; [INSERT AverageLatency(us)=48674.83]
   30 sec: 147128 operations; 7385.82 current ops/sec; [INSERT AverageLatency(us)=106637.21]
   40 sec: 228286 operations; 8114.18 current ops/sec; [INSERT AverageLatency(us)=117745.16]
  ...
   640 sec: 4993046 operations; 8210.88 current ops/sec; [INSERT AverageLatency(us)=123054.32]
   650 sec: 5074073 operations; 8102.7 current ops/sec; [INSERT AverageLatency(us)=120990.57]
   660 sec: 5150469 operations; 7638.84 current ops/sec; [INSERT AverageLatency(us)=127621.01] [CLEANUP AverageLatency(us)=26]
   670 sec: 5223843 operations; 7336.67 current ops/sec; [INSERT AverageLatency(us)=144175.32] [CLEANUP AverageLatency(us)=6.33]
   680 sec: 5297258 operations; 7340.77 current ops/sec; [INSERT AverageLatency(us)=131909.57] [CLEANUP AverageLatency(us)=4.67]
   690 sec: 5376306 operations; 7904.01 current ops/sec; [INSERT AverageLatency(us)=124980.39] [CLEANUP AverageLatency(us)=5.62]
  ...
  [OVERALL], RunTime(ms), 771609.0
  [OVERALL], Throughput(ops/sec), 7775.9590673514695
  [INSERT], Operations, 6000000
  [INSERT], AverageLatency(us), 122792.53285266667
  [INSERT], MinLatency(us), 401
  [INSERT], MaxLatency(us), 7160628
  [INSERT], Return=0, 6000000
  [INSERT], 0, 4387109
  [INSERT], 1, 435791
  ...
  [CLEANUP], Operations, 1000
  [CLEANUP], AverageLatency(us), 70.878
  [CLEANUP], MinLatency(us), 1
  [CLEANUP], MaxLatency(us), 66032
  [CLEANUP], 95thPercentileLatency(ms), 0
  [CLEANUP], 99thPercentileLatency(ms), 0
  [CLEANUP], 0, 999
  [CLEANUP], 1, 0
  [CLEANUP], 2, 0
  ...
  real      12m52.267s
  user  13m33.122s
  sys   3m24.784s
  #

So loading 6,000,000 records took 12 minutes, 52 seconds, or about 7,800 inserts per second, which matches the reported overall throughput.

Now we'll run workloada (a mixture of 50/50 reads/writes).

  # time ./bin/ycsb run mongodb -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
  java -cp /root/YCSB-master/infinispan/src/main/conf:/root/YCSB-master/mongodb/target/mongodb-binding-0.1.4.jar:/root/YCSB-master/mongodb/target/archive-tmp/mongodb-binding-0.1.4.jar:/root/YCSB-master/jdbc/src/main/conf:/root/YCSB-master/gemfire/src/main/conf:/root/YCSB-master/core/target/core-0.1.4.jar:/root/YCSB-master/dynamodb/conf:/root/YCSB-master/nosqldb/src/main/conf:/root/YCSB-master/voldemort/src/main/conf:/root/YCSB-master/hbase/src/main/conf com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
  YCSB Client 0.1
  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
  Loading workload...
  Starting test.
   0 sec: 0 operations;
  mongo connection created with 199.192.241.168:27017/ycsb
   10 sec: 11377 operations; 1130.24 current ops/sec; [UPDATE AverageLatency(us)=747864.1] [READ AverageLatency(us)=379445.04]
   20 sec: 60430 operations; 4877.98 current ops/sec; [UPDATE AverageLatency(us)=86604.21] [READ AverageLatency(us)=69050.56]
   30 sec: 135550 operations; 7480.58 current ops/sec; [UPDATE AverageLatency(us)=96481.73] [READ AverageLatency(us)=94490]
   40 sec: 229001 operations; 9344.17 current ops/sec; [UPDATE AverageLatency(us)=68793.49] [READ AverageLatency(us)=64952.19]
   50 sec: 328821 operations; 9920.49 current ops/sec; [UPDATE AverageLatency(us)=57943.78] [READ AverageLatency(us)=61490.36]
   60 sec: 442787 operations; 11335.39 current ops/sec; [UPDATE AverageLatency(us)=68971.15] [READ AverageLatency(us)=60517.94]
   70 sec: 555053 operations; 11201.96 current ops/sec; [UPDATE AverageLatency(us)=60592.3] [READ AverageLatency(us)=33707.96]
   80 sec: 671583 operations; 11603.11 current ops/sec; [UPDATE AverageLatency(us)=74702.3] [READ AverageLatency(us)=46173.29]
   ...
   [OVERALL], RunTime(ms), 522700.0
  [OVERALL], Throughput(ops/sec), 11478.859766596519
  [UPDATE], Operations, 2999228
  [UPDATE], AverageLatency(us), 85524.36139733292
  [UPDATE], MinLatency(us), 394
  [UPDATE], MaxLatency(us), 14435871
  [UPDATE], 95thPercentileLatency(ms), 365
  [UPDATE], Return=0, 2999228
  [UPDATE], 0, 869530
  ...
  [READ], Operations, 3000772
  [READ], AverageLatency(us), 29704.56337702431
  [READ], MinLatency(us), 353
  [READ], MaxLatency(us), 14292934
  [READ], 95thPercentileLatency(ms), 115
  [READ], 99thPercentileLatency(ms), 345
  [READ], Return=0, 3000772
  [READ], 0, 706416
  ...
  [CLEANUP], Operations, 1000
  [CLEANUP], AverageLatency(us), 102.235
  [CLEANUP], MinLatency(us), 1
  [CLEANUP], MaxLatency(us), 81274
  [CLEANUP], 95thPercentileLatency(ms), 0
  [CLEANUP], 99thPercentileLatency(ms), 0
  [CLEANUP], 0, 998
  [CLEANUP], 1, 0
  ...
  real     8m43.573s
  user   13m53.838s
  sys   3m40.015s
  #

So of the 6 million read/write operations, 3,000,772 were reads and 2,999,228 were updates (close enough to a 50/50 split). Reads took about 29.7 msecs on average, with a minimum latency of 353 usecs and a maximum of about 14 seconds. Updates (writes) were slower on average, at about 85.5 msecs, with similar minimum and maximum latencies.

Let's try workloadb, a 95/5 split between reads and writes (95% reads).

  # time ./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
  ...
  YCSB Client 0.1
  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
  Loading workload...
  Starting test.
   0 sec: 0 operations;
  mongo connection created with 199.192.241.168:27017/ycsb
   10 sec: 39561 operations; 3937.99 current ops/sec; [UPDATE AverageLatency(us)=64134.53] [READ AverageLatency(us)=77735.02]
   20 sec: 128222 operations; 8806.22 current ops/sec; [UPDATE AverageLatency(us)=33152.18] [READ AverageLatency(us)=37736.67]
   30 sec: 202620 operations; 7342.87 current ops/sec; [UPDATE AverageLatency(us)=75466.98] [READ AverageLatency(us)=58739.72]
   40 sec: 286881 operations; 8426.1 current ops/sec; [UPDATE AverageLatency(us)=52512.36] [READ AverageLatency(us)=43191.99]
   50 sec: 381094 operations; 9420.36 current ops/sec; [UPDATE AverageLatency(us)=31639.92] [READ AverageLatency(us)=26998.68]
   60 sec: 483100 operations; 10199.58 current ops/sec; [UPDATE AverageLatency(us)=33720.43] [READ AverageLatency(us)=52603.76]
   70 sec: 587825 operations; 10418.32 current ops/sec; [UPDATE AverageLatency(us)=42287.56] [READ AverageLatency(us)=37719.84]
  ...
  [OVERALL], RunTime(ms), 559369.0
  [OVERALL], Throughput(ops/sec), 10726.37203706319
  [UPDATE], Operations, 299901
  [UPDATE], AverageLatency(us), 44403.52669714339
  [UPDATE], MinLatency(us), 379
  [UPDATE], MaxLatency(us), 3324334
  [UPDATE], 95thPercentileLatency(ms), 138
  [UPDATE], 99thPercentileLatency(ms), 740
  [UPDATE], Return=0, 299901
  [UPDATE], 0, 145122
  [UPDATE], 1, 14669
  ...
  [READ], Operations, 5700099
  [READ], AverageLatency(us), 37128.780924682185
  [READ], MinLatency(us), 329
  [READ], MaxLatency(us), 3876008
  [READ], 95thPercentileLatency(ms), 137
  [READ], 99thPercentileLatency(ms), 381
  [READ], Return=0, 5700099
  [READ], 0, 2938426
  [READ], 1, 193083
  ...
  [CLEANUP], Operations, 1000
  [CLEANUP], AverageLatency(us), 95.518
  [CLEANUP], MinLatency(us), 1
  [CLEANUP], MaxLatency(us), 91746
  [CLEANUP], 95thPercentileLatency(ms), 0
  [CLEANUP], 99thPercentileLatency(ms), 0
  [CLEANUP], 0, 999
  [CLEANUP], 1, 0
  [CLEANUP], 2, 0
  ...

  real        9m19.973s
  user   15m37.622s
  sys   3m28.291s
  #

Here there are 5,700,099 reads and 299,901 updates, roughly the 95/5 split that workloadb specifies. The remaining workloads yield similar results.

Now we'll run the benchmark and begin to do some active benchmarking. For starters, let's use prstat(1M) on the system where mongo is running.

  # prstat -c
  Please wait...
     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   24426 mongodb    70G  244M cpu21   60    0   3:06:37  12% mongod/2014
   48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1
   42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1
   48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1
   61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6
   41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1
   59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1
   61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1
   61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1
   59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
   41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1
   61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1
   61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3
   59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13
   61378 root     4340K 1496K sleep   59    0   0:00:10 0.0% sshd/1
  Total: 23 processes, 2100 lwps, load averages: 12.39, 7.57, 3.78
     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   24426 mongodb    70G  242M cpu22   52    0   3:06:49  12% mongod/2014
   76885 root     4064K 3276K cpu6    59    0   0:00:00 0.0% prstat/1
   48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1
   42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1
   48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1
   61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6
   41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1
   59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1
   61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1
   61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1
   59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
   41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1
   61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1
   61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3
   59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13
  Total: 23 processes, 2100 lwps, load averages: 12.02, 7.57, 3.80
  (^c)
  #

So mongod is using ~12% of the CPU on the machine. This machine has 24 CPUs, so that is roughly 2.9 CPUs at 100% utilization.

  # psrinfo
  0 on-line   since 11/22/2013 03:58:51
  1 on-line   since 11/22/2013 03:58:58
  2 on-line   since 11/22/2013 03:58:58
  3 on-line   since 11/22/2013 03:58:58
  4 on-line   since 11/22/2013 03:58:58
  5 on-line   since 11/22/2013 03:58:58
  6 on-line   since 11/22/2013 03:58:58
  7 on-line   since 11/22/2013 03:58:59
  8 on-line   since 11/22/2013 03:58:59
  9 on-line   since 11/22/2013 03:58:59
  10  on-line   since 11/22/2013 03:58:59
  11  on-line   since 11/22/2013 03:58:59
  12  on-line   since 11/22/2013 03:58:59
  13  on-line   since 11/22/2013 03:58:59
  14  on-line   since 11/22/2013 03:58:59
  15  on-line   since 11/22/2013 03:58:59
  16  on-line   since 11/22/2013 03:58:59
  17  on-line   since 11/22/2013 03:58:59
  18  on-line   since 11/22/2013 03:58:59
  19  on-line   since 11/22/2013 03:58:59
  20  on-line   since 11/22/2013 03:58:59
  21  on-line   since 11/22/2013 03:58:59
  22  on-line   since 11/22/2013 03:58:59
  23  on-line   since 11/22/2013 03:58:59
  #

Let's take a quick look at other statistics (vmstat, vfsstat, mpstat, and iostat) on the system running mongo.

   # vmstat 1
   kthr      memory            page            disk          faults      cpu
   r b w   swap  free  re  mf pi po fr de sr rm s0 s1 zn   in   sy   cs us sy id
   0 0 118 87846368 6706592 4408 50302 5 57 58 0 29 -5789 -1 74 74 18565 68963 14973 10 9 81
   0 0 175 81510632 9860540 8639 97242 0 0 0 0 0 0 0 31 31 28396 129491 48858 13 9 78
   0 0 175 81454632 9853364 8364 94341 0 0 0 0 0 0 0 18 18 31016 125756 47106 14 9 77
   0 0 175 81518144 9862564 3537 23890 0 0 0 0 0 20 0 188 188 28130 139412 50791 17 6 77
   0 0 175 81517996 9862384 840 3519 0 0 0 0 0 0 0 25 25 25988 132854 56871 11 4 85
   0 0 175 81518020 9862768 3545 22590 0 0 0 0 0 0 0 15 15 24562 131160 51082 7 5 88
   0 0 175 81508880 9860908 5840 79660 0 0 0 0 0 0 0 30 30 29503 124580 47687 12 8 80
  (^C)
  #

The system as a whole is ~80% idle. There is no memory pressure on the system (free memory is plentiful and the scan rate, sr, is essentially zero), but there are lots of minor page faults (the mf column). We may come back to this later.
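
To see which processes are generating those faults, the vminfo provider can attribute them by process name. This is a sketch in the same spirit as the other one-liners here, not output from the original runs; if the faults come from mongod touching its memory-mapped data files, mongod should dominate the output:

  # dtrace -n 'vminfo:::as_fault{@[execname] = count();} tick-10s{exit(0);}'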

Here's vfsstat:

  # vfsstat 1
    r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
   76.1  16.5 243.5  55.4   0.0   0.0    2.8   28.6   0   0   0.2   12.4 ded0419b (239)
   43.7   0.0   4.9   0.0   0.0   0.0    2.9    0.0   0   0   0.0    0.0 ded0419b (239)
   43.7   0.0   4.9   0.0   0.0   0.0    3.0    0.0   0   0   0.0    0.0 ded0419b (239)
   43.5   5.8   4.9 417.6   0.0   0.0    3.0  128.1   0   0   0.0    0.0 ded0419b (239)
   43.3   9.6   4.8 639.6   0.0   0.0    2.9  117.1   0   0   0.0    0.0 ded0419b (239)
  (^C)
  #

So there is not much file system activity in the zone where mongod is running, and the zone is not being I/O throttled (the 'd/s' and 'del_t' columns in the output). This is to be expected: unlike many databases, MongoDB uses memory-mapped files rather than read(2)/write(2) for its database data, and it periodically syncs those files to backing store.
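
One quick way to confirm the memory-mapped data files from inside the zone is pmap(1). This is only a suggestion, not output from the original runs, and it assumes the data files are named after the ycsb database:

  # pmap -x `pgrep mongod` | grep ycsb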

Now mpstat:

  # mpstat 1
  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys wt idl
  ...
  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    0 20718   0    4  2637  117 4506  190  394  245    0  8766   24  42   0  35
    1 3787   0   44   512   54 5167  114  276  144    0 10722   18  14   0  68
    2 13064   0   11   672   39 8919  364  659  206    0 17084   21  24   0  55
    3 1338   0   46   897  186 7578  270  378  206    0 13350   13  13   0  74
    4 5740   0   11   617   49 6637  281  467  155    0 10217   13  15   0  72
    5  134   0   21   427   56 4342   91  238   82    0  5418    6   7   0  87
    6 3418   0   11   780  100 8558  326  553  171    0 13850   15  16   0  69
    7  425   0   23   679   37 8544  248  373  204    0 15762   12  11   0  77
    8 1619   0    1   710   52 8241  341  430  115    0 13678   18  10   0  71
    9   51   0   25   458   41 5290  197  262  176    0  7459    9   9   0  82
   10 2476   0    4   748   66 9833  458  607  165    0 18239   18  13   0  68
   11 1378   0    1   830   75 9630  546  380  147    0 13152   13  12   0  75
   12 4315   0    1   620   65 5900  127  479  133    0 11299   20  15   0  66
   13 1903   0    8   472  105 5247   91  309  191    0 10625   15  13   0  72
   14 2016   0    5   390   30 5982  169  376  139    0 10792   18  16   0  66
   15 6371   0   53   630  271 3790   50  202  190    0  7473   10  13   0  77
   16 4501   0    5 15233 14632 4090  132  365  413    0  8432   32  37   0  31
   17 2962   0    9  7298 6753 3446   83  234  329    0 13258   47  20   0  33
   18 4545   0  494   402   26 4010   94  260  163    0  7613   34  21   0  45
   19    2   0  645   440  114 4668   47  219  113    0  8205   12   6   0  82
   20 1029   0   52   840  159 9325  278  606  117    0 18382   25  13   0  62
   21 1120   0 1431   348   71 6572   94  325  140    0 15726   22  12   0  67
   22 5110   0    4   886  177 6302  162  377  120    0 12207   21  13   0  66
   23 2075   0 1246   304   45 3236   75  193   99    0  7788   15   9   0  76
  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    0 1702   0   24  2553  140 8653  172  516  152    0 15491   14  13   0  73
    1  520   0  624   475   79 6676   99  317   79    0 12575   14   8   0  78
    2 2752   0   71   517   36 7173  151  465  103    0 14320   17  20   0  63
    3  153   0   67   384   34 5310  181  261  141    0  8308   11   7   0  82
    4  107   0   17   547   42 6262  218  368   97    0  8531   10   8   0  82
    5   11   0   25   410   62 3691   78  210   64    0  4769    4   4   0  92
    6 2216   0   38   774   97 7905  297  453  149    0 13275   12  11   0  77
    7    7   0   17   561   52 7084  216  322   97    0 11671   14   8   0  78
    8   18   0   58   508   31 5952  200  355   76    0  9057    9   7   0  84
    9    6   0   55   396   57 5069  129  219  104    0  6318    5   5   0  90
   10   42   0   79   644   65 8559  259  501  124    0 14886   12   9   0  79
   11  385   0   87   756  102 9742  472  350   85    0 12346   10   9   0  81
   12 1541   0   62   477   74 7014  116  543   92    0 14726   15  10   0  75
   13  944   0   18   356  103 5551   48  329   96    0  9791    8   7   0  85
   14    2   0   14   316   60 5367   74  347   62    0  9730   12  14   0  74
   15  905   0    8   425   41 2503   67  128   43    0  5236   42   4   0  54
   16    8   0   21 14268 13527 5552  135  413  294    0 11511   20  30   0  50
   17    5   0   48  6972 6357 5826   93  304  337    0 11184   23  18   0  59
   18    8   0  509   314   51 5479   81  299   60    0 10575   18  10   0  72
   19    7   0  804   564  202 4123   31  194   65    0 13360    7   6   0  88
   20    2   0   75   758  157 6378  152  356   64    0 11863   26  23   0  52
   21 2287   0  453   339   56 5706   69  256   69    0 11302   21   8   0  70
   22    3   0   64   867  172 5633   76  286   88    0 10716   11  18   0  71
   23   63   0  302   305   74 4025   47  201   46    0  7674    8   4   0  87
  #

The machine is busy, but over 50% idle. CPUs 16 and 17 are handling most of the device interrupts (the intr and ithr columns), but nothing here looks unusual.

What system calls are being made by mongo?

  # dtrace -n 'syscall:::entry/pid == $target/{@[probefunc]=count();} tick-60s{exit(0);}' -p `pgrep mongod`
  dtrace: description 'syscall:::entry' matched 235 probes
  CPU     ID                    FUNCTION:NAME
   16  76285                        :tick-60s

    getpid                                                            1
    pwrite                                                            1
    ioctl                                                             2
    read                                                              2
    close                                                             3
    open                                                              3
    nanosleep                                                        12
    mmap                                                             30
    memcntl                                                          47
    mincore                                                          78
    fdsync                                                          527
    getcwd                                                          527
    lseek                                                           529
    fstat                                                           531
    lwp_mutex_timedlock                                             998
    lwp_mutex_wakeup                                               1137
    yield                                                          4617
    pollsys                                                        5964
    write                                                          6124
    gtime                                                         28018
    send                                                         477650
    recv                                                         956294
    lwp_park                                                    3121961
    clock_gettime                                               3590578
  #

The (undocumented) lwp_park() system call is used for user level synchronization, and is called when a user level thread needs to block for a user level mutex, reader/writer lock, or condition variable. The clock_gettime system call is made frequently, but should not consume too much time. The send/recv calls are interesting. The recv(2) call is done to retrieve requests from clients, and send(2) calls are made to return responses to clients.
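
The counts alone don't say where the time goes. A follow-up one-liner in the same spirit (a sketch, not from the original runs) sums the time spent in each system call over a minute; note that blocking calls such as lwp_park and pollsys will dominate the totals because the sums include time spent sleeping:

  # dtrace -n 'syscall:::entry/pid == $target/{self->ts = timestamp;}
      syscall:::return/self->ts/{@[probefunc] = sum(timestamp - self->ts); self->ts = 0;}
      tick-60s{exit(0);}' -p `pgrep mongod`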

Snooping the network, saving the capture to a file, and examining it with Wireshark shows that the recv/send calls are almost certainly carrying the MongoDB Wire Protocol. Searching the internet for "Mongo Wire Protocol" yields http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/. Based on that document, we can write a DTrace script to watch the traffic; the following script snoops the send/recv calls and displays the message headers.

  #!/usr/sbin/dtrace -s

  /* assumes packets to/from mongodb */

  #pragma D option quiet

  /*
   * MsgHeader layout, from
   * http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/
   */
  struct MsgHeader {
      int32_t messageLength; /* total message size, including this */
      int32_t requestID;     /* identifier for this message */
      int32_t responseTo;    /* requestID from the original request */
                             /*   (used in responses from db) */
      int32_t opCode;        /* request type - see the protocol document */
  };

  BEGIN
  {
    printf("%10s: %-10s %-10s %-10s %-10s\n",
      "Direction", "Length", "ReqID", "ResTo", "opCode");
  }

  /*
   *  When send/recv(2) are called, assume packet is in "mongo wire protocol"
   *  (see
   *    http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
   */

  syscall::send:entry
  /pid == $target/
  {
    self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
    self->mL = self->smh->messageLength;
    self->reqID = self->smh->requestID;
    self->resTo = self->smh->responseTo;
    self->opCode = self->smh->opCode;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0/
  {
    printf("%10s: %-10d %-10d %-10d %-10x\n",
      probefunc, self->mL, self->reqID, self->resTo, self->opCode);
    self->smh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
  }

  syscall::recv:entry
  /pid == $target/
  {
    self->a1 = arg1;  /* save user address of packet */
  }

  syscall::recv:return  /* now the packet is there. */
  /self->a1/
  {
    self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
    self->mL = self->rmh->messageLength;
    self->reqID = self->rmh->requestID;
    self->resTo = self->rmh->responseTo;
    self->opCode = self->rmh->opCode;

  }

  syscall::recv:return
  /self->a1 && self->mL > 0/
  {
    printf("%10s: %-10d %-10d %-10d %-10d\n",
      probefunc, self->mL, self->reqID, self->resTo, self->opCode);
    self->a1 = 0;
    self->rmh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
  }

Running the above script while the ycsb benchmark is running shows:

  # ./mongo_snoop.d -p `pgrep mongod`
   Direction: Length     ReqID      ResTo      opCode
        recv: 61         962617     0          2004
        recv: 61         961087     0          2004
        recv: 61         961883     0          2004
        send: 110        480867     962617     1
        send: 110        480868     961087     1
        recv: 61         962405     0          2004
        send: 110        480869     961883     1
        recv: 61         961498     0          2004
        send: 110        480870     962405     1
        send: 110        480871     961498     1
        recv: 61         962013     0          2004
        recv: 61         962198     0          2004
        send: 110        480872     962013     1
        send: 110        480873     962198     1
        recv: 61         961709     0          2004
        recv: 1190       962736     0          2002
        recv: 1190       962735     0          2002
        recv: 61         961907     0          2004
        send: 110        480874     961709     1
        send: 110        480875     961907     1
        recv: 1190       962739     0          2002
        recv: 61         961771     0          2004
      ...

In the above output, OpCode 1 is OP_REPLY, a reply sent to a client request. The ResTo field matches the ReqID that was received from the client. opCode 2004 is OP_QUERY, and 2002 is OP_INSERT. The benchmark is currently loading the database.
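
The same header decoding can be used to summarize the operation mix rather than print every message. Here is a minimal sketch along the same lines (it reuses the MsgHeader layout above, names its aggregation @ops, and shares the original scripts' assumption that each recv(2) starts on a message boundary) that counts requests by opCode over one minute:

  #!/usr/sbin/dtrace -s

  #pragma D option quiet

  struct MsgHeader {
      int32_t messageLength;
      int32_t requestID;
      int32_t responseTo;
      int32_t opCode;        /* 1 = OP_REPLY, 2002 = OP_INSERT, 2004 = OP_QUERY, ... */
  };

  syscall::recv:entry
  /pid == $target/
  {
    self->a1 = arg1;  /* save user address of packet */
  }

  syscall::recv:return
  /self->a1 && arg1 > 0/
  {
    this->mh = (struct MsgHeader *)copyin(self->a1, sizeof (struct MsgHeader));
    @ops[this->mh->opCode] = count();
    self->a1 = 0;
  }

  tick-60s
  {
    printa("opCode %d: %@d requests\n", @ops);
    exit(0);
  }

Running it during the load phase should be dominated by OP_INSERT (2002), and during the run phase by whatever the client actually sends for reads and updates.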

Here is a script that measures the latency between a request being received from a client and the response being sent for that request.

  #!/usr/sbin/dtrace -s

  /* assumes packets to/from mongodb */

  #pragma D option quiet

  struct MsgHeader {
      int32_t messageLength; /* total message size, including this */
      int32_t requestID;     /* identifier for this message */
      int32_t responseTo;    /* requestID from the original request */
                             /*   (used in responses from db) */
      int32_t opCode;        /* request type - see the protocol document */
  };

  BEGIN
  {
   printf("wait 1 minute...");
  }

  /*
   *  When send/recv(2) are called, assume packet is in "mongo wire protocol"
   *  (see
   *    http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
   */

  syscall::recv:entry
  /pid == $target/
  {
   self->a1 = arg1;  /* save user address of packet */
  }

  syscall::recv:return  /* now the packet is there. */
  /self->a1 && arg1 > 0/  /* arg1 = number of bytes received */
  {
       self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
       self->mL = self->rmh->messageLength;
       self->reqID = self->rmh->requestID;
  }

  syscall::recv:return
  /self->a1 && self->mL > 0/
  {
       t[self->reqID] = timestamp;
       self->a1 = 0;
       self->rmh = 0;
       self->mL = 0;
       self->reqID = 0;
  }

  syscall::send:entry
  /pid == $target/
  {
     self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
     self->mL = self->smh->messageLength;
     self->resTO = self->smh->responseTo;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0 && t[self->resTO] != 0/
  {
          @d["distribution in nsecs"] = quantize(timestamp-t[self->resTO]);
          @a["avg"] = avg(timestamp-t[self->resTO]);
          @t["total requests handled"] = count();
          t[self->resTO] = 0;
          self->smh = 0;
          self->mL = 0;
          self->resTO = 0;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0/
  {
          self->smh = 0;
          self->mL = 0;
          self->resTO = 0;
  }

  tick-60s
  {
        exit(0);
  }

This script runs for one minute and shows the distribution of times taken to handle database requests. Running the script shows:

  # ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
  wait 1 minute...
    distribution in nsecs
             value  ------------- Distribution ------------- count
             16384 |                                                  0
             32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              304521
             65536 |@@@@@@@@@@@@                             133288
            131072 |@                                        5643
            262144 |                                         271
            524288 |                                         6
           1048576 |                                         0

    avg                                                           65688
    total requests handled                                       443729
  #

So during the load, requests are handled inside mongod in roughly 33 to 524 microseconds, most of them in under 131 microseconds. Further information can be retrieved from other data in the requests and replies, but that is left as an exercise. The dynvarsize setting is there to get rid of "dynamic variable drops" messages.

Here is the output of the above script while the benchmark is running against the database (as opposed to loading it).

  # ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
  wait 1 minute...
    distribution in nsecs
             value  ------------- Distribution ------------- count
             16384 |                                         0
             32768 |@@@@@@@@@@@@@@                           181524
             65536 |@@@@@@@@@@@@                             155129
            131072 |@@@@@                                    64382
            262144 |@@@                                      39143
            524288 |@@@                                      38864
           1048576 |@@                                       25175
           2097152 |                                         5632
           4194304 |                                         1714
           8388608 |                                         451
          16777216 |                                         49
          33554432 |                                         0
          67108864 |                                         0
         134217728 |                                         100
         268435456 |                                         369
         536870912 |                                         0

    avg                                                          548253
    total requests handled                                       512532
  #

Most of the requests take between roughly 33 microseconds and 1 millisecond, but some take between 134 and 537 milliseconds. The average latency is much higher when running the workload than when loading the database. Another thing to note is that while the database is loading, there are many OP_INSERT requests from the YCSB client; when it is running the workload, the client only sends OP_QUERY requests (seen by running the mongo_snoop.d script from earlier). Most likely, the longer requests are updates rather than lookups, but that has not been verified (and is left to the reader as an exercise).

As an experiment, I ran both ycsb and mongo on the same machine, and saw much higher ops/second numbers (over 30000). Running two ycsb clients on different machines did not yield a linear improvement (only around a 20% increase in total ops/second). More experimentation is needed here. It would be interesting to see the effects of multiple clients on the virtual machines actually used for the published results, but I have not tried this.

What happens if we increase the size of the database? Instead of a 6GB database, what if the database were larger than the memory of the virtual machine (8GB)? Trying a 10GB database introduced some problems.

Let's go back to prstat and watch the mongod process specifically, but using the larger database.

  # prstat -cmLp `pgrep mongod`
     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   24426 mongodb  2.0 0.8 0.0 0.0 0.0 2.1  95 0.3  2K 109 11K   0 mongod/22159
   24426 mongodb  1.7 0.7 0.0 0.0 0.0 1.6  96 0.2  2K  32 10K   0 mongod/22115
   24426 mongodb  1.6 0.6 0.0 0.0 0.0 1.7  96 0.2  1K  68  9K   0 mongod/22296
   24426 mongodb  1.4 0.5 0.0 0.0 0.0 1.3  97 0.2  1K  61  7K   0 mongod/21496
   24426 mongodb  1.3 0.5 0.0 0.0 0.0 2.0  96 0.2  1K  55  7K   0 mongod/22240
   24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  59  6K   0 mongod/21649
   24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  61  6K   0 mongod/22153
   24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  52  6K   0 mongod/22293
   24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.0  97 0.1  1K  57  6K   0 mongod/22042
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  43  5K   0 mongod/21461
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.5  97 0.1  1K  36  5K   0 mongod/22175
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.9  98 0.1  1K  17  6K   0 mongod/22073
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.6  98 0.1  1K  25  5K   0 mongod/22105
   24426 mongodb  1.0 0.3 0.0 0.0 0.0 1.3  97 0.1 986  45  5K   0 mongod/22273
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.7  98 0.1 964  13  5K   0 mongod/21529
  Total: 1 processes, 982 lwps, load averages: 0.39, 0.86, 3.93
     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   24426 mongodb  1.2 0.4 0.0 0.0 0.0 0.8  97 0.1  1K  26  6K   0 mongod/22041
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.8  98 0.1  1K  18  5K   0 mongod/22040
   24426 mongodb  0.9 0.3 0.0 0.0 0.0 0.7  98 0.1 923  26  4K   0 mongod/21844
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 860  17  4K   0 mongod/22089
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 907  32  4K   0 mongod/22175
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.6  98 0.1 869  18  4K   0 mongod/22121
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.5  98 0.1 837  22  4K   0 mongod/22028
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 802  38  4K   0 mongod/22151
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 858  18  4K   0 mongod/21407
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 898  11  4K   0 mongod/22013
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 798  45  4K   0 mongod/22230
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.8  98 0.1 785  35  4K   0 mongod/22153
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 870  40  4K   0 mongod/22248
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 784  43  4K   0 mongod/22245
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.0  98 0.1 872  28  4K   0 mongod/22195
  Total: 1 processes, 982 lwps, load averages: 0.46, 0.87, 3.91
  (^C)
  #

The mongod process has, at the time this was run, 982 threads. Most of these threads spend ~98% of their time sleeping. There is a little contention for CPUs (the LAT column is non-zero), and some time is spent waiting on locks (the LCK column shows a percent or two). Letting this run for a longer time shows that most threads keep sleeping most of the time, but threads 3 and 5 periodically spend much of their time on locks and in system mode. This is a periodic activity. Here is prstat output from a period when threads 3 and 5 are busy.

     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.9  96 0.3  1K  53  5K   0 mongod/22503
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 1.8  97 0.2  1K  94  5K   0 mongod/22852
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 936  24  4K   0 mongod/22540
   24426 mongodb  0.7 0.4 0.0 0.0 0.0  99 0.3 0.0 273  11 975   0 mongod/5
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.1  98 0.1 881  56  4K   0 mongod/22700
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 2.0  97 0.2  1K  64  4K   0 mongod/22657
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.7  97 0.1 917  19  4K   0 mongod/22952
   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 793  13  4K   0 mongod/22444
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.2  98 0.1 924  31  3K   0 mongod/23237
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.1 757  39  3K   0 mongod/22353
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 838  13  3K   0 mongod/22973
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.2 789  51  3K   0 mongod/22821
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.6  97 0.1 876  18  3K   0 mongod/23127
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 0.9  98 0.1 740  20  3K   0 mongod/22590
   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 757  26  3K   0 mongod/22917
  Total: 1 processes, 1013 lwps, load averages: 1.14, 0.39, 0.18
     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   24426 mongodb  0.0  38 0.0 0.0 0.0  61 0.0 0.0   8  81  15   0 mongod/3
   24426 mongodb  1.6 0.7 0.0 0.0 0.0 5.2  92 0.2  2K  47 11K   0 mongod/23059
   24426 mongodb  1.3 0.5 0.0 0.0 0.0 6.5  91 0.2  1K  84  8K   0 mongod/22978
   24426 mongodb  1.1 0.4 0.0 0.0 0.0 6.9  91 0.2  1K 103  6K   0 mongod/23061
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.0  92 0.1  1K  43  6K   0 mongod/23182
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 4.6  94 0.1  1K  27  6K   0 mongod/22815
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.4  92 0.2  1K 122  6K   0 mongod/22393
   24426 mongodb  1.0 0.4 0.0 0.0 0.0 5.7  93 0.1  1K  31  6K   0 mongod/22391
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  61  6K   0 mongod/22802
   24426 mongodb  1.0 0.3 0.0 0.0 0.0 6.4  92 0.1  1K  85  6K   0 mongod/23270
   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.3  92 0.1  1K 109  6K   0 mongod/22871
   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.6  92 0.1  1K  85  5K   0 mongod/22916
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 6.1  93 0.1  1K  26  6K   0 mongod/22517
   24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  59  6K   0 mongod/22864
   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.4  92 0.2  1K  97  6K   0 mongod/22459
  Total: 1 processes, 1013 lwps, load averages: 1.20, 0.42, 0.19
     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
   24426 mongodb  0.0  59 0.0 0.0 0.0  40 0.1 0.0  19 230  18   0 mongod/3
   24426 mongodb  0.8 0.3 0.0 0.0 0.0  28  71 0.1 936  39  4K   0 mongod/22526
   24426 mongodb  0.8 0.3 0.0 0.0 0.0  24  75 0.1 889  22  4K   0 mongod/23089
   24426 mongodb  0.7 0.3 0.0 0.0 0.0  25  74 0.1 809  13  4K   0 mongod/22838
   24426 mongodb  0.7 0.3 0.0 0.0 0.0  27  72 0.1 946  23  4K   0 mongod/23077
   24426 mongodb  0.7 0.3 0.0 0.0 0.0  24  75 0.1 859  22  3K   0 mongod/23274
   24426 mongodb  0.6 0.3 0.0 0.0 0.0  29  70 0.1 933  40  4K   0 mongod/22814
   24426 mongodb  0.6 0.4 0.0 0.0 0.0  77  22 0.0 267  15 800   0 mongod/5
   24426 mongodb  0.6 0.2 0.0 0.0 0.0  29  70 0.1 645  32  3K   0 mongod/22790
   24426 mongodb  0.6 0.2 0.0 0.0 0.0  27  72 0.2 701  11  3K   0 mongod/22600
   24426 mongodb  0.6 0.2 0.0 0.0 0.0  28  71 0.1 832  28  3K   0 mongod/22413
   24426 mongodb  0.6 0.2 0.0 0.0 0.0  25  74 0.1 560   5  3K   0 mongod/22415
   24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.2 651  28  3K   0 mongod/23035
   24426 mongodb  0.5 0.2 0.0 0.0 0.0  27  72 0.1 544  38  2K   0 mongod/22738
   24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.1 594  22  3K   0 mongod/23273
  Total: 1 processes, 1013 lwps, load averages: 1.23, 0.44, 0.20

Let's see what thread 3 is doing. When it runs, it spends a lot of time in system mode (38 then 59 percent in the above output). We'll use DTrace to profile mongod to see what it is doing while in system mode.

  # dtrace -n 'profile-97/pid == $target && tid == 3 && arg0/{@[stack()] = count();}
      tick-60s{exit(0);}' -p `pgrep mongod` -o /var/tmp/mongod.tid3.out
  #

And I'll download the output file and convert it to a flamegraph (see https://github.com/davepacheco/node-stackvis).

  # stackvis < mongod.tid3.out > mongod.tid3.htm
  #

Opening the flamegraph shows that thread id 3 is spending almost all of its time in the memcntl(2) system call. (See http://us-east.manta.joyent.com/maxb/public/mongod.tid3.htm). Let's take a look at how memcntl(2) gets called.

  # dtrace -n 'syscall::memcntl:entry/pid == $target && tid == 3/{@[ustack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod`
  dtrace: description 'syscall::memcntl:entry' matched 2 probes
  CPU     ID                    FUNCTION:NAME
   13  76285                        :tick-60s

                libc.so.1`memcntl+0xa
                libc.so.1`msync+0xc9
                mongod`_ZN5mongo14PosixFlushable5flushEv+0x3e
                mongod`_ZN5mongo9MongoFile9_flushAllEb+0xfa
                mongod`_ZN5mongo9MongoFile8flushAllEb+0x2b
                mongod`_ZN5mongo12DataFileSync3runEv+0xf3
                mongod`_ZN5mongo13BackgroundJob7jobBodyEv+0xe6
                libboost_thread.so.1.55.0`thread_proxy+0x66
                libc.so.1`_thrp_setup+0x8a
                libc.so.1`_lwp_start
                 19
  #

Taking a look at the log file for mongod (/var/log/mongodb/mongod.log), we see many entries where mongod took a long time to flush the data files.

  # grep -i flush mongod.log
  2014-06-23T16:12:40.792+0000 [DataFileSync] flushing mmaps took 20970ms  for 33 files
  2014-06-23T18:57:32.343+0000 [DataFileSync] flushing mmaps took 12508ms  for 33 files
  2014-07-19T17:51:37.175+0000 [DataFileSync] flushing mmaps took 11880ms  for 33 files
  ...
  #

Examining the MongoDB source code shows that the flush uses the msync(3C) library function to do the syncing (see src/mongo/util/mmap_posix.cpp). On SmartOS, msync in turn calls the memcntl(2) system call. And the larger the database, the longer the flush (i.e., msync) takes.

The msync() function takes three arguments: a starting address, a length, and flags. On SmartOS, the kernel does a page lookup for every page in the range from the starting address to the starting address plus the length. For the database loaded and run here, with 6 million 1KB records (roughly 6GB of data), that is 6GB / 4KB, or about 1.5 million page lookups on every flush.

As the database gets larger, this can affect performance. Fortunately, in the case of MongoDB there is another, much faster way to sync the database to backing store: fsync(3C). While msync flushes a range of modified pages to disk, fsync does the same for the modified pages of a file, and for MongoDB the end result is equivalent. Each msync call covers a 2GB region of memory that is mmap'd to a 2GB data file, so instead of calling msync with the memory range, fsync can be called with the file descriptor of the mmap'd file. With fsync there is no need to do per-page lookups: the pages hang off the vnode_t for the file, which can be retrieved quickly from the file descriptor. This gets rid of the long flush times.
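
To put numbers on the two flush paths, one could time both system calls on a running mongod; msync shows up as memcntl(2) and fsync as fdsync, as in the earlier syscall counts. This is a sketch of how such a measurement might look, not output from the original runs:

  # dtrace -n 'syscall::memcntl:entry,syscall::fdsync:entry/pid == $target/{self->ts = timestamp;}
      syscall::memcntl:return,syscall::fdsync:return/self->ts/{@[probefunc] = quantize(timestamp - self->ts); self->ts = 0;}
      tick-60s{exit(0);}' -p `pgrep mongod`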

Another, more difficult problem is that once the database grows past the 8GB virtual machine size, the memory capper for the zone kicks in. Performance at that point was two orders of magnitude worse than when the machine was below the memory cap; basically, the benchmark did not scale well with respect to the size of the database. This led my colleague, Jerry Jelinek, to take a look at the memory capper. Two changes were made: the capper no longer stops processes while they are being paged out, and it resumes where it left off.

These changes have made a vast difference: we no longer see performance drop precipitously as the database grows. Much more analysis can be done from this benchmark, but maybe in another blog post (and maybe with a different database).
