Active Benchmarking in the Cloud

Recently, Joyent published a series of benchmarks using YCSB (the Yahoo! Cloud Serving Benchmark) comparing the performance of MongoDB, PostgreSQL, and Elasticsearch between machines running in the Joyent Public Cloud and similarly configured machines running on Amazon Web Services.

When I looked at the results, my first impression was that the numbers look good, but how can I tell what is really being benchmarked? To do this, I needed to do "active benchmarking". This basically means that I wanted to analyze the performance of the system while it was being tested. I decided to look specifically at the YCSB benchmark that used MongoDB.

Rather than use shards, a router, and the entire setup that is used in the published results, I used a single machine in the Joyent cloud that ran the YCSB benchmark with another machine in the Joyent cloud that was running MongoDB. The machine running YCSB is a standard 4GB instance, and the machine running MongoDB is a standard 7.5GB system.

In this blog, we'll first simply run the benchmark and watch the results. Then we'll do it again, but do active benchmarking while the benchmark is running. We'll do this to try to determine the limiting factors (i.e., what the bottleneck is for the test results). We'll also take a look at scaling issues.

First, we'll run the benchmark to load 6 million records, each record consisting of 10 fields of 100 bytes each, for about 1KB of data per record.
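
The 10 fields of 100 bytes come from YCSB's CoreWorkload defaults (fieldcount=10, fieldlength=100). For reference, a stock workloada properties file looks roughly like the following; the exact contents may differ in your YCSB checkout, and recordcount/operationcount are overridden on the command line below.

  # cat workloads/workloada
  recordcount=1000
  operationcount=1000
  workload=com.yahoo.ycsb.workloads.CoreWorkload
  readallfields=true
  readproportion=0.5
  updateproportion=0.5
  scanproportion=0
  insertproportion=0
  requestdistribution=zipfian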

  # time ./bin/ycsb load mongodb -s -P workloads/workloada -p recordcount=6000000 -p  mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p   operationcount=6000000  YCSB Client 0.1  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -load  Loading workload...  Starting test.   0 sec: 0 operations;  mongo connection created with 199.192.241.168:27017/ycsb   10 sec: 17563 operations; 1753.49 current ops/sec; [INSERT AverageLatency(us)=302209.69]   20 sec: 73255 operations; 5525.55 current ops/sec; [INSERT AverageLatency(us)=48674.83]   30 sec: 147128 operations; 7385.82 current ops/sec; [INSERT AverageLatency(us)=106637.21]   40 sec: 228286 operations; 8114.18 current ops/sec; [INSERT AverageLatency(us)=117745.16]  ...   640 sec: 4993046 operations; 8210.88 current ops/sec; [INSERT AverageLatency(us)=123054.32]   650 sec: 5074073 operations; 8102.7 current ops/sec; [INSERT AverageLatency(us)=120990.57]   660 sec: 5150469 operations; 7638.84 current ops/sec; [INSERT AverageLatency(us)=127621.01] [CLEANUP AverageLatency(us)=26]   670 sec: 5223843 operations; 7336.67 current ops/sec; [INSERT AverageLatency(us)=144175.32] [CLEANUP AverageLatency(us)=6.33]   680 sec: 5297258 operations; 7340.77 current ops/sec; [INSERT AverageLatency(us)=131909.57] [CLEANUP AverageLatency(us)=4.67]   690 sec: 5376306 operations; 7904.01 current ops/sec; [INSERT AverageLatency(us)=124980.39] [CLEANUP AverageLatency(us)=5.62]  ...  [OVERALL], RunTime(ms), 771609.0  [OVERALL], Throughput(ops/sec), 7775.9590673514695  [INSERT], Operations, 6000000  [INSERT], AverageLatency(us), 122792.53285266667  [INSERT], MinLatency(us), 401  [INSERT], MaxLatency(us), 7160628  [INSERT], Return=0, 6000000  [INSERT], 0, 4387109  [INSERT], 1, 435791  ...  [CLEANUP], Operations, 1000  [CLEANUP], AverageLatency(us), 70.878  [CLEANUP], MinLatency(us), 1  [CLEANUP], MaxLatency(us), 66032  [CLEANUP], 95thPercentileLatency(ms), 0  [CLEANUP], 99thPercentileLatency(ms), 0  [CLEANUP], 0, 999  [CLEANUP], 1, 0  [CLEANUP], 2, 0  ...  real      12m52.267s  user  13m33.122s  sys   3m24.784s  #

So loading 6,000,000 records took 12 minutes, 52 seconds.
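
(As a quick sanity check: 6,000,000 records in 771.6 seconds works out to about 7,776 ops/sec, which matches the [OVERALL] throughput reported above.)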

Now we'll run workloada (a mixture of 50/50 reads/writes).

  # time ./bin/ycsb run mongodb -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000  java -cp /root/YCSB-master/infinispan/src/main/conf:/root/YCSB-master/mongodb/target/mongodb-binding-0.1.4.jar:/root/YCSB-master/mongodb/target/archive-tmp/mongodb-binding-0.1.4.jar:/root/YCSB-master/jdbc/src/main/conf:/root/YCSB-master/gemfire/src/main/conf:/root/YCSB-master/core/target/core-0.1.4.jar:/root/YCSB-master/dynamodb/conf:/root/YCSB-master/nosqldb/src/main/conf:/root/YCSB-master/voldemort/src/main/conf:/root/YCSB-master/hbase/src/main/conf com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t  YCSB Client 0.1  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t  Loading workload...  Starting test.   0 sec: 0 operations;  mongo connection created with 199.192.241.168:27017/ycsb   10 sec: 11377 operations; 1130.24 current ops/sec; [UPDATE AverageLatency(us)=747864.1] [READ AverageLatency(us)=379445.04]   20 sec: 60430 operations; 4877.98 current ops/sec; [UPDATE AverageLatency(us)=86604.21] [READ AverageLatency(us)=69050.56]   30 sec: 135550 operations; 7480.58 current ops/sec; [UPDATE AverageLatency(us)=96481.73] [READ AverageLatency(us)=94490]   40 sec: 229001 operations; 9344.17 current ops/sec; [UPDATE AverageLatency(us)=68793.49] [READ AverageLatency(us)=64952.19]   50 sec: 328821 operations; 9920.49 current ops/sec; [UPDATE AverageLatency(us)=57943.78] [READ AverageLatency(us)=61490.36]   60 sec: 442787 operations; 11335.39 current ops/sec; [UPDATE AverageLatency(us)=68971.15] [READ AverageLatency(us)=60517.94]   70 sec: 555053 operations; 11201.96 current ops/sec; [UPDATE AverageLatency(us)=60592.3] [READ AverageLatency(us)=33707.96]   80 sec: 671583 operations; 11603.11 current ops/sec; [UPDATE AverageLatency(us)=74702.3] [READ AverageLatency(us)=46173.29]   ...   [OVERALL], RunTime(ms), 522700.0  [OVERALL], Throughput(ops/sec), 11478.859766596519  [UPDATE], Operations, 2999228  [UPDATE], AverageLatency(us), 85524.36139733292  [UPDATE], MinLatency(us), 394  [UPDATE], MaxLatency(us), 14435871  [UPDATE], 95thPercentileLatency(ms), 365  [UPDATE], Return=0, 2999228  [UPDATE], 0, 869530  ...  [READ], Operations, 3000772  [READ], AverageLatency(us), 29704.56337702431  [READ], MinLatency(us), 353  [READ], MaxLatency(us), 14292934  [READ], 95thPercentileLatency(ms), 115  [READ], 99thPercentileLatency(ms), 345  [READ], Return=0, 3000772  [READ], 0, 706416  ...  [CLEANUP], Operations, 1000  [CLEANUP], AverageLatency(us), 102.235  [CLEANUP], MinLatency(us), 1  [CLEANUP], MaxLatency(us), 81274  [CLEANUP], 95thPercentileLatency(ms), 0  [CLEANUP], 99thPercentileLatency(ms), 0  [CLEANUP], 0, 998  [CLEANUP], 1, 0  ...  real     8m43.573s  user   13m53.838s  sys   3m40.015s  #

So in 6 million read/write operations, there were 3,000,772 reads and 2,999,228 writes (close enough to a 50/50 split). Reads took about 29.7 msecs on average, with a minimum latency of 353 usecs and a maximum latency of about 14.3 seconds. Updates (writes) had a similar minimum and maximum, but a higher average of about 85.5 msecs.

Let's try workloadb, a 95/5 split between reads and writes (95% reads).

  # time ./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000  ...  YCSB Client 0.1  Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t  Loading workload...  Starting test.   0 sec: 0 operations;  mongo connection created with 199.192.241.168:27017/ycsb   10 sec: 39561 operations; 3937.99 current ops/sec; [UPDATE AverageLatency(us)=64134.53] [READ AverageLatency(us)=77735.02]   20 sec: 128222 operations; 8806.22 current ops/sec; [UPDATE AverageLatency(us)=33152.18] [READ AverageLatency(us)=37736.67]   30 sec: 202620 operations; 7342.87 current ops/sec; [UPDATE AverageLatency(us)=75466.98] [READ AverageLatency(us)=58739.72]   40 sec: 286881 operations; 8426.1 current ops/sec; [UPDATE AverageLatency(us)=52512.36] [READ AverageLatency(us)=43191.99]   50 sec: 381094 operations; 9420.36 current ops/sec; [UPDATE AverageLatency(us)=31639.92] [READ AverageLatency(us)=26998.68]   60 sec: 483100 operations; 10199.58 current ops/sec; [UPDATE AverageLatency(us)=33720.43] [READ AverageLatency(us)=52603.76]   70 sec: 587825 operations; 10418.32 current ops/sec; [UPDATE AverageLatency(us)=42287.56] [READ AverageLatency(us)=37719.84]  ...  [OVERALL], RunTime(ms), 559369.0  [OVERALL], Throughput(ops/sec), 10726.37203706319  [UPDATE], Operations, 299901  [UPDATE], AverageLatency(us), 44403.52669714339  [UPDATE], MinLatency(us), 379  [UPDATE], MaxLatency(us), 3324334  [UPDATE], 95thPercentileLatency(ms), 138  [UPDATE], 99thPercentileLatency(ms), 740  [UPDATE], Return=0, 299901  [UPDATE], 0, 145122  [UPDATE], 1, 14669  ...  [READ], Operations, 5700099  [READ], AverageLatency(us), 37128.780924682185  [READ], MinLatency(us), 329  [READ], MaxLatency(us), 3876008  [READ], 95thPercentileLatency(ms), 137  [READ], 99thPercentileLatency(ms), 381  [READ], Return=0, 5700099  [READ], 0, 2938426  [READ], 1, 193083  ...  [CLEANUP], Operations, 1000  [CLEANUP], AverageLatency(us), 95.518  [CLEANUP], MinLatency(us), 1  [CLEANUP], MaxLatency(us), 91746  [CLEANUP], 95thPercentileLatency(ms), 0  [CLEANUP], 99thPercentileLatency(ms), 0  [CLEANUP], 0, 999  [CLEANUP], 1, 0  [CLEANUP], 2, 0  ...  real        9m19.973s  user   15m37.622s  sys   3m28.291s  #

Here, there are 5,700,099 reads and 299,901 updates, roughly the 95/5 split that workloadb specifies. The remaining workloads yield similar results.

Now we'll run the benchmark and begin to do some active benchmarking. For starters, let's use prstat(1M) on the system where mongo is running.

  # prstat -c  Please wait...     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP   24426 mongodb    70G  244M cpu21   60    0   3:06:37  12% mongod/2014   48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1   42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1   48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1   61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6   41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1   59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1   61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1   61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1   59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1   41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1   61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1   61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3   59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13   61378 root     4340K 1496K sleep   59    0   0:00:10 0.0% sshd/1  Total: 23 processes, 2100 lwps, load averages: 12.39, 7.57, 3.78     PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP   24426 mongodb    70G  242M cpu22   52    0   3:06:49  12% mongod/2014   76885 root     4064K 3276K cpu6    59    0   0:00:00 0.0% prstat/1   48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1   42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1   48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1   61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6   41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1   59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1   61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1   61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1   59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1   41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1   61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1   61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3   59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13  Total: 23 processes, 2100 lwps, load averages: 12.02, 7.57, 3.80  (^c)  #

So mongod is using ~12% of the cpus on the machine. This machine has 24 cpus, so this is approximately 2.8 cpus at 100% utilization.

  # psrinfo
  0       on-line   since 11/22/2013 03:58:51
  1       on-line   since 11/22/2013 03:58:58
  2       on-line   since 11/22/2013 03:58:58
  3       on-line   since 11/22/2013 03:58:58
  4       on-line   since 11/22/2013 03:58:58
  5       on-line   since 11/22/2013 03:58:58
  6       on-line   since 11/22/2013 03:58:58
  7       on-line   since 11/22/2013 03:58:59
  8       on-line   since 11/22/2013 03:58:59
  9       on-line   since 11/22/2013 03:58:59
  10      on-line   since 11/22/2013 03:58:59
  11      on-line   since 11/22/2013 03:58:59
  12      on-line   since 11/22/2013 03:58:59
  13      on-line   since 11/22/2013 03:58:59
  14      on-line   since 11/22/2013 03:58:59
  15      on-line   since 11/22/2013 03:58:59
  16      on-line   since 11/22/2013 03:58:59
  17      on-line   since 11/22/2013 03:58:59
  18      on-line   since 11/22/2013 03:58:59
  19      on-line   since 11/22/2013 03:58:59
  20      on-line   since 11/22/2013 03:58:59
  21      on-line   since 11/22/2013 03:58:59
  22      on-line   since 11/22/2013 03:58:59
  23      on-line   since 11/22/2013 03:58:59
  #

Let's take a quick look at other statistics (vmstat, vfsstat, mpstat, and iostat) on the system running mongo.

  # vmstat 1
   kthr      memory            page            disk          faults      cpu
   r b w   swap  free  re  mf pi po fr de sr rm s0 s1 zn   in   sy   cs us sy id
   0 0 118 87846368 6706592 4408 50302 5 57 58 0 29 -5789 -1 74 74 18565 68963 14973 10 9 81
   0 0 175 81510632 9860540 8639 97242 0 0 0 0 0 0 0 31 31 28396 129491 48858 13 9 78
   0 0 175 81454632 9853364 8364 94341 0 0 0 0 0 0 0 18 18 31016 125756 47106 14 9 77
   0 0 175 81518144 9862564 3537 23890 0 0 0 0 0 20 0 188 188 28130 139412 50791 17 6 77
   0 0 175 81517996 9862384 840 3519 0 0 0 0 0 0 0 25 25 25988 132854 56871 11 4 85
   0 0 175 81518020 9862768 3545 22590 0 0 0 0 0 0 0 15 15 24562 131160 51082 7 5 88
   0 0 175 81508880 9860908 5840 79660 0 0 0 0 0 0 0 30 30 29503 124580 47687 12 8 80
  (^C)
  #

The system as a whole is ~80% idle. There is no memory pressure on the system, but lots of page faults. We may come back to this later.
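
If we want to chase those minor faults, one quick way (a sketch; it assumes the faults are mostly mongod touching its memory-mapped data files, and uses the vminfo provider's as_fault probe) is to count address-space faults by user stack:

  # dtrace -n 'vminfo:::as_fault /execname == "mongod"/ {@[ustack(5)] = count();} tick-30s {exit(0);}'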

Here's vfsstat:

  # vfsstat 1
    r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
   76.1  16.5 243.5  55.4   0.0   0.0    2.8   28.6   0   0   0.2   12.4 ded0419b (239)
   43.7   0.0   4.9   0.0   0.0   0.0    2.9    0.0   0   0   0.0    0.0 ded0419b (239)
   43.7   0.0   4.9   0.0   0.0   0.0    3.0    0.0   0   0   0.0    0.0 ded0419b (239)
   43.5   5.8   4.9 417.6   0.0   0.0    3.0  128.1   0   0   0.0    0.0 ded0419b (239)
   43.3   9.6   4.8 639.6   0.0   0.0    2.9  117.1   0   0   0.0    0.0 ded0419b (239)
  (^C)
  #

So, not much file system activity in the zone where mongod is running. The zone is not being I/O throttled (the 'd/s' and 'del_t' columns in the output). This is to be expected. Unlike many (most?) databases, MongoDB uses memory-mapped files instead of read/write for database data. It periodically syncs the files to backing store.
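
We can confirm the memory-mapped data files by looking at mongod's address space. A quick check (a sketch; it assumes the data files for the ycsb database are named ycsb.0, ycsb.1, and so on, the usual mmapv1 naming convention):

  # pmap -x `pgrep mongod` | grep ycsb

The large mappings that show up here also help explain why prstat shows a SIZE of ~70G for mongod while its RSS is only a few hundred megabytes.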

Now mpstat:

  # mpstat 1  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys wt idl  ...  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl    0 20718   0    4  2637  117 4506  190  394  245    0  8766   24  42   0  35    1 3787   0   44   512   54 5167  114  276  144    0 10722   18  14   0  68    2 13064   0   11   672   39 8919  364  659  206    0 17084   21  24   0  55    3 1338   0   46   897  186 7578  270  378  206    0 13350   13  13   0  74    4 5740   0   11   617   49 6637  281  467  155    0 10217   13  15   0  72    5  134   0   21   427   56 4342   91  238   82    0  5418    6   7   0  87    6 3418   0   11   780  100 8558  326  553  171    0 13850   15  16   0  69    7  425   0   23   679   37 8544  248  373  204    0 15762   12  11   0  77    8 1619   0    1   710   52 8241  341  430  115    0 13678   18  10   0  71    9   51   0   25   458   41 5290  197  262  176    0  7459    9   9   0  82   10 2476   0    4   748   66 9833  458  607  165    0 18239   18  13   0  68   11 1378   0    1   830   75 9630  546  380  147    0 13152   13  12   0  75   12 4315   0    1   620   65 5900  127  479  133    0 11299   20  15   0  66   13 1903   0    8   472  105 5247   91  309  191    0 10625   15  13   0  72   14 2016   0    5   390   30 5982  169  376  139    0 10792   18  16   0  66   15 6371   0   53   630  271 3790   50  202  190    0  7473   10  13   0  77   16 4501   0    5 15233 14632 4090  132  365  413    0  8432   32  37   0  31   17 2962   0    9  7298 6753 3446   83  234  329    0 13258   47  20   0  33   18 4545   0  494   402   26 4010   94  260  163    0  7613   34  21   0  45   19    2   0  645   440  114 4668   47  219  113    0  8205   12   6   0  82   20 1029   0   52   840  159 9325  278  606  117    0 18382   25  13   0  62   21 1120   0 1431   348   71 6572   94  325  140    0 15726   22  12   0  67   22 5110   0    4   886  177 6302  162  377  120    0 12207   21  13   0  66   23 2075   0 1246   304   45 3236   75  193   99    0  7788   15   9   0  76  CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl    0 1702   0   24  2553  140 8653  172  516  152    0 15491   14  13   0  73    1  520   0  624   475   79 6676   99  317   79    0 12575   14   8   0  78    2 2752   0   71   517   36 7173  151  465  103    0 14320   17  20   0  63    3  153   0   67   384   34 5310  181  261  141    0  8308   11   7   0  82    4  107   0   17   547   42 6262  218  368   97    0  8531   10   8   0  82    5   11   0   25   410   62 3691   78  210   64    0  4769    4   4   0  92    6 2216   0   38   774   97 7905  297  453  149    0 13275   12  11   0  77    7    7   0   17   561   52 7084  216  322   97    0 11671   14   8   0  78    8   18   0   58   508   31 5952  200  355   76    0  9057    9   7   0  84    9    6   0   55   396   57 5069  129  219  104    0  6318    5   5   0  90   10   42   0   79   644   65 8559  259  501  124    0 14886   12   9   0  79   11  385   0   87   756  102 9742  472  350   85    0 12346   10   9   0  81   12 1541   0   62   477   74 7014  116  543   92    0 14726   15  10   0  75   13  944   0   18   356  103 5551   48  329   96    0  9791    8   7   0  85   14    2   0   14   316   60 5367   74  347   62    0  9730   12  14   0  74   15  905   0    8   425   41 2503   67  128   43    0  5236   42   4   0  54   16    8   0   21 14268 13527 5552  135  413  294    0 11511   20  30   0  50   17    5   0   48  6972 6357 5826   93  304  337    0 11184   23  18   0  59   18    8   0  509   
314   51 5479   81  299   60    0 10575   18  10   0  72   19    7   0  804   564  202 4123   31  194   65    0 13360    7   6   0  88   20    2   0   75   758  157 6378  152  356   64    0 11863   26  23   0  52   21 2287   0  453   339   56 5706   69  256   69    0 11302   21   8   0  70   22    3   0   64   867  172 5633   76  286   88    0 10716   11  18   0  71   23   63   0  302   305   74 4025   47  201   46    0  7674    8   4   0  87  #

The machine is busy, but over 50% idle. Nothing unusual here.

What system calls are being made by mongo?

  # dtrace -n 'syscall:::entry/pid == $target/{@[probefunc]=count();} tick-60s{exit(0);}' -p `pgrep mongod`
  dtrace: description 'syscall:::entry' matched 235 probes
  CPU     ID                    FUNCTION:NAME
   16  76285                        :tick-60s

    getpid                                                            1
    pwrite                                                            1
    ioctl                                                             2
    read                                                              2
    close                                                             3
    open                                                              3
    nanosleep                                                        12
    mmap                                                             30
    memcntl                                                          47
    mincore                                                          78
    fdsync                                                          527
    getcwd                                                          527
    lseek                                                           529
    fstat                                                           531
    lwp_mutex_timedlock                                             998
    lwp_mutex_wakeup                                               1137
    yield                                                          4617
    pollsys                                                        5964
    write                                                          6124
    gtime                                                         28018
    send                                                         477650
    recv                                                         956294
    lwp_park                                                    3121961
    clock_gettime                                               3590578
  #

The (undocumented) lwp_park() system call is used for user-level synchronization, and is called when a user-level thread needs to block for a user-level mutex, reader/writer lock, or condition variable. The clock_gettime system call is made frequently, but should not consume too much time. The send/recv calls are interesting. The recv(2) calls are made to retrieve requests from clients, and the send(2) calls are made to return responses to clients.
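
Counts alone don't show where the time goes. A natural follow-up, along the same lines as the one-liner above, is to sum the time spent in each system call over the same interval (time spent blocked in a call is included, so lwp_park, recv, and pollsys can be expected to dominate):

  # dtrace -n 'syscall:::entry /pid == $target/ {self->ts = timestamp;}
      syscall:::return /self->ts/ {@[probefunc] = sum(timestamp - self->ts); self->ts = 0;}
      tick-60s {exit(0);}' -p `pgrep mongod`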

Snooping the network, saving the snoop output to a file, and then running Wireshark on the captured output shows that the recv/send calls are probably all using the "Mongo Wire Protocol". Searching the internet for "Mongo Wire Protocol" yields http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/. From this document, we'll implement a DTrace script to watch packets being sent and received. The following script snoops packets and displays the message headers being sent/received.

  #!/usr/sbin/dtrace -s
  /* assumes packets to/from mongodb */
  #pragma D option quiet

  /* from http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/ */
  struct MsgHeader {
      int32_t messageLength; /* total message size, including this */
      int32_t requestID;     /* identifier for this message */
      int32_t responseTo;    /* requestID from the original request */
                             /*   (used in responses from db) */
      int32_t opCode;        /* request type - see table below */
  };

  BEGIN
  {
    printf("%10s: %-10s %-10s %-10s %-10s\n",
      "Direction", "Length", "ReqID", "ResTo", "opCode");
  }

  /*
   *  When send/recv(2) are called, assume packet is in "mongo wire protocol"
   *  (see
   *    http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
   */
  syscall::send:entry
  /pid == $target/
  {
    self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
    self->mL = self->smh->messageLength;
    self->reqID = self->smh->requestID;
    self->resTo = self->smh->responseTo;
    self->opCode = self->smh->opCode;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0/
  {
    printf("%10s: %-10d %-10d %-10d %-10x\n",
      probefunc, self->mL, self->reqID, self->resTo, self->opCode);
    self->smh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
  }

  syscall::recv:entry
  /pid == $target/
  {
    self->a1 = arg1;  /* save user address of packet */
  }

  syscall::recv:return  /* now the packet is there. */
  /self->a1/
  {
    self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
    self->mL = self->rmh->messageLength;
    self->reqID = self->rmh->requestID;
    self->resTo = self->rmh->responseTo;
    self->opCode = self->rmh->opCode;
  }

  syscall::recv:return
  /self->a1 && self->mL > 0/
  {
    printf("%10s: %-10d %-10d %-10d %-10d\n",
      probefunc, self->mL, self->reqID, self->resTo, self->opCode);
    self->a1 = 0;
    self->rmh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
  }

Running the above script while the ycsb benchmark is running shows:

  # ./mongo_snoop.d -p `pgrep mongod`
   Direction: Length     ReqID      ResTo      opCode
        recv: 61         962617     0          2004
        recv: 61         961087     0          2004
        recv: 61         961883     0          2004
        send: 110        480867     962617     1
        send: 110        480868     961087     1
        recv: 61         962405     0          2004
        send: 110        480869     961883     1
        recv: 61         961498     0          2004
        send: 110        480870     962405     1
        send: 110        480871     961498     1
        recv: 61         962013     0          2004
        recv: 61         962198     0          2004
        send: 110        480872     962013     1
        send: 110        480873     962198     1
        recv: 61         961709     0          2004
        recv: 1190       962736     0          2002
        recv: 1190       962735     0          2002
        recv: 61         961907     0          2004
        send: 110        480874     961709     1
        send: 110        480875     961907     1
        recv: 1190       962739     0          2002
        recv: 61         961771     0          2004
  ...

In the above output, opCode 1 is OP_REPLY, a reply sent to a client request. The ResTo field matches the ReqID that was received from the client. opCode 2004 is OP_QUERY, and 2002 is OP_INSERT. The benchmark is currently loading the database.
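
As a small convenience (not part of the original script), the printf() in the send and recv print clauses can translate the interesting opcodes from the wire protocol document into names, along these lines:

  /* in mongo_snoop.d, print the opcode by name instead of by number */
  printf("%10s: %-10d %-10d %-10d %-10s\n",
      probefunc, self->mL, self->reqID, self->resTo,
      self->opCode == 1    ? "OP_REPLY"  :
      self->opCode == 2001 ? "OP_UPDATE" :
      self->opCode == 2002 ? "OP_INSERT" :
      self->opCode == 2004 ? "OP_QUERY"  : "other");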

Here is a script that determines the latency between a request being received from a client and the response to that request.

  #!/usr/sbin/dtrace -s
  /* assumes packets to/from mongodb */
  #pragma D option quiet

  struct MsgHeader {
      int32_t messageLength; /* total message size, including this */
      int32_t requestID;     /* identifier for this message */
      int32_t responseTo;    /* requestID from the original request */
                             /*   (used in responses from db) */
      int32_t opCode;        /* request type - see table below */
  };

  BEGIN
  {
    printf("wait 1 minute...");
  }

  /*
   *  When send/recv(2) are called, assume packet is in "mongo wire protocol"
   *  (see
   *    http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
   */
  syscall::recv:entry
  /pid == $target/
  {
    self->a1 = arg1;  /* save user address of packet */
  }

  syscall::recv:return  /* now the packet is there. */
  /self->a1 && arg1 > 0/  /* arg1 = number of bytes received */
  {
    self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
    self->mL = self->rmh->messageLength;
    self->reqID = self->rmh->requestID;
  }

  syscall::recv:return
  /self->a1 && self->mL > 0/
  {
    t[self->reqID] = timestamp;
    self->a1 = 0;
    self->rmh = 0;
    self->mL = 0;
    self->reqID = 0;
  }

  syscall::send:entry
  /pid == $target/
  {
    self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
    self->mL = self->smh->messageLength;
    self->resTO = self->smh->responseTo;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0 && t[self->resTO] != 0/
  {
    @d["distribution in nsecs"] = quantize(timestamp - t[self->resTO]);
    @a["avg"] = avg(timestamp - t[self->resTO]);
    @t["total requests handled"] = count();
    t[self->resTO] = 0;
    self->smh = 0;
    self->mL = 0;
    self->resTO = 0;
  }

  syscall::send:entry
  /pid == $target && self->mL > 0/
  {
    self->smh = 0;
    self->mL = 0;
    self->resTO = 0;
  }

  tick-60s
  {
    exit(0);
  }

This script runs for one minute and shows the distribution of times taken to handle database requests. Running the script shows:

  # ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
  wait 1 minute...
    distribution in nsecs
             value  ------------- Distribution ------------- count
             16384 |                                         0
             32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              304521
             65536 |@@@@@@@@@@@@                             133288
            131072 |@                                        5643
            262144 |                                         271
            524288 |                                         6
           1048576 |                                         0

    avg                                                           65688
    total requests handled                                       443729
  #

So requests take between 32 and 524 microseconds. Further information can be retrieved from other data in the requests and replies, but is left as an exercise. The setting of dynvarsize is to get rid of "dynamic variable drops" messages.

Here is output of the above script while the benchmark is running the database (as opposed to loading it).

  # ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
  wait 1 minute...
    distribution in nsecs
             value  ------------- Distribution ------------- count
             16384 |                                         0
             32768 |@@@@@@@@@@@@@@                           181524
             65536 |@@@@@@@@@@@@                             155129
            131072 |@@@@@                                    64382
            262144 |@@@                                      39143
            524288 |@@@                                      38864
           1048576 |@@                                       25175
           2097152 |                                         5632
           4194304 |                                         1714
           8388608 |                                         451
          16777216 |                                         49
          33554432 |                                         0
          67108864 |                                         0
         134217728 |                                         100
         268435456 |                                         369
         536870912 |                                         0

    avg                                                          548253
    total requests handled                                       512532
  #

Most of the requests take between 32 and 838 microseconds, but some take between 134 and 536 milliseconds. The average latency is much higher when running the database than when loading it. Another thing to note is that when the database is loading, there are many OP_INSERT requests from the YCSB client. When it is running as opposed to loading, the client only sends OP_QUERY requests (seen by running the mongo_snoop.d script from earlier). Most likely, the longer requests are updates as opposed to lookups, but that has not been verified (and is left to the reader as an exercise).
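
One way to probe that hypothesis without decoding message bodies (a sketch only; it has not been run here) relies on the fact that an update request has to carry the ~1KB of field data while a read request is tiny (61 bytes versus 1190 bytes in the snoop output above). mongo_time.d could remember each request's size and split the latency distribution accordingly:

  /* in the second recv:return clause, before the clause variables are cleared: */
  len[self->reqID] = self->mL;
  t[self->reqID] = timestamp;

  /* in the send:entry clause that matches a reply, key the distribution by request
   * size; small requests are likely reads, large ones likely carry update data
   * (clear len[] the same way t[] is cleared) */
  @d[len[self->resTO] > 512 ? "large request (likely update)"
                            : "small request (likely read)"] =
      quantize(timestamp - t[self->resTO]);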

As an experiment, I ran both ycsb and mongo on the same machine, and saw much higher ops/second numbers (over 30,000). Running two ycsb clients on different machines did not yield a linear improvement (only around a 20% increase in total ops/second). More experimentation is needed here. It would be interesting to see the effects of multiple clients on the virtual machines actually used for the published results, but I have not tried this.

What happens if we increase the size of the database? Instead of a 6GB database, what if it were larger than the memory size of the virtual machine? The virtual machine has 8GB, and trying a 10GB database introduced some problems.

Let's go back to prstat and watch the mongod process specifically, but using the larger database.

  # prstat -cmLp `pgrep mongod`     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID   24426 mongodb  2.0 0.8 0.0 0.0 0.0 2.1  95 0.3  2K 109 11K   0 mongod/22159   24426 mongodb  1.7 0.7 0.0 0.0 0.0 1.6  96 0.2  2K  32 10K   0 mongod/22115   24426 mongodb  1.6 0.6 0.0 0.0 0.0 1.7  96 0.2  1K  68  9K   0 mongod/22296   24426 mongodb  1.4 0.5 0.0 0.0 0.0 1.3  97 0.2  1K  61  7K   0 mongod/21496   24426 mongodb  1.3 0.5 0.0 0.0 0.0 2.0  96 0.2  1K  55  7K   0 mongod/22240   24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  59  6K   0 mongod/21649   24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  61  6K   0 mongod/22153   24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  52  6K   0 mongod/22293   24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.0  97 0.1  1K  57  6K   0 mongod/22042   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  43  5K   0 mongod/21461   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.5  97 0.1  1K  36  5K   0 mongod/22175   24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.9  98 0.1  1K  17  6K   0 mongod/22073   24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.6  98 0.1  1K  25  5K   0 mongod/22105   24426 mongodb  1.0 0.3 0.0 0.0 0.0 1.3  97 0.1 986  45  5K   0 mongod/22273   24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.7  98 0.1 964  13  5K   0 mongod/21529  Total: 1 processes, 982 lwps, load averages: 0.39, 0.86, 3.93     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID   24426 mongodb  1.2 0.4 0.0 0.0 0.0 0.8  97 0.1  1K  26  6K   0 mongod/22041   24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.8  98 0.1  1K  18  5K   0 mongod/22040   24426 mongodb  0.9 0.3 0.0 0.0 0.0 0.7  98 0.1 923  26  4K   0 mongod/21844   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 860  17  4K   0 mongod/22089   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 907  32  4K   0 mongod/22175   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.6  98 0.1 869  18  4K   0 mongod/22121   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.5  98 0.1 837  22  4K   0 mongod/22028   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 802  38  4K   0 mongod/22151   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 858  18  4K   0 mongod/21407   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 898  11  4K   0 mongod/22013   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 798  45  4K   0 mongod/22230   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.8  98 0.1 785  35  4K   0 mongod/22153   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 870  40  4K   0 mongod/22248   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 784  43  4K   0 mongod/22245   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.0  98 0.1 872  28  4K   0 mongod/22195  Total: 1 processes, 982 lwps, load averages: 0.46, 0.87, 3.91  (^C)  #

The mongod process has, at the time this was run, 982 threads. Most of these threads are spending ~98% of their time sleeping. There is a little contention for cpus (the LAT column is non-zero), and some time is spent waiting on locks (LCK is < 1%). But letting this run for a longer time shows that while most threads spend most of their time sleeping, thread IDs 3 and 5 periodically spend time locking. Here is output from prstat when threads 3 and 5 are busy.

     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID   24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.9  96 0.3  1K  53  5K   0 mongod/22503   24426 mongodb  0.9 0.4 0.0 0.0 0.0 1.8  97 0.2  1K  94  5K   0 mongod/22852   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 936  24  4K   0 mongod/22540   24426 mongodb  0.7 0.4 0.0 0.0 0.0  99 0.3 0.0 273  11 975   0 mongod/5   24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.1  98 0.1 881  56  4K   0 mongod/22700   24426 mongodb  0.8 0.3 0.0 0.0 0.0 2.0  97 0.2  1K  64  4K   0 mongod/22657   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.7  97 0.1 917  19  4K   0 mongod/22952   24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 793  13  4K   0 mongod/22444   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.2  98 0.1 924  31  3K   0 mongod/23237   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.1 757  39  3K   0 mongod/22353   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 838  13  3K   0 mongod/22973   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.2 789  51  3K   0 mongod/22821   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.6  97 0.1 876  18  3K   0 mongod/23127   24426 mongodb  0.7 0.3 0.0 0.0 0.0 0.9  98 0.1 740  20  3K   0 mongod/22590   24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 757  26  3K   0 mongod/22917  Total: 1 processes, 1013 lwps, load averages: 1.14, 0.39, 0.18     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID   24426 mongodb  0.0  38 0.0 0.0 0.0  61 0.0 0.0   8  81  15   0 mongod/3   24426 mongodb  1.6 0.7 0.0 0.0 0.0 5.2  92 0.2  2K  47 11K   0 mongod/23059   24426 mongodb  1.3 0.5 0.0 0.0 0.0 6.5  91 0.2  1K  84  8K   0 mongod/22978   24426 mongodb  1.1 0.4 0.0 0.0 0.0 6.9  91 0.2  1K 103  6K   0 mongod/23061   24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.0  92 0.1  1K  43  6K   0 mongod/23182   24426 mongodb  1.0 0.4 0.0 0.0 0.0 4.6  94 0.1  1K  27  6K   0 mongod/22815   24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.4  92 0.2  1K 122  6K   0 mongod/22393   24426 mongodb  1.0 0.4 0.0 0.0 0.0 5.7  93 0.1  1K  31  6K   0 mongod/22391   24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  61  6K   0 mongod/22802   24426 mongodb  1.0 0.3 0.0 0.0 0.0 6.4  92 0.1  1K  85  6K   0 mongod/23270   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.3  92 0.1  1K 109  6K   0 mongod/22871   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.6  92 0.1  1K  85  5K   0 mongod/22916   24426 mongodb  0.9 0.4 0.0 0.0 0.0 6.1  93 0.1  1K  26  6K   0 mongod/22517   24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  59  6K   0 mongod/22864   24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.4  92 0.2  1K  97  6K   0 mongod/22459  Total: 1 processes, 1013 lwps, load averages: 1.20, 0.42, 0.19     PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID   24426 mongodb  0.0  59 0.0 0.0 0.0  40 0.1 0.0  19 230  18   0 mongod/3   24426 mongodb  0.8 0.3 0.0 0.0 0.0  28  71 0.1 936  39  4K   0 mongod/22526   24426 mongodb  0.8 0.3 0.0 0.0 0.0  24  75 0.1 889  22  4K   0 mongod/23089   24426 mongodb  0.7 0.3 0.0 0.0 0.0  25  74 0.1 809  13  4K   0 mongod/22838   24426 mongodb  0.7 0.3 0.0 0.0 0.0  27  72 0.1 946  23  4K   0 mongod/23077   24426 mongodb  0.7 0.3 0.0 0.0 0.0  24  75 0.1 859  22  3K   0 mongod/23274   24426 mongodb  0.6 0.3 0.0 0.0 0.0  29  70 0.1 933  40  4K   0 mongod/22814   24426 mongodb  0.6 0.4 0.0 0.0 0.0  77  22 0.0 267  15 800   0 mongod/5   24426 mongodb  0.6 0.2 0.0 0.0 0.0  29  70 0.1 645  32  3K   0 mongod/22790   24426 mongodb  0.6 0.2 0.0 0.0 0.0  27  72 0.2 701  11  3K   0 mongod/22600   24426 mongodb  0.6 0.2 0.0 0.0 0.0  28  71 0.1 832  28  3K   0 mongod/22413   
24426 mongodb  0.6 0.2 0.0 0.0 0.0  25  74 0.1 560   5  3K   0 mongod/22415   24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.2 651  28  3K   0 mongod/23035   24426 mongodb  0.5 0.2 0.0 0.0 0.0  27  72 0.1 544  38  2K   0 mongod/22738   24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.1 594  22  3K   0 mongod/23273  Total: 1 processes, 1013 lwps, load averages: 1.23, 0.44, 0.20
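
As an aside, a quick way to spot-check what an individual thread is doing at a given moment is pstack(1), which accepts a pid/lwpid pair, e.g. for thread 3:

  # pstack `pgrep mongod`/3

A single snapshot can easily miss periodic activity, though, so DTrace is a better fit here.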

Let's see what thread 3 is doing. When it runs, it spends a lot of time in system mode (38 then 59 percent in the above output). We'll use DTrace to profile mongod to see what it is doing while in system mode.

  # dtrace -n 'profile-97/pid == $target && tid == 3 && arg0/{@[stack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod` -o /var/tmp/mongod.tid3.out
  #

And I'll download the output file and convert it to a flamegraph (see https://github.com/davepacheco/node-stackvis).

  # stackvis < mongod.tid3.out > mongod.tid3.htm
  #

Opening the flamegraph shows that thread id 3 is spending almost all of its time in the memcntl(2) system call. Let's take a look at how memcntl(2) gets called.

  # dtrace -n 'syscall::memcntl:entry/pid == $target && tid == 3/{@[ustack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod`
  dtrace: description 'syscall::memcntl:entry' matched 2 probes
  CPU     ID                    FUNCTION:NAME
   13  76285                        :tick-60s

                libc.so.1`memcntl+0xa
                libc.so.1`msync+0xc9
                mongod`_ZN5mongo14PosixFlushable5flushEv+0x3e
                mongod`_ZN5mongo9MongoFile9_flushAllEb+0xfa
                mongod`_ZN5mongo9MongoFile8flushAllEb+0x2b
                mongod`_ZN5mongo12DataFileSync3runEv+0xf3
                mongod`_ZN5mongo13BackgroundJob7jobBodyEv+0xe6
                libboost_thread.so.1.55.0`thread_proxy+0x66
                libc.so.1`_thrp_setup+0x8a
                libc.so.1`_lwp_start
                 19
  #

Taking a look at the log file for mongod (/var/log/mongodb/mongod.log), we see many occasions when mongod took a long time to flush the data files.

  # grep -i flush mongod.log
  2014-06-23T16:12:40.792+0000 [DataFileSync] flushing mmaps took 20970ms  for 33 files
  2014-06-23T18:57:32.343+0000 [DataFileSync] flushing mmaps took 12508ms  for 33 files
  2014-07-19T17:51:37.175+0000 [DataFileSync] flushing mmaps took 11880ms  for 33 files
  ...
  #

Examining the source code for MongoDB shows that the flush uses the msync(3C) library function to do the syncing (see src/mongo/util/mmap_posix.cpp for the code). The msync call in turn calls the memcntl(2) system call. And the larger the database, the longer the flush (i.e., msync) takes.
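
To correlate the log messages with what we see from DTrace, we can also quantize how long each memcntl(2) call (i.e., each msync) takes in mongod; a sketch, run long enough to span at least one of the periodic flushes:

  # dtrace -n 'syscall::memcntl:entry /pid == $target/ {self->ts = timestamp;}
      syscall::memcntl:return /self->ts/ {@["memcntl (msync) nsecs"] = quantize(timestamp - self->ts); self->ts = 0;}
      tick-120s {exit(0);}' -p `pgrep mongod`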

The msync() function takes three arguments: a starting address, a length, and flags. On SmartOS, the kernel does a page lookup for each page in the range from the starting address to the starting address plus the length. For the database being loaded and run here, with 6 million 1KB records (i.e., a 6GB database), that is 6GB/4KB, or roughly 1.5 million, page lookups on every flush.

As the database gets larger, this could affect performance. Fortunately, in the case of MongoDB there is another, much faster way to sync the database to backing store: fsync(3C). While msync flushes a range of modified pages to disk, fsync does the same for the modified pages of a file. In the case of MongoDB, the end result of using msync versus fsync is equivalent. Each msync call is for a 2GB area of memory that is mmap'd to a 2GB file. Instead of calling msync with the range of memory, fsync can be called with the file descriptor for the mmap'd file. In the case of the fsync call, there is no need to do a page lookup; the pages are hanging off of the vnode_t for the file, which can be quickly retrieved via the file descriptor. This gets rid of the long flush times.

Another, more difficult problem is that by going over the 8GB virtual machine size, the memory capper for the zone kicks in. Performance at this point was two orders of magnitude worse than when the machine was below the memory cap. Basically, the benchmark did not scale well with respect to the size of the database. This led my colleague, Jerry Jelinek, to take a look at the memory capper. There were two changes made to the memory capper: it no longer stops processes while they are being paged out, and it resumes where it left off.
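
The capping activity itself can be watched while the benchmark runs. On SmartOS, per-zone memory_cap kstats track the zone's RSS against its cap and count page-outs; a sketch (the exact statistic names may vary across platform images):

  # kstat -p -m memory_cap | egrep 'rss|physcap|nover|pagedout'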

These changes have made a vast difference. We no longer see performance drop precipitously with respect to database size. Much more analysis can be done from this benchmark, but maybe in another blog post (and maybe with a different database).



Post written by Max Bruning