Active Benchmarking in the Cloud
Recently, Joyent published a series of benchmarks using YCSB (the Yahoo! Cloud Serving Benchmark) comparing the performance of MongoDB, PostgreSQL, and Elasticsearch between machines running in the Joyent Public Cloud and similarly configured machines running on Amazon Web Services.
When I looked at the results, my first impression was that the numbers look good, but how can I tell what is really being benchmarked? To do this, I needed to do "active benchmarking": analyzing the performance of the system while it is being tested. I decided to look specifically at the YCSB benchmark run against MongoDB.
Rather than use shards, a router, and the entire setup used in the published results, I used a single machine in the Joyent cloud running the YCSB benchmark against another machine in the Joyent cloud running MongoDB. The machine running YCSB is a standard 4GB instance, and the machine running MongoDB is a standard 7.5GB system.
In this blog, we'll first simply run the benchmark and watch the results. Then we'll do it again, but do active benchmarking while the benchmark is running. We'll do this to try to determine the limiting factors (i.e., what is the bottleneck for the test results). We'll also take a look at scaling issues.
First, we'll run the benchmark to load 6 million records, each record consisting of 10 fields of 100 bytes each, creating 1KB of data per record.
# time ./bin/ycsb load mongodb -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p recordcount=6000000 -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -load
Loading workload...
Starting test.
 0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
 10 sec: 17563 operations; 1753.49 current ops/sec; [INSERT AverageLatency(us)=302209.69]
 20 sec: 73255 operations; 5525.55 current ops/sec; [INSERT AverageLatency(us)=48674.83]
 30 sec: 147128 operations; 7385.82 current ops/sec; [INSERT AverageLatency(us)=106637.21]
 40 sec: 228286 operations; 8114.18 current ops/sec; [INSERT AverageLatency(us)=117745.16]
...
 640 sec: 4993046 operations; 8210.88 current ops/sec; [INSERT AverageLatency(us)=123054.32]
 650 sec: 5074073 operations; 8102.7 current ops/sec; [INSERT AverageLatency(us)=120990.57]
 660 sec: 5150469 operations; 7638.84 current ops/sec; [INSERT AverageLatency(us)=127621.01] [CLEANUP AverageLatency(us)=26]
 670 sec: 5223843 operations; 7336.67 current ops/sec; [INSERT AverageLatency(us)=144175.32] [CLEANUP AverageLatency(us)=6.33]
 680 sec: 5297258 operations; 7340.77 current ops/sec; [INSERT AverageLatency(us)=131909.57] [CLEANUP AverageLatency(us)=4.67]
 690 sec: 5376306 operations; 7904.01 current ops/sec; [INSERT AverageLatency(us)=124980.39] [CLEANUP AverageLatency(us)=5.62]
...
[OVERALL], RunTime(ms), 771609.0
[OVERALL], Throughput(ops/sec), 7775.9590673514695
[INSERT], Operations, 6000000
[INSERT], AverageLatency(us), 122792.53285266667
[INSERT], MinLatency(us), 401
[INSERT], MaxLatency(us), 7160628
[INSERT], Return=0, 6000000
[INSERT], 0, 4387109
[INSERT], 1, 435791
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 70.878
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 66032
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 999
[CLEANUP], 1, 0
[CLEANUP], 2, 0
...

real    12m52.267s
user    13m33.122s
sys     3m24.784s
#
So loading 6,000,000 records took 12 minutes, 52 seconds.
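As a quick sanity check, the reported throughput is just the record count divided by the run time (numbers taken from the YCSB summary above):

```python
# Sanity-check YCSB's reported load throughput from its own summary lines.
records = 6_000_000
runtime_ms = 771_609.0            # [OVERALL], RunTime(ms)

throughput = records / (runtime_ms / 1000.0)
print(round(throughput, 2))       # 7775.96, matching the reported value
```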
Now we'll run workloada (a mixture of 50/50 reads/writes).
# time ./bin/ycsb run mongodb -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
java -cp /root/YCSB-master/infinispan/src/main/conf:/root/YCSB-master/mongodb/target/mongodb-binding-0.1.4.jar:/root/YCSB-master/mongodb/target/archive-tmp/mongodb-binding-0.1.4.jar:/root/YCSB-master/jdbc/src/main/conf:/root/YCSB-master/gemfire/src/main/conf:/root/YCSB-master/core/target/core-0.1.4.jar:/root/YCSB-master/dynamodb/conf:/root/YCSB-master/nosqldb/src/main/conf:/root/YCSB-master/voldemort/src/main/conf:/root/YCSB-master/hbase/src/main/conf com.yahoo.ycsb.Client -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloada -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
Loading workload...
Starting test.
 0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
 10 sec: 11377 operations; 1130.24 current ops/sec; [UPDATE AverageLatency(us)=747864.1] [READ AverageLatency(us)=379445.04]
 20 sec: 60430 operations; 4877.98 current ops/sec; [UPDATE AverageLatency(us)=86604.21] [READ AverageLatency(us)=69050.56]
 30 sec: 135550 operations; 7480.58 current ops/sec; [UPDATE AverageLatency(us)=96481.73] [READ AverageLatency(us)=94490]
 40 sec: 229001 operations; 9344.17 current ops/sec; [UPDATE AverageLatency(us)=68793.49] [READ AverageLatency(us)=64952.19]
 50 sec: 328821 operations; 9920.49 current ops/sec; [UPDATE AverageLatency(us)=57943.78] [READ AverageLatency(us)=61490.36]
 60 sec: 442787 operations; 11335.39 current ops/sec; [UPDATE AverageLatency(us)=68971.15] [READ AverageLatency(us)=60517.94]
 70 sec: 555053 operations; 11201.96 current ops/sec; [UPDATE AverageLatency(us)=60592.3] [READ AverageLatency(us)=33707.96]
 80 sec: 671583 operations; 11603.11 current ops/sec; [UPDATE AverageLatency(us)=74702.3] [READ AverageLatency(us)=46173.29]
...
[OVERALL], RunTime(ms), 522700.0
[OVERALL], Throughput(ops/sec), 11478.859766596519
[UPDATE], Operations, 2999228
[UPDATE], AverageLatency(us), 85524.36139733292
[UPDATE], MinLatency(us), 394
[UPDATE], MaxLatency(us), 14435871
[UPDATE], 95thPercentileLatency(ms), 365
[UPDATE], Return=0, 2999228
[UPDATE], 0, 869530
...
[READ], Operations, 3000772
[READ], AverageLatency(us), 29704.56337702431
[READ], MinLatency(us), 353
[READ], MaxLatency(us), 14292934
[READ], 95thPercentileLatency(ms), 115
[READ], 99thPercentileLatency(ms), 345
[READ], Return=0, 3000772
[READ], 0, 706416
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 102.235
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 81274
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 998
[CLEANUP], 1, 0
...

real    8m43.573s
user    13m53.838s
sys     3m40.015s
#
So of the 6 million read/write operations, there were 3,000,772 reads and 2,999,228 updates (close enough to a 50/50 split). Reads took about 29.7 msecs on average, with a minimum latency of 353 usecs and a maximum latency of about 14.3 seconds. Updates (writes) were slower, averaging about 85.5 msecs.
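With 1,000 client threads, a quick Little's-law estimate (N = X * R) shows roughly how many threads were inside a database operation at any instant; the gap versus 1,000 suggests a good deal of each thread's time is spent on the client side. This is a rough back-of-the-envelope sketch using the summary numbers above:

```python
# Rough Little's-law estimate of average in-flight DB operations,
# using the [OVERALL] and per-op averages from the run summary.
throughput = 11478.86             # [OVERALL], Throughput(ops/sec)
avg_read_s = 29704.56e-6          # [READ], AverageLatency(us), in seconds
avg_update_s = 85524.36e-6        # [UPDATE], AverageLatency(us), in seconds
avg_latency_s = (avg_read_s + avg_update_s) / 2   # ~50/50 mix

concurrency = throughput * avg_latency_s
print(round(concurrency))         # roughly 660 of the 1000 client threads
```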
Let's try workloadb, a 95/5 split between reads and writes (95% reads).
# time ./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000
...
YCSB Client 0.1
Command line: -db com.yahoo.ycsb.db.MongoDbClient -s -P workloads/workloadb -p mongodb.url=mongodb://199.192.241.168:27017 -p insertorder=ordered -threads 1000 -p operationcount=6000000 -t
Loading workload...
Starting test.
 0 sec: 0 operations;
mongo connection created with 199.192.241.168:27017/ycsb
 10 sec: 39561 operations; 3937.99 current ops/sec; [UPDATE AverageLatency(us)=64134.53] [READ AverageLatency(us)=77735.02]
 20 sec: 128222 operations; 8806.22 current ops/sec; [UPDATE AverageLatency(us)=33152.18] [READ AverageLatency(us)=37736.67]
 30 sec: 202620 operations; 7342.87 current ops/sec; [UPDATE AverageLatency(us)=75466.98] [READ AverageLatency(us)=58739.72]
 40 sec: 286881 operations; 8426.1 current ops/sec; [UPDATE AverageLatency(us)=52512.36] [READ AverageLatency(us)=43191.99]
 50 sec: 381094 operations; 9420.36 current ops/sec; [UPDATE AverageLatency(us)=31639.92] [READ AverageLatency(us)=26998.68]
 60 sec: 483100 operations; 10199.58 current ops/sec; [UPDATE AverageLatency(us)=33720.43] [READ AverageLatency(us)=52603.76]
 70 sec: 587825 operations; 10418.32 current ops/sec; [UPDATE AverageLatency(us)=42287.56] [READ AverageLatency(us)=37719.84]
...
[OVERALL], RunTime(ms), 559369.0
[OVERALL], Throughput(ops/sec), 10726.37203706319
[UPDATE], Operations, 299901
[UPDATE], AverageLatency(us), 44403.52669714339
[UPDATE], MinLatency(us), 379
[UPDATE], MaxLatency(us), 3324334
[UPDATE], 95thPercentileLatency(ms), 138
[UPDATE], 99thPercentileLatency(ms), 740
[UPDATE], Return=0, 299901
[UPDATE], 0, 145122
[UPDATE], 1, 14669
...
[READ], Operations, 5700099
[READ], AverageLatency(us), 37128.780924682185
[READ], MinLatency(us), 329
[READ], MaxLatency(us), 3876008
[READ], 95thPercentileLatency(ms), 137
[READ], 99thPercentileLatency(ms), 381
[READ], Return=0, 5700099
[READ], 0, 2938426
[READ], 1, 193083
...
[CLEANUP], Operations, 1000
[CLEANUP], AverageLatency(us), 95.518
[CLEANUP], MinLatency(us), 1
[CLEANUP], MaxLatency(us), 91746
[CLEANUP], 95thPercentileLatency(ms), 0
[CLEANUP], 99thPercentileLatency(ms), 0
[CLEANUP], 0, 999
[CLEANUP], 1, 0
[CLEANUP], 2, 0
...

real    9m19.973s
user    15m37.622s
sys     3m28.291s
#
Here, there are 5,700,099 reads and 299,901 updates, roughly the 95/5 split that workloadb specifies. The remaining workloads yield similar results.
Now we'll run the benchmark and begin to do some active benchmarking. For starters, let's use prstat(1M) on the system where mongod is running.
# prstat -c
Please wait...
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 24426 mongodb    70G  244M cpu21   60    0   3:06:37  12% mongod/2014
 48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1
 42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1
 48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1
 61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6
 41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1
 59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1
 61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1
 61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1
 59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
 41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1
 61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1
 61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3
 59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13
 61378 root     4340K 1496K sleep   59    0   0:00:10 0.0% sshd/1
Total: 23 processes, 2100 lwps, load averages: 12.39, 7.57, 3.78
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 24426 mongodb    70G  242M cpu22   52    0   3:06:49  12% mongod/2014
 76885 root     4064K 3276K cpu6    59    0   0:00:00 0.0% prstat/1
 48484 root     3256K 2036K sleep   59    0   0:00:04 0.0% bash/1
 42051 root     3272K 2248K sleep   59    0   0:00:00 0.0% bash/1
 48278 root     5560K 1388K sleep   59    0   0:00:11 0.0% sshd/1
 61323 root     7360K 2376K sleep   59    0   0:00:49 0.0% rsyslogd/6
 41639 root     5560K 2904K sleep   59    0   0:00:00 0.0% sshd/1
 59741 root     2676K 1388K sleep   59    0   0:00:06 0.0% init/1
 61348 root     1652K 1008K sleep   59    0   0:00:04 0.0% utmpd/1
 61357 root     1940K 1028K sleep   59    0   0:00:03 0.0% ttymon/1
 59626 root        0K    0K sleep   60    -   0:00:00 0.0% zsched/1
 41641 root     6224K 3744K sleep   59    0   0:00:00 0.0% sshd/1
 61359 root     2276K 1060K sleep   59    0   0:00:02 0.0% ttymon/1
 61082 root     2488K 1276K sleep   59    0   0:00:02 0.0% pfexecd/3
 59793 root     8824K 6256K sleep   59    0   0:00:36 0.0% svc.configd/13
Total: 23 processes, 2100 lwps, load averages: 12.02, 7.57, 3.80
(^C)
#
So mongod is using ~12% of the CPUs on the machine. This machine has 24 CPUs, so this is approximately 2.9 CPUs at 100% utilization.
# psrinfo
0       on-line   since 11/22/2013 03:58:51
1       on-line   since 11/22/2013 03:58:58
2       on-line   since 11/22/2013 03:58:58
3       on-line   since 11/22/2013 03:58:58
4       on-line   since 11/22/2013 03:58:58
5       on-line   since 11/22/2013 03:58:58
6       on-line   since 11/22/2013 03:58:58
7       on-line   since 11/22/2013 03:58:59
8       on-line   since 11/22/2013 03:58:59
9       on-line   since 11/22/2013 03:58:59
10      on-line   since 11/22/2013 03:58:59
11      on-line   since 11/22/2013 03:58:59
12      on-line   since 11/22/2013 03:58:59
13      on-line   since 11/22/2013 03:58:59
14      on-line   since 11/22/2013 03:58:59
15      on-line   since 11/22/2013 03:58:59
16      on-line   since 11/22/2013 03:58:59
17      on-line   since 11/22/2013 03:58:59
18      on-line   since 11/22/2013 03:58:59
19      on-line   since 11/22/2013 03:58:59
20      on-line   since 11/22/2013 03:58:59
21      on-line   since 11/22/2013 03:58:59
22      on-line   since 11/22/2013 03:58:59
23      on-line   since 11/22/2013 03:58:59
#
Let's take a quick look at other statistics (vmstat, vfsstat, mpstat, and iostat) on the system running mongod.
# vmstat 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr rm s0 s1 zn   in   sy   cs us sy id
 0 0 118 87846368 6706592 4408 50302 5 57 58 0 29 -5789 -1 74 74 18565 68963 14973 10 9 81
 0 0 175 81510632 9860540 8639 97242 0 0 0 0 0 0 0 31 31 28396 129491 48858 13 9 78
 0 0 175 81454632 9853364 8364 94341 0 0 0 0 0 0 0 18 18 31016 125756 47106 14 9 77
 0 0 175 81518144 9862564 3537 23890 0 0 0 0 0 20 0 188 188 28130 139412 50791 17 6 77
 0 0 175 81517996 9862384 840 3519 0 0 0 0 0 0 0 25 25 25988 132854 56871 11 4 85
 0 0 175 81518020 9862768 3545 22590 0 0 0 0 0 0 0 15 15 24562 131160 51082 7 5 88
 0 0 175 81508880 9860908 5840 79660 0 0 0 0 0 0 0 30 30 29503 124580 47687 12 8 80
(^C)
#
The system as a whole is ~80% idle. There is no memory pressure on the system, but lots of page faults. We may come back to this later.
Here's vfsstat:
# vfsstat 1
  r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 76.1  16.5 243.5  55.4   0.0   0.0    2.8   28.6   0   0   0.2   12.4 ded0419b (239)
 43.7   0.0   4.9   0.0   0.0   0.0    2.9    0.0   0   0   0.0    0.0 ded0419b (239)
 43.7   0.0   4.9   0.0   0.0   0.0    3.0    0.0   0   0   0.0    0.0 ded0419b (239)
 43.5   5.8   4.9 417.6   0.0   0.0    3.0  128.1   0   0   0.0    0.0 ded0419b (239)
 43.3   9.6   4.8 639.6   0.0   0.0    2.9  117.1   0   0   0.0    0.0 ded0419b (239)
(^C)
#
So, not much file system activity in the zone where mongod is running. The zone is not being I/O throttled (the 'd/s' and 'del_t' columns in the output). This is to be expected: unlike many (most?) databases, MongoDB uses memory-mapped files rather than read/write for its database data, and periodically syncs the files to backing store.
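This memory-mapped-file pattern can be sketched in a few lines of Python (an illustration of the technique, not MongoDB's actual code): writes are plain memory stores into the mapping, and the data only needs to reach the file when the mapping is flushed, which is what mongod's periodic sync does.

```python
# Minimal sketch of the mmap-then-periodic-flush pattern MongoDB uses
# for its data files.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "datafile")
with open(path, "wb") as f:
    f.truncate(4096)                  # pre-size the file, as mongod does

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)  # map the file; no read()/write() calls
    mm[0:5] = b"hello"                # a "database write" is a memory store
    mm.flush()                        # the periodic sync: msync() underneath
    mm.close()

with open(path, "rb") as f:
    data = f.read(5)
print(data)                           # b'hello'
```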
Now mpstat:
# mpstat 1
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
...
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0 20718   0    4  2637  117 4506  190  394  245    0  8766   24  42   0  35
  1  3787   0   44   512   54 5167  114  276  144    0 10722   18  14   0  68
  2 13064   0   11   672   39 8919  364  659  206    0 17084   21  24   0  55
  3  1338   0   46   897  186 7578  270  378  206    0 13350   13  13   0  74
  4  5740   0   11   617   49 6637  281  467  155    0 10217   13  15   0  72
  5   134   0   21   427   56 4342   91  238   82    0  5418    6   7   0  87
  6  3418   0   11   780  100 8558  326  553  171    0 13850   15  16   0  69
  7   425   0   23   679   37 8544  248  373  204    0 15762   12  11   0  77
  8  1619   0    1   710   52 8241  341  430  115    0 13678   18  10   0  71
  9    51   0   25   458   41 5290  197  262  176    0  7459    9   9   0  82
 10  2476   0    4   748   66 9833  458  607  165    0 18239   18  13   0  68
 11  1378   0    1   830   75 9630  546  380  147    0 13152   13  12   0  75
 12  4315   0    1   620   65 5900  127  479  133    0 11299   20  15   0  66
 13  1903   0    8   472  105 5247   91  309  191    0 10625   15  13   0  72
 14  2016   0    5   390   30 5982  169  376  139    0 10792   18  16   0  66
 15  6371   0   53   630  271 3790   50  202  190    0  7473   10  13   0  77
 16  4501   0    5 15233 14632 4090  132  365  413    0  8432   32  37   0  31
 17  2962   0    9  7298 6753 3446   83  234  329    0 13258   47  20   0  33
 18  4545   0  494   402   26 4010   94  260  163    0  7613   34  21   0  45
 19     2   0  645   440  114 4668   47  219  113    0  8205   12   6   0  82
 20  1029   0   52   840  159 9325  278  606  117    0 18382   25  13   0  62
 21  1120   0 1431   348   71 6572   94  325  140    0 15726   22  12   0  67
 22  5110   0    4   886  177 6302  162  377  120    0 12207   21  13   0  66
 23  2075   0 1246   304   45 3236   75  193   99    0  7788   15   9   0  76
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0  1702   0   24  2553  140 8653  172  516  152    0 15491   14  13   0  73
  1   520   0  624   475   79 6676   99  317   79    0 12575   14   8   0  78
  2  2752   0   71   517   36 7173  151  465  103    0 14320   17  20   0  63
  3   153   0   67   384   34 5310  181  261  141    0  8308   11   7   0  82
  4   107   0   17   547   42 6262  218  368   97    0  8531   10   8   0  82
  5    11   0   25   410   62 3691   78  210   64    0  4769    4   4   0  92
  6  2216   0   38   774   97 7905  297  453  149    0 13275   12  11   0  77
  7     7   0   17   561   52 7084  216  322   97    0 11671   14   8   0  78
  8    18   0   58   508   31 5952  200  355   76    0  9057    9   7   0  84
  9     6   0   55   396   57 5069  129  219  104    0  6318    5   5   0  90
 10    42   0   79   644   65 8559  259  501  124    0 14886   12   9   0  79
 11   385   0   87   756  102 9742  472  350   85    0 12346   10   9   0  81
 12  1541   0   62   477   74 7014  116  543   92    0 14726   15  10   0  75
 13   944   0   18   356  103 5551   48  329   96    0  9791    8   7   0  85
 14     2   0   14   316   60 5367   74  347   62    0  9730   12  14   0  74
 15   905   0    8   425   41 2503   67  128   43    0  5236   42   4   0  54
 16     8   0   21 14268 13527 5552  135  413  294    0 11511   20  30   0  50
 17     5   0   48  6972 6357 5826   93  304  337    0 11184   23  18   0  59
 18     8   0  509   314   51 5479   81  299   60    0 10575   18  10   0  72
 19     7   0  804   564  202 4123   31  194   65    0 13360    7   6   0  88
 20     2   0   75   758  157 6378  152  356   64    0 11863   26  23   0  52
 21  2287   0  453   339   56 5706   69  256   69    0 11302   21   8   0  70
 22     3   0   64   867  172 5633   76  286   88    0 10716   11  18   0  71
 23    63   0  302   305   74 4025   47  201   46    0  7674    8   4   0  87
#
The machine is busy, but over 50% idle. Nothing unusual here.
What system calls are being made by mongod?
# dtrace -n 'syscall:::entry/pid == $target/{@[probefunc]=count();} tick-60s{exit(0);}' -p `pgrep mongod`
dtrace: description 'syscall:::entry' matched 235 probes
CPU     ID                    FUNCTION:NAME
 16  76285                        :tick-60s

  getpid                                                            1
  pwrite                                                            1
  ioctl                                                             2
  read                                                              2
  close                                                             3
  open                                                              3
  nanosleep                                                        12
  mmap                                                             30
  memcntl                                                          47
  mincore                                                          78
  fdsync                                                          527
  getcwd                                                          527
  lseek                                                           529
  fstat                                                           531
  lwp_mutex_timedlock                                             998
  lwp_mutex_wakeup                                               1137
  yield                                                          4617
  pollsys                                                        5964
  write                                                          6124
  gtime                                                         28018
  send                                                         477650
  recv                                                         956294
  lwp_park                                                    3121961
  clock_gettime                                               3590578
#
The (undocumented) lwp_park() system call is used for user-level synchronization; it is called when a user-level thread needs to block on a user-level mutex, reader/writer lock, or condition variable. The clock_gettime system call is made frequently, but should not consume too much time. The send/recv calls are interesting: recv(2) is called to retrieve requests from clients, and send(2) is called to return responses to them.
Snooping the network, saving the snoop output to a file, and then running wireshark on the captured output shows that the recv/send calls are probably all using the "Mongo Wire Protocol". Searching the internet for "Mongo Wire Protocol" yields http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/. From this document, we'll implement a DTrace script to watch packets being sent and received. The following script snoops packets and displays the message headers being sent/received.
#!/usr/sbin/dtrace -s
/* assumes packets to/from mongodb */

#pragma D option quiet

/*
 * MsgHeader layout from
 * http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/
 */
struct MsgHeader {
    int32_t messageLength; /* total message size, including this */
    int32_t requestID;     /* identifier for this message */
    int32_t responseTo;    /* requestID from the original request */
                           /* (used in responses from db) */
    int32_t opCode;        /* request type - see table below */
};

BEGIN
{
    printf("%10s: %-10s %-10s %-10s %-10s\n", "Direction", "Length",
        "ReqID", "ResTo", "opCode");
}

/*
 * When send/recv(2) are called, assume the packet is in the "mongo wire
 * protocol" (see
 * http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
 */
syscall::send:entry
/pid == $target/
{
    self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
    self->mL = self->smh->messageLength;
    self->reqID = self->smh->requestID;
    self->resTo = self->smh->responseTo;
    self->opCode = self->smh->opCode;
}

syscall::send:entry
/pid == $target && self->mL > 0/
{
    printf("%10s: %-10d %-10d %-10d %-10x\n", probefunc, self->mL,
        self->reqID, self->resTo, self->opCode);
    self->smh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
}

syscall::recv:entry
/pid == $target/
{
    self->a1 = arg1; /* save user address of packet */
}

syscall::recv:return /* now the packet is there */
/self->a1/
{
    self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
    self->mL = self->rmh->messageLength;
    self->reqID = self->rmh->requestID;
    self->resTo = self->rmh->responseTo;
    self->opCode = self->rmh->opCode;
}

syscall::recv:return
/self->a1 && self->mL > 0/
{
    printf("%10s: %-10d %-10d %-10d %-10d\n", probefunc, self->mL,
        self->reqID, self->resTo, self->opCode);
    self->a1 = 0;
    self->rmh = 0;
    self->mL = 0;
    self->reqID = 0;
    self->resTo = 0;
    self->opCode = 0;
}
Running the above script while the ycsb benchmark is running shows:
# ./mongo_snoop.d -p `pgrep mongod`
 Direction: Length     ReqID      ResTo      opCode
      recv: 61         962617     0          2004
      recv: 61         961087     0          2004
      recv: 61         961883     0          2004
      send: 110        480867     962617     1
      send: 110        480868     961087     1
      recv: 61         962405     0          2004
      send: 110        480869     961883     1
      recv: 61         961498     0          2004
      send: 110        480870     962405     1
      send: 110        480871     961498     1
      recv: 61         962013     0          2004
      recv: 61         962198     0          2004
      send: 110        480872     962013     1
      send: 110        480873     962198     1
      recv: 61         961709     0          2004
      recv: 1190       962736     0          2002
      recv: 1190       962735     0          2002
      recv: 61         961907     0          2004
      send: 110        480874     961709     1
      send: 110        480875     961907     1
      recv: 1190       962739     0          2002
      recv: 61         961771     0          2004
...
In the above output, opCode 1 is OP_REPLY, a reply sent to a client request. The ResTo field matches the ReqID that was received from the client. opCode 2004 is OP_QUERY, and 2002 is OP_INSERT. The benchmark is currently loading the database.
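The MsgHeader the script decodes is just four little-endian int32s, so it can be parsed in any language. Here is a small Python sketch of the same decoding, using the layout from the wire-protocol document (the sample values are taken from the snoop output above):

```python
# Decode a MongoDB wire-protocol MsgHeader: four little-endian int32s.
import struct

OP_REPLY, OP_UPDATE, OP_INSERT, OP_QUERY = 1, 2001, 2002, 2004

def parse_msg_header(packet: bytes):
    """Return (messageLength, requestID, responseTo, opCode)."""
    return struct.unpack_from("<iiii", packet, 0)

# Build a fake 61-byte OP_QUERY like the ones in the snoop output.
header = struct.pack("<iiii", 61, 962617, 0, OP_QUERY)
fields = parse_msg_header(header + b"\0" * 45)
print(fields)   # (61, 962617, 0, 2004)
```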
Here is a script that measures the latency between a request being received from a client and the response to that request.
#!/usr/sbin/dtrace -s
/* assumes packets to/from mongodb */

#pragma D option quiet

struct MsgHeader {
    int32_t messageLength; /* total message size, including this */
    int32_t requestID;     /* identifier for this message */
    int32_t responseTo;    /* requestID from the original request */
                           /* (used in responses from db) */
    int32_t opCode;        /* request type - see table below */
};

BEGIN
{
    printf("wait 1 minute...");
}

/*
 * When send/recv(2) are called, assume the packet is in the "mongo wire
 * protocol" (see
 * http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/)
 */
syscall::recv:entry
/pid == $target/
{
    self->a1 = arg1; /* save user address of packet */
}

syscall::recv:return /* now the packet is there */
/self->a1 && arg1 > 0/ /* arg1 = number of bytes received */
{
    self->rmh = (struct MsgHeader *)copyin(self->a1, sizeof(struct MsgHeader));
    self->mL = self->rmh->messageLength;
    self->reqID = self->rmh->requestID;
}

syscall::recv:return
/self->a1 && self->mL > 0/
{
    t[self->reqID] = timestamp;
    self->a1 = 0;
    self->rmh = 0;
    self->mL = 0;
    self->reqID = 0;
}

syscall::send:entry
/pid == $target/
{
    self->smh = (struct MsgHeader *)copyin(arg1, sizeof(struct MsgHeader));
    self->mL = self->smh->messageLength;
    self->resTO = self->smh->responseTo;
}

syscall::send:entry
/pid == $target && self->mL > 0 && t[self->resTO] != 0/
{
    @d["distribution in nsecs"] = quantize(timestamp - t[self->resTO]);
    @a["avg"] = avg(timestamp - t[self->resTO]);
    @t["total requests handled"] = count();
    t[self->resTO] = 0; /* clear before self->resTO itself is cleared */
    self->smh = 0;
    self->mL = 0;
    self->resTO = 0;
}

syscall::send:entry
/pid == $target && self->mL > 0/
{
    self->smh = 0;
    self->mL = 0;
    self->resTO = 0;
}

tick-60s
{
    exit(0);
}
This script runs for one minute and shows the distribution of times taken to handle database requests. Running the script shows:
# ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
wait 1 minute...
  distribution in nsecs
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              304521
           65536 |@@@@@@@@@@@@                             133288
          131072 |@                                        5643
          262144 |                                         271
          524288 |                                         6
         1048576 |                                         0

  avg                                                          65688
  total requests handled                                      443729
#
So requests take between 32 and 524 microseconds. Further information can be retrieved from other data in the requests and replies, but that is left as an exercise. The dynvarsize setting is there to get rid of "dynamic variable drops" messages.
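DTrace's quantize() action buckets values into power-of-two ranges, which is why the distribution rows above read 32768, 65536, 131072, and so on. The same bucketing is easy to reproduce when post-processing latency data outside of DTrace (a minimal sketch, with made-up sample values):

```python
# Reproduce DTrace quantize()-style power-of-two bucketing.
from collections import Counter

def quantize_bucket(ns: int) -> int:
    """Largest power of two <= ns, i.e. the row a value lands in."""
    return 1 << (ns.bit_length() - 1) if ns > 0 else 0

latencies_ns = [45_000, 52_000, 70_000, 131_073, 600_000]   # sample data
hist = Counter(quantize_bucket(v) for v in latencies_ns)
for bucket in sorted(hist):
    print(bucket, hist[bucket])
```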
Here is the output of the above script while the benchmark is running (as opposed to loading) the database.
# ./mongo_time.d -x dynvarsize=256m -p `pgrep mongod`
wait 1 minute...
  distribution in nsecs
           value  ------------- Distribution ------------- count
           16384 |                                         0
           32768 |@@@@@@@@@@@@@@                           181524
           65536 |@@@@@@@@@@@@                             155129
          131072 |@@@@@                                    64382
          262144 |@@@                                      39143
          524288 |@@@                                      38864
         1048576 |@@                                       25175
         2097152 |                                         5632
         4194304 |                                         1714
         8388608 |                                         451
        16777216 |                                         49
        33554432 |                                         0
        67108864 |                                         0
       134217728 |                                         100
       268435456 |                                         369
       536870912 |                                         0

  avg                                                         548253
  total requests handled                                      512532
#
Most of the requests take between 32 microseconds and 1 millisecond, but some take between 134 and 536 milliseconds. The average latency is much higher when running the database than when loading it. Another thing to note is that when the database is loading, there are many OP_INSERT requests from the YCSB client; when it is running, the client only sends OP_QUERY requests (seen by running the mongo_snoop.d script from earlier). Most likely, the longer requests are updates as opposed to lookups, but that has not been verified (and is left to the reader as an exercise).
As an experiment, I ran both ycsb and mongo on the same machine, and saw much higher ops/second numbers (over 30,000). Running two ycsb clients on different machines did not yield a linear improvement (only around a 20% increase in total ops/second). More experimentation is needed here. It would be interesting to see the effects of multiple clients on the virtual machines actually used for the published results, but I have not tried this.
What happens if we increase the size of the database? Instead of having a 6GB database, what if it were larger than the size of the virtual machine? The virtual machine is 8GB. Trying a 10GB database introduced some problems.
Let's go back to prstat and watch the mongod process specifically, but using the larger database.
# prstat -cmLp `pgrep mongod`
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24426 mongodb  2.0 0.8 0.0 0.0 0.0 2.1  95 0.3  2K 109 11K   0 mongod/22159
 24426 mongodb  1.7 0.7 0.0 0.0 0.0 1.6  96 0.2  2K  32 10K   0 mongod/22115
 24426 mongodb  1.6 0.6 0.0 0.0 0.0 1.7  96 0.2  1K  68  9K   0 mongod/22296
 24426 mongodb  1.4 0.5 0.0 0.0 0.0 1.3  97 0.2  1K  61  7K   0 mongod/21496
 24426 mongodb  1.3 0.5 0.0 0.0 0.0 2.0  96 0.2  1K  55  7K   0 mongod/22240
 24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  59  6K   0 mongod/21649
 24426 mongodb  1.2 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  61  6K   0 mongod/22153
 24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.7  97 0.2  1K  52  6K   0 mongod/22293
 24426 mongodb  1.1 0.4 0.0 0.0 0.0 1.0  97 0.1  1K  57  6K   0 mongod/22042
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.6  97 0.2  1K  43  5K   0 mongod/21461
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.5  97 0.1  1K  36  5K   0 mongod/22175
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.9  98 0.1  1K  17  6K   0 mongod/22073
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 0.6  98 0.1  1K  25  5K   0 mongod/22105
 24426 mongodb  1.0 0.3 0.0 0.0 0.0 1.3  97 0.1 986  45  5K   0 mongod/22273
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.7  98 0.1 964  13  5K   0 mongod/21529
Total: 1 processes, 982 lwps, load averages: 0.39, 0.86, 3.93
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24426 mongodb  1.2 0.4 0.0 0.0 0.0 0.8  97 0.1  1K  26  6K   0 mongod/22041
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 0.8  98 0.1  1K  18  5K   0 mongod/22040
 24426 mongodb  0.9 0.3 0.0 0.0 0.0 0.7  98 0.1 923  26  4K   0 mongod/21844
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 860  17  4K   0 mongod/22089
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 907  32  4K   0 mongod/22175
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.6  98 0.1 869  18  4K   0 mongod/22121
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.5  98 0.1 837  22  4K   0 mongod/22028
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.0  98 0.1 802  38  4K   0 mongod/22151
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 858  18  4K   0 mongod/21407
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.7  98 0.1 898  11  4K   0 mongod/22013
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 798  45  4K   0 mongod/22230
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.8  98 0.1 785  35  4K   0 mongod/22153
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 870  40  4K   0 mongod/22248
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 784  43  4K   0 mongod/22245
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.0  98 0.1 872  28  4K   0 mongod/22195
Total: 1 processes, 982 lwps, load averages: 0.46, 0.87, 3.91
(^C)
#
The mongod process has, at the time this was run, 982 threads. Most of these threads spend ~98% of their time sleeping. There is a little contention for CPUs (the LAT column is non-zero), and some time is spent waiting on locks (LCK is < 1%). But letting this run for a longer time shows that while most threads spend most of their time sleeping, thread ids 3 and 5 periodically spend time locking. Here is output from prstat when threads 3 and 5 are busy.
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 1.9  96 0.3  1K  53  5K   0 mongod/22503
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 1.8  97 0.2  1K  94  5K   0 mongod/22852
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.2  98 0.1 936  24  4K   0 mongod/22540
 24426 mongodb  0.7 0.4 0.0 0.0 0.0  99 0.3 0.0 273  11 975   0 mongod/5
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 1.1  98 0.1 881  56  4K   0 mongod/22700
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 2.0  97 0.2  1K  64  4K   0 mongod/22657
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.7  97 0.1 917  19  4K   0 mongod/22952
 24426 mongodb  0.8 0.3 0.0 0.0 0.0 0.9  98 0.1 793  13  4K   0 mongod/22444
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.2  98 0.1 924  31  3K   0 mongod/23237
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.1 757  39  3K   0 mongod/22353
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 838  13  3K   0 mongod/22973
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.5  97 0.2 789  51  3K   0 mongod/22821
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.6  97 0.1 876  18  3K   0 mongod/23127
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 0.9  98 0.1 740  20  3K   0 mongod/22590
 24426 mongodb  0.7 0.3 0.0 0.0 0.0 1.3  98 0.1 757  26  3K   0 mongod/22917
Total: 1 processes, 1013 lwps, load averages: 1.14, 0.39, 0.18
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24426 mongodb  0.0  38 0.0 0.0 0.0  61 0.0 0.0   8  81  15   0 mongod/3
 24426 mongodb  1.6 0.7 0.0 0.0 0.0 5.2  92 0.2  2K  47 11K   0 mongod/23059
 24426 mongodb  1.3 0.5 0.0 0.0 0.0 6.5  91 0.2  1K  84  8K   0 mongod/22978
 24426 mongodb  1.1 0.4 0.0 0.0 0.0 6.9  91 0.2  1K 103  6K   0 mongod/23061
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.0  92 0.1  1K  43  6K   0 mongod/23182
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 4.6  94 0.1  1K  27  6K   0 mongod/22815
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 6.4  92 0.2  1K 122  6K   0 mongod/22393
 24426 mongodb  1.0 0.4 0.0 0.0 0.0 5.7  93 0.1  1K  31  6K   0 mongod/22391
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  61  6K   0 mongod/22802
 24426 mongodb  1.0 0.3 0.0 0.0 0.0 6.4  92 0.1  1K  85  6K   0 mongod/23270
 24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.3  92 0.1  1K 109  6K   0 mongod/22871
 24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.6  92 0.1  1K  85  5K   0 mongod/22916
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 6.1  93 0.1  1K  26  6K   0 mongod/22517
 24426 mongodb  0.9 0.4 0.0 0.0 0.0 5.9  93 0.1  1K  59  6K   0 mongod/22864
 24426 mongodb  0.9 0.3 0.0 0.0 0.0 6.4  92 0.2  1K  97  6K   0 mongod/22459
Total: 1 processes, 1013 lwps, load averages: 1.20, 0.42, 0.19
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 24426 mongodb  0.0  59 0.0 0.0 0.0  40 0.1 0.0  19 230  18   0 mongod/3
 24426 mongodb  0.8 0.3 0.0 0.0 0.0  28  71 0.1 936  39  4K   0 mongod/22526
 24426 mongodb  0.8 0.3 0.0 0.0 0.0  24  75 0.1 889  22  4K   0 mongod/23089
 24426 mongodb  0.7 0.3 0.0 0.0 0.0  25  74 0.1 809  13  4K   0 mongod/22838
 24426 mongodb  0.7 0.3 0.0 0.0 0.0  27  72 0.1 946  23  4K   0 mongod/23077
 24426 mongodb  0.7 0.3 0.0 0.0 0.0  24  75 0.1 859  22  3K   0 mongod/23274
 24426 mongodb  0.6 0.3 0.0 0.0 0.0  29  70 0.1 933  40  4K   0 mongod/22814
 24426 mongodb  0.6 0.4 0.0 0.0 0.0  77  22 0.0 267  15 800   0 mongod/5
 24426 mongodb  0.6 0.2 0.0 0.0 0.0  29  70 0.1 645  32  3K   0 mongod/22790
 24426 mongodb  0.6 0.2 0.0 0.0 0.0  27  72 0.2 701  11  3K   0 mongod/22600
 24426 mongodb  0.6 0.2 0.0 0.0 0.0  28  71 0.1 832  28  3K   0 mongod/22413
 24426 mongodb  0.6 0.2 0.0 0.0 0.0  25  74 0.1 560   5  3K   0 mongod/22415
 24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.2 651  28  3K   0 mongod/23035
 24426 mongodb  0.5 0.2 0.0 0.0 0.0  27  72 0.1 544  38  2K   0 mongod/22738
 24426 mongodb  0.5 0.2 0.0 0.0 0.0  28  71 0.1 594  22  3K   0 mongod/23273
Total: 1 processes, 1013 lwps, load averages: 1.23, 0.44, 0.20
Let's see what thread 3 is doing. When it runs, it spends a lot of time in system mode (38 then 59 percent in the above output). We'll use DTrace to profile mongod to see what it is doing while in system mode.
# dtrace -n 'profile-97/pid == $target && tid == 3 && arg0/{@[stack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod` -o /var/tmp/mongod.tid3.out
#
Then I'll download the output file and convert it to a flamegraph (see https://github.com/davepacheco/node-stackvis).
# stackvis < mongod.tid3.out > mongod.tid3.htm
#
Opening the flamegraph shows that thread id 3 is spending almost all of its time in the memcntl(2) system call. Let's take a look at how memcntl(2) gets called.
# dtrace -n 'syscall::memcntl:entry/pid == $target && tid == 3/{@[ustack()] = count();} tick-60s{exit(0);}' -p `pgrep mongod`
dtrace: description 'syscall::memcntl:entry' matched 2 probes
CPU     ID                    FUNCTION:NAME
 13  76285                        :tick-60s

              libc.so.1`memcntl+0xa
              libc.so.1`msync+0xc9
              mongod`_ZN5mongo14PosixFlushable5flushEv+0x3e
              mongod`_ZN5mongo9MongoFile9_flushAllEb+0xfa
              mongod`_ZN5mongo9MongoFile8flushAllEb+0x2b
              mongod`_ZN5mongo12DataFileSync3runEv+0xf3
              mongod`_ZN5mongo13BackgroundJob7jobBodyEv+0xe6
              libboost_thread.so.1.55.0`thread_proxy+0x66
              libc.so.1`_thrp_setup+0x8a
              libc.so.1`_lwp_start
               19
#
Taking a look at the log file for mongod (/var/log/mongodb/mongod.log), we see many instances where mongod took a long time to flush the data files.
# grep -i flush mongod.log
2014-06-23T16:12:40.792+0000 [DataFileSync] flushing mmaps took 20970ms for 33 files
2014-06-23T18:57:32.343+0000 [DataFileSync] flushing mmaps took 12508ms for 33 files
2014-07-19T17:51:37.175+0000 [DataFileSync] flushing mmaps took 11880ms for 33 files
...
#
Examining the source code for mongodb, the flush uses the msync(3C) library function to do the syncing (see src/mongo/util/mmap_posix.cpp for the code). The msync call in turn calls the memcntl(2) system call. And the larger the database, the longer the flush (i.e., msync(3C)) takes.
The msync() function takes three arguments: a starting address, a length, and flags. On SmartOS, the kernel does a page lookup for each page in the range between the starting address and the starting address plus the length. For the database being loaded and run here, 6 million 1K-byte records (i.e., a 6GB database), that means 6GB/4KB (roughly 1.5 million) page lookups on every flush.
As the database gets larger, this can affect performance. Fortunately, in the case of mongodb there is another, much faster way to sync the database to backing store: fsync(3C). While msync flushes a range of modified pages to disk, fsync does the same for the modified pages of a file. In the case of mongodb, the end result of using msync versus fsync is equivalent: each msync call covers a 2GB area of memory that is mmap'd to a 2GB file, so instead of calling msync with the range of memory, fsync can be called with the file descriptor for the mmap'd file. With fsync, there is no need to do a page lookup; the pages are hanging off of the vnode_t for the file, which can be quickly retrieved via the file descriptor. This gets rid of the long flush times.
Another, more difficult, problem is that once the database grows past the 8GB virtual machine size, the memory capper for the zone kicks in. Performance at that point was two orders of magnitude worse than when the machine was below the memory cap; basically, the benchmark did not scale well with respect to the size of the database. This led my colleague, Jerry Jelinek, to take a look at the memory capper, and two changes were made to it: it no longer stops processes while they are being paged out, and it resumes where it left off.
These changes have made a vast difference: we no longer see performance drop precipitously with respect to database size. Much more analysis can be done from this benchmark, but maybe in another blog post (and maybe with a different database).
Post written by Max Bruning