DTrace, MySQL, Ganglia, and Digging for Solutions

April 24, 2008 - by jason

I stay generic when blogging, but I had an interesting problem today that I thought I’d share. I’ve got a Ganglia rollout that’s constantly growing. Some time ago gmetad started struggling to keep up and would report nodes down (oddly, if you refreshed the page over and over you’d see the number of down nodes roll up and down: 8 down, 7, 6, 4, 2, 0, 8…). The temporary fix was to simply reduce the number of monitored clusters. But this irritated me because gmetad really is pretty light on CPU and memory usage… Ganglia is synonymous with “scale”, so why won’t it scale?

# iostat -xn sd3 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   92.1  158.2 5460.8 2075.6  0.0  9.8    0.0   39.0   0  71 sd3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  207.0  244.8 12990.2  509.0  0.0 15.2    0.0   33.5   0 100 sd3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   64.0  330.8 4076.2 1083.7  0.0 19.2    0.0   48.7   0 100 sd3

I had noticed via iostat that disk IO was excessive for what the system was doing, so that’s where I started. I could have written some D scripts to explore, but Brendan Gregg’s DTrace Toolkit already contains useful scripts like iotop and iosnoop, so I decided to save some time and use those. Here’s a snapshot from iotop:

2008 Apr 23 21:48:47,  load: 0.32,  disk_r:  34909 KB,  disk_w:  20108 KB

  UID    PID   PPID CMD              DEVICE  MAJ MIN D            BYTES
    0      3      0 fsflush          sd3      31 192 W           744448
  100  24277  24180 mysqld           sd3      31 196 W          3801088
  100  24277  24180 mysqld           sd3      31 196 R          4358144
    0      0      0 sched            sd3      31 196 W         16045056
    0      0      0 sched            sd3      31 196 R         31389188

MySQL is doing a lot more IO than it ought to be. I know that Zabbix is using MySQL, but I’ve got several other apps that may be using it as well, so I’ll keep it on my hit list. The “sched” IO is actually cache flushing, typically from the ZFS ARC (ZFS’s buffers). So I need to look at what MySQL is doing, and then at file IO again, this time at the POSIX layer so that I can see which files are being touched.
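A quick way to get that POSIX-layer view without writing a full script is a one-liner along these lines (a sketch of the approach, not what I ran verbatim; the fds[] translator needs a reasonably recent Solaris build and enough privileges):

# Count reads and writes issued by mysqld, keyed on the file's pathname
dtrace -n 'syscall::read:entry,syscall::write:entry /execname == "mysqld"/ { @[fds[arg0].fi_pathname, probefunc] = count(); }'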

MySQL first… I like my my_qcache.d script because it shows each query, its execution time, and whether or not it’s being served from the query cache (Qcache). Here is the script:

#!/usr/sbin/dtrace -s
/*
 *   Query Cache Viewer; benr@joyent.com
 */

#pragma D option quiet

BEGIN
{
        printf("Waiting for queries... hit ^c to quit.\n");
}

/* Grab the query text and a start timestamp as the command is dispatched */
pid$target:mysqld:*dispatch_command*:entry
{
        self->query = copyinstr(arg2);
        self->start = timestamp;
}

/* The result is being served from the query cache */
pid$target:mysqld:*send_result_to_client*:entry
{
        self->cache = "Yes";
}

/* A real select is being executed */
pid$target:mysqld:*do_select*:entry
{
        self->cache = "No";
}

pid$target:mysqld:*dispatch_command*:return
{
        this->end = timestamp;
        this->elapsed = (this->end - self->start)/1000000;
        printf("From Query Cache?: %s in %d ms \t| Query: %s\n", self->cache, this->elapsed, self->query);
        self->start = 0;
        self->query = 0;
}

Here is a sample of what I got…

$ ./my_qcache.d -p `pgrep -x mysqld`
Waiting for queries... hit ^c to quit.
From Query Cache?: Yes in 0 ms  | Query: begin;
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=1 and  httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=0 and  httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: No in 0 ms   | Query: select httptestid,name,applicationid,nextcheck,status,delay,macros,agent from httptest where status=0 and nextcheck<=1208987526 and mod(httptestid,5)=3 and  httptestid>=100000000000000*0 and httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=3 and  t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=1 and  t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999)
From Query Cache?: Yes in 0 ms  | Query: update items set nextcheck=1208988127, lastclock=1208987527 where itemid=20837
From Query Cache?: Yes in 10 ms         | Query: commit;

After watching queries fly by, it’s clear that all of this is coming from Zabbix. A lot of data is being written, which is to be expected, but there is also a lot of disk read that calls for some tuning. I’ll set that aside for now; given that Zabbix is an auxiliary monitoring solution, I’ll just shut it down to concentrate on Ganglia. After shutting down Zabbix I check the MySQL query activity again:

root@ev2-admin tmp$ ./my_qcache.d -p `pgrep -x mysqld`
Waiting for queries... hit ^c to quit.
^C

Ok, MySQL is quiet. Now I can dig into Ganglia without auxiliary noise.

I looked back at iostat and things didn’t look drastically different. How can this be? For a second I thought that perhaps Ganglia might simply be running out of file descriptors; perhaps it’s opening lots of files (RRDs in this case) simultaneously and that’s my cap? So I’ll check the resource controls (rctls) with a special eye on the FD limit:

# prctl -i process `pgrep gmetad`
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
...
process.max-file-descriptor
        basic             256       -   deny                             12528
        privileged      65.5K       -   deny                                 -
        system          2.15G     max   deny                                 -
...

So the “basic” limit is still in place, capping the process at the standard 256 file descriptors. (The “recipient” is the PID, not a counter.) Now, I can increase that threshold or simply drop the “basic” limit to allow the full “privileged” limit, but do I need to? Using pfiles I’ll check the number of open file descriptors… I run it a couple of times to see how much the number changes over time:

# pfiles `pgrep gmetad`
12528:  /opt/ganglia/sbin/gmetad
  Current rlimit: 256 file descriptors
   0: S_IFREG mode:0444 dev:278,4 ino:5705 uid:60001 gid:0 size:336
      O_RDONLY
      /proc/12528/psinfo
   1: S_IFSOCK mode:0666 dev:283,0 ino:8706 uid:0 gid:0 size:0
      O_RDWR
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152)
        sockname: AF_INET 0.0.0.0  port: 8651
   2: S_IFSOCK mode:0666 dev:283,0 ino:57377 uid:0 gid:0 size:0
      O_RDWR
        SOCK_STREAM
        SO_REUSEADDR,SO_KEEPALIVE,SO_SNDBUF(49152),SO_RCVBUF(49152)
        sockname: AF_INET 0.0.0.0  port: 8652
   3: S_IFDOOR mode:0444 dev:286,0 ino:35 uid:0 gid:0 size:0
      O_RDONLY|O_LARGEFILE FD_CLOEXEC  door to nscd[24142]
      /var/run/name_service_door
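(As an aside: if the limit ever did turn out to be the problem, prctl can adjust it on the live process without a restart. Something like the following should do it, though I’m writing this from memory, so check prctl(1) before trusting the flags:)

# Raise the basic limit for the running gmetad, e.g. to 4096...
prctl -n process.max-file-descriptor -t basic -r -v 4096 -i process `pgrep gmetad`
# ...or delete the basic value entirely so the privileged limit applies
prctl -n process.max-file-descriptor -t basic -x -i process `pgrep gmetad`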

Well, OK, so clearly the number of file descriptors isn’t the issue… but I wanted to rule it out completely just in case. While I could use DTrace, this isn’t a production-impacting application, so I did a quick naughty look with truss:

# truss -p  `pgrep gmetad`
...
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx", 0755)     Err#17 EEXIST
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy", 0755) Err#17 EEXIST
/125:   xstat(2, "/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/load_fifteen.rrd", 0xF6F3B874) = 0
/125:   open64("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/load_fifteen.rrd", O_RDWR) = 7
/125:   fstat64(7, 0xF6F3B200)                          = 0
/125:   fstat64(7, 0xF6F3B130)                          = 0
/125:   ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
/125:   read(7, " R R D? 0 0 0 3????".., 4096)     = 4096
/125:   llseek(7, 0, SEEK_CUR)                          = 4096
/125:   lseek(7, 0, SEEK_END)                           = 12112
/125:   llseek(7, 0, SEEK_CUR)                          = 12112
/125:   llseek(7, 0, SEEK_CUR)                          = 12112
/125:   lseek(7, 1312, SEEK_SET)                        = 1312
/125:   fcntl(7, F_SETLK64, 0xF6F3B314)                 = 0
/125:   mmap64(0x00000000, 12112, PROT_READ|PROT_WRITE, MAP_SHARED, 7, 0) = 0xF6B30000
/125:   munmap(0xF6B30000, 12112)                       = 0
/125:   llseek(7, 0, SEEK_CUR)                          = 1312
/125:   lseek(7, 772, SEEK_SET)                         = 772
/125:   write(7, "9DCC0F H???? 1 . 5 5".., 540)     = 540
/125:   close(7)                                        = 0
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx", 0755)     Err#17 EEXIST
/125:   mkdir("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy", 0755) Err#17 EEXIST
/125:   xstat(2, "/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/proc_total.rrd", 0xF6F3B874) = 0
/125:   open64("/opt/ganglia/data/xxxxxxxxx/yyyyyyyyyyy/proc_total.rrd", O_RDWR) = 7
/125:   fstat64(7, 0xF6F3B200)                          = 0
/125:   fstat64(7, 0xF6F3B130)                          = 0
/125:   ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
/125:   read(7, " R R D? 0 0 0 3????".., 4096)     = 4096
/125:   llseek(7, 0, SEEK_CUR)                          = 4096
/125:   lseek(7, 0, SEEK_END)                           = 12112
^C/125: ioctl(7, TCGETA, 0xF6F3B1CC)                    Err#25 ENOTTY
...

So it’s doing a lot of IO, and it’s happening sequentially: a file is opened, worked on, closed, then on to the next… this clearly shows that file descriptor limits are not an issue here.
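(If I’d wanted to quantify that churn without truss’s overhead, a DTrace one-liner counting opens per RRD over a short window would do it; a rough sketch:)

# Count open64() calls by gmetad, per file, for ten seconds
dtrace -n 'syscall::open64:entry /execname == "gmetad"/ { @[copyinstr(arg0)] = count(); } tick-10sec { exit(0); }'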

Truss is a quick and dirty tool, but I want a little more precision. Let’s try something out of Brendan’s bag of tricks:

$ ./fsrw.d
Event            Device RW     Size Offset Path
sc-read               .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
fop_read              .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
sc-write              .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
fop_write             .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.disk_total.rrd.Imbi4x
sc-read               .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
fop_read              .  R     4096      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
sc-write              .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x
fop_write             .  W      540      0 /zones/z000000AA/root/opt/ganglia/data/xxxxxxxx/yyyyyyyyyyyy/.cpu_speed.rrd.Dmbi4x

So fsrw.d from the DTrace Toolkit sends lines flying by… we’re doing a lot more IO than I realized, and we’re doing it in small chunks. Looking back at my earlier iostat output, I notice that the number of ops shown was low even though we’re doing a ton of IO… ZFS is clearly doing a good job of grouping and caching IO for me. The disk just can’t seem to keep up, and I’ve still got more than half of my Ganglia nodes unmonitored, meaning I need to at least double the IO load to solve my problem.
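(If you’re curious how much of that grouping and caching the ARC is absorbing, the arcstats kstat gives a quick read on cache size and hit rate; statistic names vary a little between releases, but roughly:)

kstat -p zfs:0:arcstats:size zfs:0:arcstats:hits zfs:0:arcstats:misses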

Let’s look at that iostat output again:

    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
...
   92.1  158.2 5460.8 2075.6  0.0  9.8    0.0   39.0   0  71 sd3
  207.0  244.8 12990.2  509.0  0.0 15.2    0.0   33.5   0 100 sd3
   64.0  330.8 4076.2 1083.7  0.0 19.2    0.0   48.7   0 100 sd3

These are 1-second intervals. The throughput is low and the IO size isn’t terrible (kw/s divided by w/s, e.g. 2075.6 ÷ 158.2 ≈ 13 KB per write), but my asvc_t (average service time, in ms) is higher than I’d expect on local 10K SAS drives, and the disk is pegged at 100% busy. Put those two facts together, 100% busy plus an asvc_t significantly higher than the drive’s average latency (~10ms), and I know the drive is moving its head far more than it should; we’re clearly dealing with a lot of scattered blocks on disk. I could use DTrace to look at the LBAs of the IOs, but I don’t think I need to go that far. The problem is clear: the disk just can’t keep up with the flood of IO from Ganglia. Normally this is where I’d go looking for a new RAID controller or tuning any kernel parameter I can find… but we’re not talking about a lot of data here. These are RRD files; the total size of my Ganglia data/ directory is less than 50MB!
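(Had I wanted to confirm the scattered-block theory, the io provider can show the spread of block offsets hitting the device; a sketch of what that would look like:)

# Distribution of block offsets for IOs issued to sd3
dtrace -n 'io:::start /args[1]->dev_statname == "sd3"/ { @["sd3 block offset"] = quantize(args[0]->b_blkno); }'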

One way to solve this problem would be to get a really big RAID controller, but that’s expensive. Besides, how does it really speed things up? Caches! Well, hell, I’ve already got a lot of free memory on this system and I’m not dealing with gigs of data… why not just store the Ganglia data files in system memory? If I did that I’d give Ganglia more IO speed than it could ever use and free up my disks to do something more useful.

Enter tmpfs. Rather than code some nifty caching mechanism into gmetad, I could just mount a tmpfs segment as its data directory. It seems silly at first, but it would be fast, and looking at Web 2.0 scalability apps like Varnish, Memcached, etc., they’ll store anything in memory… why not this? The only trick is syncing the data in memory back to disk periodically in case of a crash.
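Since the whole data directory is under 50MB, a 100MB tmpfs leaves comfortable headroom, and it’s easy to sanity-check the footprint before and after:

# How big is the data set, really?
du -sk /opt/ganglia/data
# And once the tmpfs is mounted, confirm the mount point and free space
df -k /opt/ganglia/data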

So I wrote up a startup wrapper for gmetad (which is invoked as an SMF method) that looks like this:

#!/usr/bin/bash
## SMF Start method for Gmetad
## -benr

if ( mount | grep "/opt/ganglia/data on swap" >/dev/null )
then
        echo "Tmpfs already mounted."
else
        echo "Mounting tmpfs..."
        chown nobody /opt/ganglia/data
        mount -F tmpfs -o size=100m,noxattr swap /opt/ganglia/data
fi

## Now sync the data in if it's empty:
if [ -d /opt/ganglia/data/__SummaryInfo__/ ]
then
        echo "Cache primed, ready to start."
        chown -R nobody /opt/ganglia/data
else
        echo "Priming the cache..."
        /opt/csw/bin/rsync -at /opt/ganglia/data-disk/ /opt/ganglia/data/
        chown -R nobody /opt/ganglia/data
fi

## Finally, start ganglia:
/opt/ganglia/sbin/gmetad
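Hooking that wrapper in as the SMF start method is just a property change on whatever service already manages gmetad; a sketch, assuming a hypothetical FMRI of ganglia/gmetad and that the wrapper above is saved as /opt/ganglia/sbin/gmetad-start.sh:

svccfg -s ganglia/gmetad setprop start/exec = astring: "/opt/ganglia/sbin/gmetad-start.sh"
svcadm refresh ganglia/gmetad
svcadm restart ganglia/gmetad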

Now I just add a cronjob to rsync the tmpfs data every 5 minutes:

# Sync gmetad ram-disk to physical-disk
#
5,10,15,20,25,30,35,40,45,50,55,0 * * * * /opt/csw/bin/rsync -at /opt/ganglia/data/ /opt/ganglia/data-disk/
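The only gap left is a clean shutdown: without a final sync you’d lose up to five minutes of samples sitting in tmpfs. A matching stop method would close that; a sketch (untested here):

#!/usr/bin/bash
## SMF stop method sketch for gmetad (hypothetical): stop the daemon,
## then flush the tmpfs copy back to physical disk one last time.
pkill -x gmetad
/opt/csw/bin/rsync -at /opt/ganglia/data/ /opt/ganglia/data-disk/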

After about 20 minutes of tinkering I got things right and restarted Ganglia, this time with all the clusters enabled. To my delight, the disk got a much-needed vacation, overall system performance smoothed out significantly because the IO subsystem was no longer busy all the time, Ganglia worked perfectly and faster than ever, and on top of it all I’ve increased the life of my disk because it’s not pounding the heads 24 hours a day.

After sharing my solution a bit, I found that a lot of people are using tmpfs for RRD storage for the same reason I am. Wish they’d told me before. :)

We’ve seen here how powerful the observability capabilities of Solaris are: how DTrace helped us understand IO activity and even peek inside a MySQL database, how iostat helped us understand what my hard drives were doing, and how truss, rctls, and the Solaris ptools (pfiles in this case) helped us investigate and rule out some hunches along the way. Furthermore, we’ve seen how we can exploit tmpfs to trade a little memory for some big wins in both longevity and performance, without having to write a line of code: just a couple of commands in a shell script combined with a cron job.

Solaris for the win. ;)

