Memory Capping on SmartOS

I have a question from Mike Zeller (@papertigerss), who says:

Hey Max,

I was playing around with some of the stuff from internals training. I noticed one of my zones at home was swapping against itself because of its cap. It seems like the mdb thread walk doesn't actually think those threads have swapped out in that zone. I was wondering if there was a way to get that info from the GZ or from within the zone itself.

Thanks,
Mike

metroid == GZ
notch == zone

[root@metroid ~]# zonememstat
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
                               global      128        -        -         -
 6fef98f8-9b12-46ed-9e2f-c3aabba0d7d9     1060     2048        0         0
 cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072     6349     49740

[root@notch ~]# zonememstat
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
 cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072     6349     49740

[root@metroid ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs sd idm crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs sata ]
> ::walk thread a | ::print -t kthread_t t_schedflag | ::grep "(.&1)==0" | ::eval "

Mike shows, using mdb -k, that no threads have been swapped out. So the question is: what happens when the memory used by a zone exceeds its cap?

To explain this, we'll create a zone with a small memory cap, then run an application which exceeds the cap to see what we can find out.

To follow along, you can download SmartOS and set up a zone by following wiki.smartos.org/display/DOC/How+to+create+a+zone+%28+OS+virtualized+machine+%29+in+SmartOS. Or you can provision a SmartMachine at joyent.com. The steps I'm going to go through take less than an hour, so if you destroy the machine when you're done, the cost should be pennies. For this, the smallest SmartMachine size works fine. I set up a SmartMachine on SmartOS with 256MB.

# vmadm list -v
UUID                                  TYPE  RAM      STATE             ALIAS
03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3  OS    256      running           -
fe0c65e1-0f61-4291-99e1-9d7a75dd091e  OS    32768    running           -
# zlogin 03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3
[Connected to zone '03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3' pts/5]
Last login: Fri Jan  4 15:32:29 on pts/2
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; SmartMachine (standard64 1.0.7)
                   `-'  http://wiki.joyent.com/jpc2/SmartMachine+Standard
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]#

In general, memory capping is done by a daemon. If the daemon detects that a zone has gone over its cap, it picks pages to push out to free up some of the memory in the zone. SmartOS also delays page faults (pages being brought into memory) for the zone (see the use of zone_pg_flt_delay in the as_fault() kernel routine). There is a long comment at the beginning of github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/zoneadmd/mcap.c which explains how things work. Maybe the most interesting part of the comment is at the end: you can turn on logging to see which processes are being examined and how much memory is being freed for each one. To do this, you need to be in the global zone.

# touch /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log

The mechanism used on SmartOS is slightly different from the one on other Solaris-like systems. See smartos.org/2012/07/03/smartos-max-brunings-talk-at-nosig for a description of the mechanism on SmartOS versus other Solaris derivatives.

Here, we'll do something to go over the cap of 256MB.

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=64M &

(Note the above is done in the 256MB zone)

Let's see what zonememstat reports:

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# zonememstat
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
 03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3      220      256        1       150

The zone has gone over its memory cap 1 time, and 150MB of memory were paged out.

We'll wait a few minutes, and take a look at the mcap_debug.log file (from the global zone).

# cat /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log
...  <-- output omitted
fast rss 265364KB
rss 242292KB, cap 262144KB, excess -19852KB  <-- plenty of room under the cap
sleep 30 seconds
fast rss 189868KB
rss 189868KB, cap 262144KB, excess -72276KB
sleep 120 seconds
phys-mcap-cmd:
phys-mcap-no-vmusage: 0
phys-mcap-no-pageout: 0
current cap 262144KB lo 209715KB hi 235929KB
fast rss 398384KB
rss 374156KB, cap 262144KB, excess 112012KB  <-- over the cap by 112012KB
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd  <-- start scanning processes where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
pid 6377: nmap 32 sz 2136KB rss 1232KB /sbin/init
pid 6377: unp 0 att 748KB drss -44KB excess 124531KB
...
pid: 6320 system process, skipping zsched
...
pid 6391: nmap 90 sz 10660KB rss 3084KB /lib/svc/bin/svc.configd  <-- this is the last process (from ps -e)
pid 6391: unp 0 att 3700KB drss -320KB excess 121619KB
process pass done; excess 121619   <-- have scanned the rest of the processes from where we started
fast rss 395064KB
rss 372284KB, cap 262144KB, excess 110140KB  <-- still over the cap
starting to scan, excess 123247k  <-- start scanning from the beginning
pid 6701: nmap 25 sz 1932KB rss 1052KB /usr/lib/saf/ttymon -g -d /dev/console -l console -T vt100 -m ldterm,ttcompat -
pid 6701: unp 0 att 948KB drss 0KB excess 123247KB
pid 6469: nmap 58 sz 3912KB rss 1436KB /lib/inet/ipmgmtd
pid 6469: unp 0 att 2368KB drss 0KB excess 123247KB
pid 6686: nmap 17 sz 1632KB rss 992KB /usr/lib/utmpd
pid 6686: unp 0 att 336KB drss -28KB excess 123219KB
...
pid 7786: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M  <-- one of the dd's
pid 7786: unp 0 att 131184KB drss -131184KB excess -8149KB  <-- and we freed up most of the space
apparently under; excess -8149  <-- now under the cap
fast rss 394992KB
rss 371348KB, cap 262144KB, excess 109204KB  <-- but the dd is still running, so we're back over!
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd  <-- continue where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 122311KB
pid 6833: nmap 61 sz 113940KB rss 20004KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42356KB drss -576KB excess 121735KB
...
pid 7787: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M
pid 7787: unp 0 att 131184KB drss -131184KB excess -9761KB
apparently under; excess -9761  <-- and we're back under
fast rss 261940KB
rss 261940KB, cap 262144KB, excess -204KB  <-- still under, but the dd's are eating it up again
sleep 30 seconds  <-- but we were under, so sleep for a bit
fast rss 384872KB
rss 362560KB, cap 262144KB, excess 100416KB  <-- over again
pid 6714: nmap 22 sz 2120KB rss 1252KB /usr/lib/saf/sac -t 300
pid 6714: unp 0 att 576KB drss -24KB excess 113499KB
...

So, if the zone is under the memory cap, and stays under the cap, a thread in zoneadmd for the zone periodically checks to see if the zone is over the cap. It uses a "fast" algorithm, which is an approximation that may exceed the actual value of used memory (it may count pages shared between processes multiple times). If the fast algorithm shows the zone is over the cap, it uses a more expensive but more accurate algorithm to determine how much the zone is over its cap. The thread then starts scanning processes looking for memory that can be freed. It starts the scan from where it last left off. The order it scans the processes is the same as seen by doing ps -e in the zone.
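To make the two-stage check a bit more concrete, here is a toy Python model of it. This is not the zoneadmd code: the page-set representation, the function names (fast_rss_kb, accurate_rss_kb, capping_pass), the 4KB page size, and the choice to page out a whole process at a time are all assumptions made for the sketch. It only illustrates why the fast estimate can overshoot (shared pages are counted once per process) and how a scan can resume from where the previous one stopped.

```python
PAGE_KB = 4  # assumed page size for this toy model


def fast_rss_kb(procs):
    # Cheap estimate: sum per-process RSS, so shared pages count once per process.
    return sum(len(pages) for pages in procs.values()) * PAGE_KB


def accurate_rss_kb(procs):
    # Expensive estimate: count every resident page exactly once.
    return len(set().union(*procs.values())) * PAGE_KB


def capping_pass(procs, order, start, cap_kb):
    """One check of a zone. Returns (next_start_index, freed_kb).

    procs: dict mapping pid -> set of resident page ids (hypothetical model)
    order: list of pids in scan order
    start: index in `order` where the previous scan stopped
    """
    if fast_rss_kb(procs) <= cap_kb:
        return start, 0                  # fast estimate is under the cap
    excess = accurate_rss_kb(procs) - cap_kb
    if excess <= 0:
        return start, 0                  # fast estimate overshot; really under
    freed = 0
    i = start
    for _ in order:                      # visit each process at most once
        before = accurate_rss_kb(procs)
        procs[order[i]].clear()          # model: page out the whole process
        drss = accurate_rss_kb(procs) - before   # zero or negative
        excess += drss
        freed -= drss
        i = (i + 1) % len(order)         # remember where to resume next time
        if excess <= 0:
            break
    return i, freed


shared = set(range(100))                 # pages mapped by every process
procs = {
    1: shared | set(range(1000, 1050)),
    2: shared | set(range(2000, 2300)),
    3: shared | set(range(3000, 3020)),
}
order = [1, 2, 3]
print(fast_rss_kb(procs), accurate_rss_kb(procs))    # 2680 1880
print(capping_pass(procs, order, 0, cap_kb=2000))    # (0, 0): fast over, accurate under
print(capping_pass(procs, order, 0, cap_kb=1000))    # (2, 1400): scan stops after pid 2
```

In the second call the fast estimate (2680KB) is over the 2000KB cap but the accurate one (1880KB) is not, which is the same situation as the "fast rss 265364KB" / "rss 242292KB" pair in the log. In the third call the scan stops as soon as the excess goes negative and returns the index to resume from.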

Let's take a closer look at some of the lines of output above.

fast rss 265364KB
  • The resident set size (physical memory used by the zone) calculated using the "fast" algorithm.

    rss 242292KB, cap 262144KB, excess -19852KB
    sleep 30 seconds

  • The fast algorithm said it's over the cap, but the accurate algorithm says we are 19852KB under the cap. So we'll sleep 30 seconds and look again.

    fast rss 398384KB
    rss 374156KB, cap 262144KB, excess 112012KB

  • Now, we're really over the cap. Time to scan processes.

    pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd
    pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
    pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
    pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB

  • pid -> process id.

  • nmap -> number of segments in the address space of this process. See pmap(1).

  • sz -> Virtual address space size for the process.

  • rss -> Current amount of physical memory for the process.

  • The command line args for the process

  • unp -> unpageable mappings

  • att -> attempted amount of memory to free. Basically, it tries to free the segments of the address space.

  • drss -> delta RSS. The amount subtracted from the RSS (i.e., the amount freed)

  • excess -> current amount over (under if negative) the cap
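The arithmetic behind these fields can be checked against the log itself. The following Python sketch parses a few lines copied from the output above; the regular expressions and helper names (parse_summary, parse_scan) are my own and only assume the line layout shown in this post, not a documented log format.

```python
import re

# Patterns for the two line types used here (layout assumed from the log above).
SUMMARY = re.compile(r"^rss (\d+)KB, cap (\d+)KB, excess (-?\d+)KB")
SCAN = re.compile(r"^pid (\d+): unp (\d+) att (\d+)KB drss (-?\d+)KB excess (-?\d+)KB")


def parse_summary(line):
    m = SUMMARY.match(line)
    if m is None:
        return None
    rss, cap, excess = map(int, m.groups())
    return {"rss": rss, "cap": cap, "excess": excess}


def parse_scan(line):
    m = SCAN.match(line)
    if m is None:
        return None
    pid, unp, att, drss, excess = map(int, m.groups())
    return {"pid": pid, "unp": unp, "att": att, "drss": drss, "excess": excess}


log = """\
rss 374156KB, cap 262144KB, excess 112012KB
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
pid 6377: nmap 32 sz 2136KB rss 1232KB /sbin/init
pid 6377: unp 0 att 748KB drss -44KB excess 124531KB
""".splitlines()

summary = next(s for s in map(parse_summary, log) if s)
scans = [s for s in map(parse_scan, log) if s]

# On the summary line, excess is simply rss minus cap.
assert summary["rss"] - summary["cap"] == summary["excess"]

# During the scan, each excess is the previous excess plus drss
# (drss is negative when memory was freed).
for prev, cur in zip(scans, scans[1:]):
    assert prev["excess"] + cur["drss"] == cur["excess"]

# The scan starts from a larger excess than the summary line reports.
margin = scans[0]["excess"] - scans[0]["drss"] - summary["excess"]
print(margin, summary["cap"] * 5 // 100)
```

The last line prints the same number twice (13107): in this log, the excess used while scanning is consistently the summary excess plus about 5% of the cap. That suggests the scan aims somewhat below the cap rather than exactly at it, but this is an inference from the numbers, not something the log states.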

Mike sees no threads swapped out because the system itself has plenty of free memory. Swapping only occurs when the entire system is memory stressed, not just some zone that has gone over its cap. If you don't have access to the global zone, you won't be able to see the mcap_debug.log file. However, you might be able to use prstat to see the RSS values for processes in your zone adjusting downward as zoneadmd does its work.

The mechanism described above is specific to SmartOS. Other Solaris and Solaris-like OSes do this differently.


Post written by Mr. Max Bruning