Bruning Questions: Memory Capping on SmartOS

January 04, 2013 - by Mr. Max Bruning

I have a question from Mike Zeller (@papertigerss), who says:

Hey Max,

I was playing around with some of the stuff from internals training. I noticed one of my zones at home was swapping against itself because of its cap. It seems like the mdb thread walk doesn't actually think those threads have swapped out in that zone. I was wondering if there was a way to get that info from the GZ or from within the zone itself.

Thanks, Mike

metroid == GZ
notch == zone

[root@metroid ~]# zonememstat 
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
                               global      128        -        -         -
 6fef98f8-9b12-46ed-9e2f-c3aabba0d7d9     1060     2048        0         0
 cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072     6349     49740


[root@notch ~]# zonememstat 
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
 cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072     6349     49740


[root@metroid ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs sd idm crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs sata ]
> ::walk thread a | ::print -t kthread_t t_schedflag | ::grep "(.&1)==0" | ::eval "<a::print kthread_t t_procp | ::print proc_t p_user.u_psargs"
>

Mike shows, using mdb -k, that there are no threads that have been swapped out. So the question is, what happens when the memory being used by a zone exceeds the cap?

To explain this, we'll create a zone with a small memory cap, then run an application which exceeds the cap to see what we can find out.

To follow along, you can download SmartOS and set up a zone by following http://wiki.smartos.org/display/DOC/How+to+create+a+zone+%28+OS+virtualized+machine+%29+in+SmartOS. Or you can provision a SmartMachine at http://joyent.com/. The steps I'm going to go through take less than an hour, so if you destroy the machine when you're done, the cost should be pennies. The smallest SmartMachine size works fine for this. I set up a SmartMachine on SmartOS with 256MB.

# vmadm list -v
UUID                                  TYPE  RAM      STATE             ALIAS
03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3  OS    256      running           -
fe0c65e1-0f61-4291-99e1-9d7a75dd091e  OS    32768    running           -
# zlogin 03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3
[Connected to zone '03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3' pts/5]
Last login: Fri Jan  4 15:32:29 on pts/2
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; SmartMachine (standard64 1.0.7)
                   `-'  http://wiki.joyent.com/jpc2/SmartMachine+Standard

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]#

In general, memory capping is done by a daemon. If the daemon detects that a zone has gone over its cap, it decides which pages to push out to free up some of the memory in the zone. In SmartOS, we also delay page faults (pages being brought into memory) for the zone; see the use of zone_pg_flt_delay in the as_fault() kernel routine. There is a long comment at the beginning of https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/zoneadmd/mcap.c which explains how things work. Maybe the most interesting part of the comment is at the end. You can turn on logging to see which processes are being examined and how much memory is being freed for each one. To do this, you need to be in the global zone.

# touch /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log

The mechanism used differs slightly between SmartOS and other Solaris-like systems. See http://smartos.org/2012/07/03/smartos-max-brunings-talk-at-nosig/ for a description of the mechanism on SmartOS versus other Solaris systems.

Here, we'll do something to go over the cap of 256MB.

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=64M &

(Note the above is done in the 256MB zone)

Let's see what zonememstat reports:

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# zonememstat
                                 ZONE  RSS(MB)  CAP(MB)    NOVER  POUT(MB)
 03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3      220      256        1       150

The zone has gone over its memory cap 1 time, and 150MB of memory were paged out.

We'll wait a few minutes, and take a look at the mcap_debug.log file (from the global zone).

# cat /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log
...  <-- output omitted
fast rss 265364KB
rss 242292KB, cap 262144KB, excess -19852KB  <-- plenty of room under the cap
sleep 30 seconds
fast rss 189868KB
rss 189868KB, cap 262144KB, excess -72276KB
sleep 120 seconds
phys-mcap-cmd: 
phys-mcap-no-vmusage: 0
phys-mcap-no-pageout: 0
current cap 262144KB lo 209715KB hi 235929KB
fast rss 398384KB
rss 374156KB, cap 262144KB, excess 112012KB  <-- over the cap by 112012KB
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd  <-- start scanning processes where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
pid 6377: nmap 32 sz 2136KB rss 1232KB /sbin/init
pid 6377: unp 0 att 748KB drss -44KB excess 124531KB
...
pid: 6320 system process, skipping zsched
...
pid 6391: nmap 90 sz 10660KB rss 3084KB /lib/svc/bin/svc.configd  <-- this is the last process (from ps -e)
pid 6391: unp 0 att 3700KB drss -320KB excess 121619KB
process pass done; excess 121619   <-- have scanned the rest of the processes from where we started
fast rss 395064KB
rss 372284KB, cap 262144KB, excess 110140KB  <-- still over the cap
starting to scan, excess 123247k  <-- start scanning from the beginning
pid 6701: nmap 25 sz 1932KB rss 1052KB /usr/lib/saf/ttymon -g -d /dev/console -l console -T vt100 -m ldterm,ttcompat -
pid 6701: unp 0 att 948KB drss 0KB excess 123247KB
pid 6469: nmap 58 sz 3912KB rss 1436KB /lib/inet/ipmgmtd
pid 6469: unp 0 att 2368KB drss 0KB excess 123247KB
pid 6686: nmap 17 sz 1632KB rss 992KB /usr/lib/utmpd
pid 6686: unp 0 att 336KB drss -28KB excess 123219KB
...
pid 7786: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M  <-- one of the dd's
pid 7786: unp 0 att 131184KB drss -131184KB excess -8149KB  <-- and we freed up most of the space
apparently under; excess -8149  <-- now under the cap
fast rss 394992KB
rss 371348KB, cap 262144KB, excess 109204KB  <-- but the dd is still running, so we're back over!
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd  <-- continue where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 122311KB
pid 6833: nmap 61 sz 113940KB rss 20004KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42356KB drss -576KB excess 121735KB
...
pid 7787: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M
pid 7787: unp 0 att 131184KB drss -131184KB excess -9761KB
apparently under; excess -9761  <-- and we're back under
fast rss 261940KB
rss 261940KB, cap 262144KB, excess -204KB  <-- still under, but the dd's are eating it up again
sleep 30 seconds  <-- but we were under, so sleep for a bit
fast rss 384872KB
rss 362560KB, cap 262144KB, excess 100416KB  <-- over again
pid 6714: nmap 22 sz 2120KB rss 1252KB /usr/lib/saf/sac -t 300
pid 6714: unp 0 att 576KB drss -24KB excess 113499KB
...

So, as long as the zone stays under its memory cap, a thread in zoneadmd for the zone periodically checks whether the zone has gone over. It first uses a "fast" algorithm, an approximation that may overestimate the memory actually in use (it may count pages shared between processes multiple times). If the fast algorithm shows the zone is over the cap, the thread uses a more expensive but more accurate algorithm to determine how much the zone is over its cap. If the zone really is over, the thread then starts scanning processes, looking for memory that can be freed. It starts the scan from where it last left off. The order in which it scans the processes is the same as the order shown by ps -e in the zone.
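
The loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow as seen in the log, not the code in mcap.c: the Proc class, the check_zone() function, and the pageable_kb field are all hypothetical names, and the sketch assumes a scan simply frees whatever pageable memory each process has.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    pid: int
    rss_kb: int       # resident memory held by the process
    pageable_kb: int  # hypothetical: how much of it a scan could page out


def check_zone(procs, cap_kb, fast_rss_kb, accurate_rss_kb, start=0):
    """One pass of the capping loop (illustrative only).

    Returns (outcome, index to resume scanning from next time).
    """
    # Stage 1: the cheap estimate errs on the high side, so a value at
    # or under the cap means there is nothing more to do.
    if fast_rss_kb <= cap_kb:
        return "sleep", start

    # Stage 2: the expensive, accurate measurement has the final say.
    excess_kb = accurate_rss_kb - cap_kb
    if excess_kb <= 0:
        return "sleep", start

    # Stage 3: scan processes, resuming where the last scan stopped.
    for i in range(start, len(procs)):
        freed = min(procs[i].pageable_kb, procs[i].rss_kb)
        procs[i].rss_kb -= freed
        procs[i].pageable_kb -= freed
        excess_kb -= freed
        if excess_kb <= 0:
            # Under the cap: remember the next process for later.
            return "apparently under", (i + 1) % len(procs)

    # Reached the end of the process list while still over the cap.
    return "process pass done", 0
```

Calling check_zone() with the log's values fast rss 265364KB, rss 242292KB, and cap 262144KB returns "sleep", which matches the first lines of the log: the fast estimate was over the cap, but the accurate one was not.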

Let's take a closer look at some of the lines of output above.

fast rss 265364KB

- The resident set size (physical memory used by the zone) calculated using the "fast" algorithm.

rss 242292KB, cap 262144KB, excess -19852KB
sleep 30 seconds

- The fast algorithm said it's over the cap, but the accurate algorithm says we are 19852KB under the cap. So we'll sleep 30 seconds and look again.

fast rss 398384KB
rss 374156KB, cap 262144KB, excess 112012KB

- Now, we're really over the cap. Time to scan processes.
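
As a quick sanity check, the excess printed on each "rss ..., cap ..., excess ..." line is just the accurate RSS minus the cap. The short Python sketch below redoes that arithmetic for three lines from the log; the function name excess_kb() is mine and is not something the daemon exposes.

```python
CAP_KB = 262144  # the 256MB cap, as printed in the log


def excess_kb(rss_kb, cap_kb=CAP_KB):
    """Positive means over the cap, negative means under it."""
    return rss_kb - cap_kb


# (accurate rss, excess as logged) pairs copied from the log above
samples = [(242292, -19852), (189868, -72276), (374156, 112012)]
for rss, logged in samples:
    assert excess_kb(rss) == logged

# How far the fast estimate overshot the accurate value in the first
# sample: 265364KB (fast) versus 242292KB (accurate).
fast_overshoot_kb = 265364 - 242292
```

In that first sample the fast estimate came out 23072KB higher than the accurate value, which was enough to push it past the cap and trigger the accurate measurement.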

pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
  • pid -> process id.
  • nmap -> number of segments in the address space of this process. See pmap(1).
  • sz -> Virtual address space size for the process.
  • rss -> Current amount of physical memory for the process.
  • The command line args for the process
  • unp -> unpageable mappings
  • att -> attempted amount of memory to free. Basically, it tries to free the segments of the address space.
  • drss -> delta RSS. The amount subtracted from the RSS (i.e., the amount freed)
  • excess -> current amount over (under if negative) the cap
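
If you want to pull these fields out of the log yourself, something like the following Python sketch works on the lines shown here. The regular expressions are written only against the sample output in this post, and parse_line() and the dictionary keys are names I made up, so treat it as a starting point rather than a complete parser for the log format.

```python
import re

# First line per process: address-space summary plus the command line.
MAP_RE = re.compile(r"pid (\d+): nmap (\d+) sz (\d+)KB rss (\d+)KB (.*)")
# Second line per process: result of the attempt to free memory.
SCAN_RE = re.compile(
    r"pid (\d+): unp (\d+) att (\d+)KB drss (-?\d+)KB excess (-?\d+)KB"
)


def parse_line(line):
    """Return a dict of fields for a per-process log line, else None."""
    m = MAP_RE.match(line)
    if m:
        pid, nmap, sz, rss = (int(g) for g in m.groups()[:4])
        return {"kind": "map", "pid": pid, "nmap": nmap,
                "sz_kb": sz, "rss_kb": rss, "args": m.group(5)}
    m = SCAN_RE.match(line)
    if m:
        pid, unp, att, drss, excess = (int(g) for g in m.groups())
        return {"kind": "scan", "pid": pid, "unp": unp, "att_kb": att,
                "drss_kb": drss, "excess_kb": excess}
    return None


first = parse_line("pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB")
second = parse_line("pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB")

# drss is negative when memory was freed, so adding it to the previous
# excess gives the next one: 125119 + (-544) = 124575.
assert first["excess_kb"] + second["drss_kb"] == second["excess_kb"]
```

One thing the numbers suggest: the scan starts with an excess of 125119KB even though the preceding rss line reported 112012KB. The difference works out to 374156 - 125119 = 249037KB, which is about 95% of the 262144KB cap, so the scan appears to aim a little below the cap. That is my inference from the arithmetic in this log, not something the log states.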

Mike sees no threads swapped out because the system itself has plenty of free memory. Swapping only occurs when the entire system is memory stressed, not just some zone that has gone over its cap. If you don't have access to the global zone, you won't be able to see the mcap_debug.log file. However, you might be able to use prstat to see the RSS values for processes in your zone adjusting downward as zoneadmd does its work.

The mechanism described above is specific to SmartOS. Other Solaris and Solaris-like OSes do this differently.

Submit your DTrace, MDB, or SmartOS questions to be answered in next week's column by emailing MrBruning@joyent.com or asking at #BruningQuestions, or attend one of my upcoming courses.

Have a healthy and safe New Year!
