January 04, 2013 - by Mr. Max Bruning
I have a question from Mike Zeller (@papertigerss), who says:
I was playing around with some of the stuff from internals training. I noticed one of my zones at home was swapping against itself because of its cap. It seems like the mdb thread walk doesn't actually think those threads have swapped out in that zone. I was wondering if there was a way to get that info from the GZ or from within the zone itself.
metroid == GZ
notch == zone

[root@metroid ~]# zonememstat
ZONE                                  RSS(MB)  CAP(MB)  NOVER  POUT(MB)
global                                    128        -      -         -
6fef98f8-9b12-46ed-9e2f-c3aabba0d7d9     1060     2048      0         0
cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072   6349     49740

[root@notch ~]# zonememstat
ZONE                                  RSS(MB)  CAP(MB)  NOVER  POUT(MB)
cdf60754-a789-40b5-8028-61b4553d4a78     2240     3072   6349     49740

[root@metroid ~]# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc pcplusmp scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs sd idm crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs sata ]
> ::walk thread a | ::print -t kthread_t t_schedflag | ::grep "(.&1)==0" | ::eval "<a::print kthread_t t_procp | ::print proc_t p_user.u_psargs"
>
Mike shows, using mdb -k, that there are no threads that have been swapped out. So the question is, what happens when the memory being used by a zone exceeds the cap?
To explain this, we'll create a zone with a small memory cap, then run an application which exceeds the cap to see what we can find out.
To follow along, you can download SmartOS and set up a zone by following http://wiki.smartos.org/display/DOC/How+to+create+a+zone+%28+OS+virtualized+machine+%29+in+SmartOS. Or you can provision a SmartMachine at http://joyent.com/. The steps I'm going to go through take less than an hour, so if you destroy the machine when you're done, the cost should be pennies. For this, the smallest SmartMachine size works fine. I set up a SmartMachine on SmartOS with 256MB.
# vmadm list -v
UUID                                  TYPE  RAM    STATE    ALIAS
03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3  OS    256    running  -
fe0c65e1-0f61-4291-99e1-9d7a75dd091e  OS    32768  running  -
# zlogin 03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3
[Connected to zone '03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3' pts/5]
Last login: Fri Jan 4 15:32:29 on pts/2
   __        .                   .
 _|  |_      | .-. .  . .-. :--. |-
|_    _|     ;|   ||  |(.-' |  | |
  |__|   `--'  `-' `;-| `-' '  ' `-'
                   /  ; SmartMachine (standard64 1.0.7)
                   `-'  http://wiki.joyent.com/jpc2/SmartMachine+Standard

[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]#
In general, memory capping is done by a daemon. If the daemon detects that a zone has gone over its cap, it chooses pages to push out to free up some of the memory in the zone. In SmartOS, we also delay page faults (pages being brought into memory) for the zone (see the use of zone_pg_flt_delay in the as_fault() kernel routine). There is a long comment at the beginning of https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/zoneadmd/mcap.c which explains how things work. Maybe the most interesting part of the comment is at the end. You can turn on logging to see which processes are being examined and how much memory is being freed for each one. To do this, you need to be in the global zone.
# touch /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log
The mechanism differs slightly between SmartOS and other Solaris-like systems. See http://smartos.org/2012/07/03/smartos-max-brunings-talk-at-nosig/ for a description of the mechanism on SmartOS versus other Solaris systems.
Here, we'll do something to go over the cap of 256MB.
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=128M &
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# dd if=/dev/zero of=/dev/null bs=64M &
(Note the above is done in the 256MB zone)
Let's see what zonememstat shows.
[root@03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 ~]# zonememstat
ZONE                                  RSS(MB)  CAP(MB)  NOVER  POUT(MB)
03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3      220      256      1       150
The zone has gone over its memory cap 1 time, and 150MB of memory were paged out.
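If you want to watch these counters from a script rather than by eye, the output splits cleanly into fields. Here is a minimal Python sketch, assuming the five-column layout shown above. The name parse_zonememstat is a hypothetical helper of mine, not something that ships with SmartOS.

```python
def parse_zonememstat(text):
    """Parse zonememstat output into a list of dicts, one per zone.

    Assumes the five-column layout shown above. A '-' (no value, as in
    the global zone's row) is returned as None.
    """
    def to_int(field):
        return None if field == "-" else int(field)

    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "ZONE":
            continue  # skip the header and anything unexpected
        zone, rss, cap, nover, pout = fields
        rows.append({
            "zone": zone,
            "rss_mb": to_int(rss),
            "cap_mb": to_int(cap),
            "nover": to_int(nover),
            "pout_mb": to_int(pout),
        })
    return rows


sample = """\
ZONE RSS(MB) CAP(MB) NOVER POUT(MB)
03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3 220 256 1 150
"""

zones = parse_zonememstat(sample)
for z in zones:
    if z["cap_mb"] is not None:
        headroom = z["cap_mb"] - z["rss_mb"]
        print(f"{z['zone']}: {headroom}MB of headroom, "
              f"over the cap {z['nover']} time(s), {z['pout_mb']}MB paged out")
```

Running the helper against successive snapshots and comparing the nover and pout_mb values is one way to tell whether a zone is still being pushed back under its cap.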
We'll wait a few minutes, and take a look at the mcap_debug.log file (from the global zone).
# cat /zones/03b1d3b0-7634-47c7-b6a4-c7e6d336c6b3/mcap_debug.log
... <-- output omitted
fast rss 265364KB
rss 242292KB, cap 262144KB, excess -19852KB <-- plenty of room under the cap
sleep 30 seconds
fast rss 189868KB
rss 189868KB, cap 262144KB, excess -72276KB
sleep 120 seconds
phys-mcap-cmd:
phys-mcap-no-vmusage: 0
phys-mcap-no-pageout: 0
current cap 262144KB lo 209715KB hi 235929KB
fast rss 398384KB
rss 374156KB, cap 262144KB, excess 112012KB <-- over the cap by 112012KB
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd <-- start scanning processes where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
pid 6377: nmap 32 sz 2136KB rss 1232KB /sbin/init
pid 6377: unp 0 att 748KB drss -44KB excess 124531KB
...
pid: 6320 system process, skipping zsched
...
pid 6391: nmap 90 sz 10660KB rss 3084KB /lib/svc/bin/svc.configd <-- this is the last process (from ps -e)
pid 6391: unp 0 att 3700KB drss -320KB excess 121619KB
process pass done; excess 121619 <-- have scanned the rest of the processes from where we started
fast rss 395064KB
rss 372284KB, cap 262144KB, excess 110140KB <-- still over the cap
starting to scan, excess 123247k <-- start scanning from the beginning
pid 6701: nmap 25 sz 1932KB rss 1052KB /usr/lib/saf/ttymon -g -d /dev/console -l console -T vt100 -m ldterm,ttcompat -
pid 6701: unp 0 att 948KB drss 0KB excess 123247KB
pid 6469: nmap 58 sz 3912KB rss 1436KB /lib/inet/ipmgmtd
pid 6469: unp 0 att 2368KB drss 0KB excess 123247KB
pid 6686: nmap 17 sz 1632KB rss 992KB /usr/lib/utmpd
pid 6686: unp 0 att 336KB drss -28KB excess 123219KB
...
pid 7786: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M <-- one of the dd's
pid 7786: unp 0 att 131184KB drss -131184KB excess -8149KB <-- and we freed up most of the space
apparently under; excess -8149 <-- now under the cap
fast rss 394992KB
rss 371348KB, cap 262144KB, excess 109204KB <-- but the dd is still running, so we're back over!
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd <-- continue where we left off
pid 6735: unp 0 att 2552KB drss 0KB excess 122311KB
pid 6833: nmap 61 sz 113940KB rss 20004KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42356KB drss -576KB excess 121735KB
...
pid 7787: nmap 32 sz 134500KB rss 132584KB dd if=/dev/zero of=/dev/null bs=128M
pid 7787: unp 0 att 131184KB drss -131184KB excess -9761KB
apparently under; excess -9761 <-- and we're back under
fast rss 261940KB
rss 261940KB, cap 262144KB, excess -204KB <-- still under, but the dd's are eating it up again
sleep 30 seconds <-- but we were under, so sleep for a bit
fast rss 384872KB
rss 362560KB, cap 262144KB, excess 100416KB <-- over again
pid 6714: nmap 22 sz 2120KB rss 1252KB /usr/lib/saf/sac -t 300
pid 6714: unp 0 att 576KB drss -24KB excess 113499KB
...
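On a busy zone this log gets long, so it can help to total up which processes gave memory back. The Python sketch below is only an illustration: it reads drss as the change in a process's RSS after it was scanned, which is how the annotations above treat it, and reclaimed_by_pid is a hypothetical helper name. The sample lines are copied from the log above.

```python
import re

# Matches the second line logged for each scanned process, for example:
#   pid 7786: unp 0 att 131184KB drss -131184KB excess -8149KB
SCAN_LINE = re.compile(
    r"pid (\d+): unp \d+(?:KB)? att \d+KB drss (-?\d+)KB excess -?\d+KB"
)


def reclaimed_by_pid(log_text):
    """Return {pid: KB by which its RSS dropped}, summed over every scan."""
    totals = {}
    for pid, drss in SCAN_LINE.findall(log_text):
        # drss is negative when RSS went down, so subtract it.
        totals[int(pid)] = totals.get(int(pid), 0) - int(drss)
    return totals


sample = """\
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
pid 7786: unp 0 att 131184KB drss -131184KB excess -8149KB
pid 6833: unp 0 att 42356KB drss -576KB excess 121735KB
"""

totals = reclaimed_by_pid(sample)
for pid, kb in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"pid {pid}: {kb}KB")
```

With these four lines, almost everything comes from pid 7786, one of the dd processes, which matches what the annotations in the log point out.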
So, while the zone stays under its memory cap, a thread in zoneadmd for the zone periodically checks whether the zone has gone over the cap. It uses a "fast" algorithm, an approximation that may overstate the memory actually used (it may count pages shared between processes multiple times). If the fast algorithm shows the zone is over the cap, the thread uses a more expensive but more accurate algorithm to determine how far over the cap the zone is. It then starts scanning processes looking for memory that can be freed, picking up where the last scan left off. Processes are scanned in the same order as they appear in the output of ps -e in the zone.
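The control flow described above can be sketched as a toy model in Python. This is not the code in mcap.c, just an illustration under assumptions of mine: each process is a dict with a made-up pageable_kb field saying how much of its RSS could be pushed out, overcount_kb stands in for the shared pages the fast estimate counts more than once, and check_cap is a hypothetical name.

```python
def check_cap(procs, cap_kb, overcount_kb, resume_at):
    """Run one capping check over a toy zone.

    procs        -- list of dicts with 'pid', 'rss_kb', 'pageable_kb'
    overcount_kb -- shared pages the fast estimate counts more than once
    resume_at    -- index in procs where the previous scan stopped
    Returns (action, resume_at), where action is 'sleep' or 'scanned'.
    """
    fast_rss = sum(p["rss_kb"] for p in procs)   # cheap, may overestimate
    if fast_rss <= cap_kb:
        return "sleep", resume_at

    excess = (fast_rss - overcount_kb) - cap_kb  # expensive, accurate
    if excess <= 0:
        return "sleep", resume_at

    # Scan from where we left off, wrapping around, until we are under
    # the cap or every process has been visited once.
    visited = 0
    while excess > 0 and visited < len(procs):
        p = procs[resume_at]
        freed = min(p["pageable_kb"], p["rss_kb"])
        p["rss_kb"] -= freed
        p["pageable_kb"] -= freed
        excess -= freed
        resume_at = (resume_at + 1) % len(procs)
        visited += 1
    return "scanned", resume_at


procs = [
    {"pid": 6735, "rss_kb": 1152, "pageable_kb": 0},
    {"pid": 6833, "rss_kb": 21092, "pageable_kb": 544},
    {"pid": 7786, "rss_kb": 132584, "pageable_kb": 131184},
    {"pid": 7787, "rss_kb": 132584, "pageable_kb": 131184},
]

first = check_cap(procs, cap_kb=262144, overcount_kb=0, resume_at=0)
second = check_cap(procs, cap_kb=262144, overcount_kb=0, resume_at=first[1])
print(first, second)
```

In this model the first check pages out the first dd and stops, remembering that the next scan should begin at the second dd. The second check finds the zone under the cap and goes back to sleep, so the scan position is carried over unchanged.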
Let's take a closer look at some of the lines of output above.
fast rss 265364KB
- The resident set size (physical memory used by the zone) calculated using the "fast" algorithm.
rss 242292KB, cap 262144KB, excess -19852KB
sleep 30 seconds
- The fast algorithm said it's over the cap, but the accurate algorithm says we are 19852KB under the cap. So we'll sleep 30 seconds and look again.
fast rss 398384KB
rss 374156KB, cap 262144KB, excess 112012KB
- Now, we're really over the cap. Time to scan processes.
pid 6735: nmap 39 sz 4020KB rss 1152KB /usr/lib/ssh/sshd
pid 6735: unp 0 att 2552KB drss 0KB excess 125119KB
pid 6833: nmap 62 sz 113940KB rss 21092KB /opt/local/sbin/mysqld --user=mysql --basedir=/opt/local --datadir=/var/mysql -
pid 6833: unp 0 att 42032KB drss -544KB excess 124575KB
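The numbers in these lines can be checked with a little arithmetic. The Python snippet below uses only values copied from the log. The percentages are what this particular log is consistent with, not something I am quoting from mcap.c, so treat them as an observation; the comment in mcap.c is the authority. The names excess_kb and SCAN_MARGIN_KB are mine.

```python
CAP_KB = 262144  # 256MB


def excess_kb(rss_kb, cap_kb=CAP_KB):
    """Excess as logged on the 'rss ..., cap ..., excess ...' lines."""
    return rss_kb - cap_kb


# The accurate rss minus the cap gives the logged excess.
print(excess_kb(242292))   # the "plenty of room under the cap" line
print(excess_kb(374156))   # the "over the cap" line

# "lo" and "hi" in the log line up with 80% and 90% of the cap.
lo_kb = CAP_KB * 8 // 10
hi_kb = CAP_KB * 9 // 10

# In this log, the excess at the start of a scan is larger than the
# excess just reported by about 5% of the cap (an observation only).
SCAN_MARGIN_KB = CAP_KB * 5 // 100
scan_start = excess_kb(374156) + SCAN_MARGIN_KB

# During the scan, each process's drss is added to the running excess.
running = scan_start
for drss in (0, -544):      # pid 6735, then pid 6833
    running += drss
print(lo_kb, hi_kb, scan_start, running)
```

So the excess figure on the second line for each pid is a running total that shrinks as each process gives memory back, and the scan stops once it goes negative.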
Mike sees no threads swapped out because the system itself has plenty of free memory. Swapping only occurs when the entire system is under memory pressure, not when a single zone has gone over its cap. If you don't have access to the global zone, you won't be able to see the mcap_debug.log file. However, you might be able to use prstat to see the RSS values for processes in your zone adjusting downward as zoneadmd does its work.
The mechanism described above is specific to SmartOS. Other Solaris and Solaris-like OSes do this differently.
Have a healthy and safe New Year!