Memory Capping on SmartOS

Marco Spadoni of Libero asks a couple of questions.

"I have a SmartVM that goes in overcap, and consequently I can see some MB that have been paged out. The question is why I do not find any trace of this in vmstat output?"

And:

"could you please, at your best convenience, also explain why, even if in the swap command man page it is stated that the "-l" option does not include physical memory (whilst the "-s" option does), it seems that each invocation returns the physical memory contained…"

A "SmartVM" is a virtualized OS instance running on SmartOS (i.e., a zone). The short answer to the first question is that you won't see anything regarding pages of an individual zone being paged out in vmstat(1M) output unless the entire machine is memory stressed. And even then, vmstat will only show paging that is being done by the pageout() daemon (see vm_pageout.c for the code and a long description of how paging works on SmartOS). I wrote a bit about how memory capping works on SmartOS here.

Marco includes the following output from kstat(1M) showing the zone in question:

    # kstat memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5
    module: memory_cap                      instance: 6
    name:   03e65382-7f67-4791-a0e2-0c5fc5  class:    zone_memory_cap
            anon_alloc_fail                 0
            anonpgin                        90678
            crtime                          416.87107576
            execpgin                        97
            fspgin                          11757
            n_pf_throttle                   55308
            n_pf_throttle_usec              7926500
            nover                           14
            pagedout                        2806816768
            pgpgin                          102532
            physcap                         2147483648
            rss                             2091122688
            snaptime                        8501413.44077987
            swap                            3413041152
            swapcap                         4294967296
            zonename                        03e65382-7f67-4791-a0e2-0c5fc5cdb8d6
    #

Note that the zone has gone over its memory cap 14 times (nover), and has paged out a total of 2806816768 bytes (pagedout). Some time later, it has gone over the cap 4 more times:

    # kstat memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5
    module: memory_cap                      instance: 6
    name:   03e65382-7f67-4791-a0e2-0c5fc5  class:    zone_memory_cap
            anon_alloc_fail                 0
            anonpgin                        109303
            crtime                          416.87107576
            execpgin                        97
            fspgin                          11757
            n_pf_throttle                   310670
            n_pf_throttle_usec              52254000
            nover                           18
            pagedout                        3218251776
            pgpgin                          121157
            physcap                         2147483648
            rss                             2116661248
            snaptime                        8506139.7862525
            swap                            3434885120
            swapcap                         4294967296
            zonename                        03e65382-7f67-4791-a0e2-0c5fc5cdb8d6
    #
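The two snapshots can be compared counter by counter. The following is a minimal sketch of that arithmetic, using values copied from the kstat output above; the helper name snapshot_delta is made up for illustration, and snaptime is taken to be in seconds.

```python
# Counter values copied from the two kstat snapshots shown above.
first = {
    "nover": 14,
    "pagedout": 2806816768,
    "snaptime": 8501413.44077987,
}
second = {
    "nover": 18,
    "pagedout": 3218251776,
    "snaptime": 8506139.7862525,
}

MIB = 1024 * 1024


def snapshot_delta(before, after):
    """Return after - before for every counter present in both snapshots.

    Hypothetical helper for illustration; it is not part of any SmartOS tool.
    """
    return {key: after[key] - before[key] for key in before if key in after}


delta = snapshot_delta(first, second)
print("times over cap between snapshots:", delta["nover"])
print("bytes paged out between snapshots:", delta["pagedout"])
print("MiB paged out between snapshots: %.3f" % (delta["pagedout"] / MIB))
print("seconds between snapshots: %.0f" % delta["snaptime"])
```

All of this activity is per-zone accounting kept in the memory_cap kstat, which is why it has to be read from there.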

The difference in pagedout is ~392MB. Marco then shows the following vmstat(1M) output:

    # vmstat -p
         memory           page          executable      anonymous      filesystem
       swap  free  re  mf  fr  de  sr  epi  epo  epf  api  apo  apf  fpi  fpo  fpf
     103635632 5777488 417 3117 0 0 0    0    0    0    0    0    0    0    0    0
    # vmstat -S
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  si  so pi po fr de sr rm s0 -- --   in   sy   cs us sy id
     0 0 0 103635632 5777484 0 0 0  0  0  0  0 -1159 55 0 0 3544 13732 1904 3 1 96
    # vmstat
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr rm s0 -- --   in   sy   cs us sy id
     0 0 0 103635632 5777484 417 3117 0 0 0 0 0 -1159 55 0 0 354413732 1904 3 1 96

Marco says he does not find anything about the zone that went over its memory cap in the vmstat(1M) output. However, as with most of the *stat commands, the first line of output is an average since boot, and without an interval that is the only line printed. To get multiple lines of output, you need to specify an interval in seconds, for instance:

    # vmstat 2
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr lf rm s0 s1   in   sy   cs us sy id
     0 0 0 2532416 332324 4  32  3  0  0  0  2  0 -139 0 537 319 174737 338 22 4 74
     1 0 0 2406184 199776 4  25 10  0  0  0  0 97  3  1  0  328  176  440 99  1  0
     1 0 0 2406104 199696 0   1  0  0  0  0  0  0  0  0  0  304  156  220 100 0  0
    ...

This gives output every 2 seconds. The first line (after the header) is the average since boot. Each subsequent line covers the previous 2 seconds.

But even when specifying an interval, it is very likely that no evidence of the zone that is going over its memory cap would show up in this output. Unless the system as a whole is running short of free memory, vmstat will not show paging activity. Memory capping for a zone is done by a thread in zoneadmd. This thread uses the memcntl(2) system call to page out pages belonging to processes within the over-capped zone. It is better to use zonememstat(1M) to see per-zone memory usage.
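The memory_cap kstat that Marco used is another per-zone view, and kstat(1M) has a parseable mode (kstat -p) that prints one module:instance:name:statistic and value pair per line. The sketch below parses text in that form and computes how far the zone's rss is below its physcap. The function name parse_kstat_p is made up for illustration, and the sample lines are reconstructed from the first snapshot above rather than captured from a live system.

```python
def parse_kstat_p(text):
    """Parse `kstat -p` style text into a {statistic: value} dict.

    Each non-empty line is expected to look like
    "module:instance:name:statistic<whitespace>value".
    Hypothetical helper; values made only of digits become ints,
    everything else is kept as a string.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        key, value = fields
        statistic = key.rsplit(":", 1)[-1]
        value = value.strip()
        stats[statistic] = int(value) if value.isdigit() else value
    return stats


# Sample input rebuilt from the first kstat snapshot (an assumption, not live output).
sample = "\n".join([
    "memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5:nover\t14",
    "memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5:physcap\t2147483648",
    "memory_cap:6:03e65382-7f67-4791-a0e2-0c5fc5:rss\t2091122688",
])

stats = parse_kstat_p(sample)
headroom = stats["physcap"] - stats["rss"]
print("times over cap:", stats["nover"])
print("bytes of headroom below physcap:", headroom)
```

On a live system the same text could come from running kstat -p memory_cap in the zone or the global zone and feeding its output to the parser.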

As to Marco's second question, regarding swap space, let's take a look at the output of swap -l in the global zone, then in a non-global zone. (This was not done on the machine Marco was using, so the sizes differ from his.)

    # swap -lh
    swapfile             dev    swaplo   blocks     free
    /dev/zvol/dsk/zones/swap 90,1        4K     2.0G     2.0G
    #

In the global zone, there is 2.0GB of swap space on 1 swap device. In a non-global zone:

    # swap -lh
    swapfile             dev    swaplo   blocks     free
    swap                  -         4K     512M     491M
    #

And the amount of memory in this zone is:

    # prtconf | head
    prtconf: devinfo facility not available
    System Configuration:  Joyent  i86pc
    Memory size: 512 Megabytes
    System Peripherals (Software Nodes):
    #

So, the amount of swap space in the non-global zone is equal to the size of the physical memory of that zone. But according to swap(1M) for the "-l" option:

    "The list does not include swap space in the form of physical
    memory because this space is not associated with a particular
    swap area."

The output in the non-global zone makes it look as though the amount of swap space for the zone is equal to the memory cap on the zone. To understand what is going on, we'll look at the source code for the swap(1M) command, which is in swap.c. For the "-l" option, the list() function is called (line 366 in swap.c). This function calls the swapctl(2) system call twice: first to get the number of swap devices/files, and then to get the sizes of those devices. The code for swapctl(2) is in vm_swap.c. In that file, a comment for the swapctl(2) call that retrieves the number of swap devices/files on the system says:

    /*
     * When running in a zone we want to hide the details of the swap
     * devices: we report there only being one swap device named "swap"
     * having a size equal to the sum of the sizes of all real swap devices
     * on the system.
     */

So, in a non-global zone, the swap(1M) command reports only 1 swap device, regardless of the number of swap devices/files configured on the system. In the vm_swap.c file, starting at line 605, is the following (note that line numbers may change over time):

    if (zp->zone_max_swap_ctl != UINT64_MAX) {
        rctl_qty_t cap, used;

        mutex_enter(&zp->zone_mem_lock);
        cap = zp->zone_max_swap_ctl;
        used = zp->zone_max_swap;
        mutex_exit(&zp->zone_mem_lock);

        st.ste_length = MIN(cap, st.ste_length);
        st.ste_pages = MIN(btop(cap), st.ste_pages);
        st.ste_free = MIN(st.ste_pages - btop(used),
            st.ste_free);
    }

This is part of the code that retrieves the size of the swap device. If zone_max_swap_ctl is UINT64_MAX, the size comes from the data structures that the kernel uses to manage swap space. If zone_max_swap_ctl is not equal to UINT64_MAX, the reported size is limited by zone_max_swap_ctl: the code takes the smaller of the cap and the real size. In the case of a non-global zone with a memory cap (as in the case Marco is asking about), zone_max_swap_ctl is not equal to UINT64_MAX, as the following output from DTrace shows. (You didn't think I would go through a technical post without using DTrace, did you?)

    # dtrace -n 'swapctl:entry/execname == "swap" \
      && stringof(((proc_t *)curpsinfo->pr_addr)->p_zone->zone_name) != "global"/ \
      {printf("zone_max_swap_ctl = %d\n", \
      ((proc_t *)curpsinfo->pr_addr)->p_zone->zone_max_swap_ctl);\
      exit(0);}'
    dtrace: description 'swapctl:entry' matched 1 probe
    CPU     ID                    FUNCTION:NAME
      0  20432                    swapctl:entry zone_max_swap_ctl = 536870912
    #

The value of UINT64_MAX (18446744073709551615) is much larger than 536870912 (512MB), so the swap cap for the zone is reported. The zone's configuration confirms the 512MB swap cap:

    # zonecfg -z 003f53ff-600b-44be-bae3-ca3f84aa5a8a info capped-memory
    capped-memory:
            [physical: 512M]
            [swap: 512M]
            [locked: 512M]
    #
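The clamping in the vm_swap.c fragment can be modeled in a few lines. This is only a sketch of the page-count arithmetic (ste_pages and ste_free), not the kernel code. It assumes 4K pages, the function name zone_swap_view is made up, and the "used" figure of 21MB is a hypothetical value chosen so that the result matches the 491M free shown by swap -lh in the non-global zone.

```python
UINT64_MAX = 2**64 - 1
PAGESIZE = 4096  # assumption: 4K pages


def btop(nbytes, pagesize=PAGESIZE):
    """Convert bytes to whole pages, truncating (models the btop() macro)."""
    return nbytes // pagesize


def zone_swap_view(cap, used, ste_pages, ste_free):
    """Model what a zone is shown for swap size and free space, in pages.

    cap       -- the zone's swap cap in bytes (zone_max_swap_ctl)
    used      -- swap the zone has reserved, in bytes (zone_max_swap)
    ste_pages -- total pages across the real swap devices
    ste_free  -- free pages across the real swap devices
    Hypothetical helper that mirrors the MIN() logic in the fragment above.
    """
    if cap != UINT64_MAX:
        ste_pages = min(btop(cap), ste_pages)
        ste_free = min(ste_pages - btop(used), ste_free)
    return ste_pages, ste_free


MIB = 1024 * 1024
real_pages = btop(2 * 1024 * MIB)   # the 2.0G swap device seen in the global zone
cap = 536870912                     # value printed by the DTrace one-liner
used = 21 * MIB                     # hypothetical, chosen to match 491M free

pages, free = zone_swap_view(cap, used, real_pages, real_pages)
print("reported size in MiB:", pages * PAGESIZE // MIB)
print("reported free in MiB:", free * PAGESIZE // MIB)
```

With no swap cap set (cap equal to UINT64_MAX), the function returns the real totals unchanged, which corresponds to the first case described above.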

I wish to thank Marco for the questions. Maybe sometime soon I'll take a look at the output of swap -sh.



Post written by Mr. Max Bruning