Kernel Memory Allocator 0xdeadbeef Checking

July 12, 2013 - by Mr. Max Bruning

Jason Banham of Nexenta asked me if I could go through an example of using kmem_flags to debug kernel memory allocator problems. kmem_flags is defined in kmem_impl.h. Documentation exists in Debugging With the Kernel Memory Allocator (in the Modular Debugger Guide), and in Section 11.4 of "Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)", McDougall and Mauro, ISBN-13: 978-0131482098.

We'll start by looking at a panic dump that does not have kmem_flags set. Note that I wrote a little driver that causes the panic "on purpose" in order to have an example that I can use in this blog. I do this because, of course, SmartOS itself doesn't have any bugs... Code is available here. I'm doing this in the global zone on SmartOS.

# cd /var/crash/volatile
# ls
bounds    METRICS.csv   vmdump.1
# savecore -f vmdump.1
savecore: System dump time: Wed Jul 10 23:36:02 2013

savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.1
Constructing namelist /var/crash/volatile/unix.1
Constructing corefile /var/crash/volatile/vmcore.1
 0:03 100% done: 88882 of 88882 pages saved
#
# mdb 1
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]
> ::status
debugging crash dump vmcore.1 (64-bit) from 00-0c-29-3d-b0-99
operating system: 5.11 joyent_20130529T211010Z (i86pc)
image uuid: (not set)
panic message: 
BAD TRAP: type=d (#gp General protection) rp=ffffff00055689f0 addr=ffffff0005568
b18
dump content: kernel pages only
>

So, the panic was caused by a general protection fault. If this were a user-level process, a general protection fault would show up as either a segmentation violation or a bus error. Let's take a look at the location where the panic occurred by examining the instruction pointer register.

>  <rip/i 
list_remove+0x1b:        movq  %rdx,(%rax)
>

This instruction is moving the contents of the rdx register into the location pointed to by the rax register. Let's see where rax points.

> <rax=K
        3f3f3f3f3f3f3f3f
>

That does not look like an address.

> <rax/K
mdb: failed to read data from target: no mapping for address
0x3f3f3f3f3f3f3f3f:
>

So, the list_remove() function is trying to write to an invalid address. Note that list_remove() is a documented function (see list_remove(9F)). While it is possible that the function itself has a bug, that is highly unlikely. More likely, the caller is passing in an invalid argument. In fact, the bad address looks like ASCII.

> 3f3f3f3f3f3f3f3f=s
????????
>

The "address" is actually a set of question marks!? Looking at the call stack shows:

> $c
list_remove+0x1b(ffffff014f753d08, ffffff014ee6f840)
zil_itxg_clean+0x97(ffffff01515f4a10)
taskq_thread+0x2d0(ffffff014a887e80)
thread_start+8()
>

The zil_itxg_clean() function is part of the ZFS code. It's defined here. Let's take a closer look at the list_remove() function.

> list_remove::dis
list_remove:          pushq %rbp
list_remove+1:         movq  %rsp,%rbp
list_remove+4:         subq  $0x10,%rsp
list_remove+8:         movq  %rsi,-0x10(%rbp)
list_remove+0xc:        addq  0x8(%rdi),%rsi
list_remove+0x10:        movq  %rdi,-0x8(%rbp)
list_remove+0x14:        movq  0x8(%rsi),%rax
list_remove+0x18:        movq  (%rsi),%rdx
list_remove+0x1b:        movq  %rdx,(%rax)
list_remove+0x1e:        movq  (%rsi),%rax
list_remove+0x21:        movq  0x8(%rsi),%rdx
list_remove+0x25:        movq  %rdx,0x8(%rax)
list_remove+0x29:        movq  $0x0,0x8(%rsi)
list_remove+0x31:        movq  $0x0,(%rsi)
list_remove+0x38:        addq  $0x10,%rsp
list_remove+0x3c:        leave 
list_remove+0x3d:        ret  
>

For those unfamiliar with 64-bit x86 assembler, the first six integer arguments to a function are passed in registers: the caller puts the first argument in the rdi register, the second in the rsi register, and any further arguments in rdx, rcx, r8, and r9. If there are more than six arguments, the rest are pushed on the stack. (A complete description of the calling convention is beyond the scope of this blog post.) The list_remove() function takes two arguments.
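To make the register mapping concrete, here is a toy function (sum8 is invented for illustration and is not part of any system code) annotated with where each argument arrives under the System V AMD64 ABI:

```c
/*
 * Illustration only: on 64-bit x86 (System V ABI), the first six integer
 * arguments arrive in rdi, rsi, rdx, rcx, r8, and r9; the seventh and
 * beyond are passed on the stack.
 */
long
sum8(long a1 /* rdi */, long a2 /* rsi */, long a3 /* rdx */,
    long a4 /* rcx */, long a5 /* r8 */, long a6 /* r9 */,
    long a7 /* stack */, long a8 /* stack */)
{
	return (a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8);
}
```

So for list_remove(list, object), rdi holds list and rsi holds object on entry.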

Recall that the panic occurs at list_remove+0x1b, when moving data into the location specified by the rax register. From the code above, the rax register is loaded at list_remove+0x14 from 8 bytes past the location in the rsi register. Before that, at list_remove+0xc, the value 8 bytes into the structure pointed to by rdi is added to rsi. The rdi register contains the first argument to list_remove(), which should be a list_t *, and 8 bytes into a list_t is its list_offset field. Let's look at the list_t and the rsi register.

> <rdi::print -a -t list_t
ffffff014f753d08 list_t {
  ffffff014f753d08 size_t list_size = 0x50
  ffffff014f753d10 size_t list_offset = 0
  ffffff014f753d18 struct list_node list_head = {
    ffffff014f753d18 struct list_node *list_next = 0xffffff014ee6f840
    ffffff014f753d20 struct list_node *list_prev = 0xffffff014ee70300
  }
}
> <rsi=K
        ffffff014ee6f840
>
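Putting the disassembly and the structures together, the unlink that list_remove() performs can be sketched in user-level C (my_list_remove is a simplified stand-in for illustration; the real implementation lives in the kernel's list module):

```c
#include <stddef.h>

typedef struct list_node {
	struct list_node *list_next;	/* offset 0 within the node */
	struct list_node *list_prev;	/* offset 8 within the node */
} list_node_t;

typedef struct list {
	size_t		list_size;	/* size of each element */
	size_t		list_offset;	/* offset of the node in each element */
	list_node_t	list_head;
} list_t;

/* simplified stand-in for list_remove(9F) */
void
my_list_remove(list_t *list, void *object)
{
	/* list_remove+0xc: node = object + list->list_offset */
	list_node_t *node = (list_node_t *)((char *)object +
	    list->list_offset);

	/*
	 * list_remove+0x14..+0x1b: prev->next = next.  This is the faulting
	 * store in the dump, since node->list_prev was 0x3f3f3f3f3f3f3f3f.
	 */
	node->list_prev->list_next = node->list_next;
	/* list_remove+0x1e..+0x25: next->prev = prev */
	node->list_next->list_prev = node->list_prev;
	/* list_remove+0x29..+0x31: clear the removed node's pointers */
	node->list_prev = NULL;
	node->list_next = NULL;
}
```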

So, the list_remove() routine adds list_offset (the offset within each element of its embedded list node; here it is 0, so the node sits at the start of each element) to the element pointer, and uses the resulting node to unlink the element from the list. Let's take a look at the element pointed to by rsi. From the code in zil_itxg_clean(), this should be an itx_t *.

> <rsi::print -a -t itx_t
ffffff014ee6f840 itx_t {
  ffffff014ee6f840 list_node_t itx_node = {
    ffffff014ee6f840 struct list_node *list_next = 0x3f3f3f3f3f3f3f3f
    ffffff014ee6f848 struct list_node *list_prev = 0x3f3f3f3f3f3f3f3f
  }
  ffffff014ee6f850 void *itx_private = 0x3f3f3f3f3f3f3f3f
  ffffff014ee6f858 itx_wr_state_t itx_wr_state = 0t1061109567 (???)
  ffffff014ee6f85c uint8_t itx_sync = 0x3f
  ffffff014ee6f860 uint64_t itx_sod = 0x3f3f3f3f3f3f3f3f
  ffffff014ee6f868 uint64_t itx_oid = 0x3f3f3f3f3f3f3f3f
  ffffff014ee6f870 lr_t itx_lr = {
    ffffff014ee6f870 uint64_t lrc_txtype = 0x3f3f3f3f3f3f3f3f
    ffffff014ee6f878 uint64_t lrc_reclen = 0x3f3f3f3f3f3f3f3f
    ffffff014ee6f880 uint64_t lrc_txg = 0x3f3f3f3f3f3f3f3f
    ffffff014ee6f888 uint64_t lrc_seq = 0x3f3f3f3f3f3f3f3f
  }
}
>

So, it looks like the element on the list (an itx_t *) points at data that contains a lot of question marks (0x3f is the ASCII code for '?'). In other words, the list has been corrupted.

Rather than spend a lot of time trying to figure out how this could have happened, we'll reboot the system with kmem_flags set to 0xf. This turns on some debugging features in the kernel memory allocator: specifically, "deadbeef" checking, "redzone" checking, and "auditing". Note that some documentation says to set kmem_flags to 0x1f. Doing that on SmartOS gives a warning message that the value is not supported, and I found the resulting dumps to be much more difficult to analyze. (In fact, the resulting panics did not seem to have anything to do with the checks that deadbeef and redzone are meant to perform.)

  • Deadbeef checking causes the kernel memory allocator to fill kmem_free(9F)'d data with the hex pattern 0xdeadbeef. When data is allocated from a free list, the allocator checks that it is still all 0xdeadbeef. If not, it panics the system. This is useful for finding memory that is used after it is freed.
  • Redzone checking tells the memory allocator to add the hex pattern 0xfeedface for 64 bytes at the end of any dynamically allocated memory. When the allocated memory is freed, the memory allocator checks to see if the 0xfeedface pattern is still there. If not, it causes the system to panic. This is useful in finding situations where code has written past the end of the dynamically allocated memory.
  • Auditing will save a stack trace for each kernel memory allocation and free. We'll see how that can be used shortly.
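The deadbeef mechanism is easy to model at user level. In this sketch, toy_free and toy_alloc_verify are invented names (the real checks live in the kernel's kmem code): freeing fills the buffer with the pattern, and a later allocation verifies that the pattern is intact.

```c
#include <stddef.h>
#include <stdint.h>

#define	KMEM_FREE_PATTERN	0xdeadbeefdeadbeefULL

/* on free: fill the buffer with the pattern instead of leaving stale data */
static void
toy_free(uint64_t *buf, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		buf[i] = KMEM_FREE_PATTERN;
}

/*
 * on allocation from the free list: verify the pattern is intact.
 * Returns the byte offset of the first modified word, or -1 if clean.
 */
static long
toy_alloc_verify(const uint64_t *buf, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++)
		if (buf[i] != KMEM_FREE_PATTERN)
			return ((long)(i * sizeof (uint64_t)));
	return (-1);
}
```

Scribbling '?' bytes over a "freed" buffer, as the bug in this post does, is caught at the next allocation, with the modified offset reported much as in the ::msgbuf output we'll see below.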

There are a few ways to turn on the kmem_flags variable. We could set it in /etc/system, but since SmartOS uses a ram disk for the root file system, it will not be persistent across boots. So, we'll do it at boot time. One of the choices from grub at boot time is:

Live 64-bit (text) +kmdb

When the system is coming up, we'll choose that option for booting. On the console, you'll see:

Loading kmdb...

Welcome to kmdb
Loaded modules: [ unix krtld genunix ]
[0]>

Here, we'll enter:

[0]> kmem_flags/W 0xf
kmem_flags:   0      =     0xf
[0]> :c

The system will print a warning about the overhead of kmem_flags, but it should come up (unless it runs into a problem with dynamically allocated memory on the way up).

Once the system is up, we'll try to reproduce the problem. After about 30 seconds (in this case), the system panics. (Without the kmem_flags setting, the system could run for a long time without hitting the problem). When the system panics, because kmdb has been loaded, it will drop into kmdb(1). We'll just have it continue which will give us a crash dump. On the console:

kmdb: target stopped at:
kmdb_enter+0xb: movq %rax,%rdi
[0]> :c
syncing file systems...

And the system should create a dump and reboot. (Note that I had to do this twice; the first time, the system panicked again while creating the dump.)

# cd /var/crash/volatile
# ls
bounds    METRICS.csv unix.1    vmcore.1   vmdump.1 vmdump.2
#
# savecore -f vmdump.2
savecore: System dump time: Thu Jul 11 22:01:09 2013

savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.2
Constructing namelist /var/crash/volatile/unix.2
Constructing corefile /var/crash/volatile/vmcore.2
 0:02 100% done: 164627 of 164627 pages saved
#
# mdb 2
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs lofs idm mpt crypto random sd cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]
> 

Let's see how it panicked this time.

> ::status
debugging crash dump vmcore.2 (64-bit) from 00-0c-29-3d-b0-99
operating system: 5.11 joyent_20130529T211010Z (i86pc)
image uuid: (not set)
panic message: kernel heap corruption detected
dump content: kernel pages only
>

The panic message "kernel heap corruption detected" is typically the result of deadbeef checking. When the kernel panics because of the kmem_flags settings, information is sent to the console. Let's retrieve that information.

> ::msgbuf
MESSAGE                                
... <- output omitted
kernel memory allocator: 
buffer modified after being freed
modification occurred at offset 0x0 (0xdeadbeefdeadbeef replaced by 0x3f3f3f3f3f
3f3f3f)
buffer=ffffff01513301c0 bufctl=ffffff01517780f0 cache: kmem_alloc_256
previous transaction on buffer ffffff01513301c0:
thread=ffffff0005c76c40 time=T-21.196117909 slab=ffffff0151351710 cache: kmem
_alloc_256
kmem_cache_free_debug+10f
kmem_cache_free+123
kmem_free+4e
zil_itxg_clean+ab
taskq_thread+2d0
thread_start+8

panic[cpu0]/thread=ffffff0004cc1c40: 
kernel heap corruption detected


ffffff0004cc0e00 fffffffffba28064 ()
ffffff0004cc0e90 genunix:kmem_cache_alloc_debug+1f7 ()
ffffff0004cc0ef0 genunix:kmem_cache_alloc+d4 ()
ffffff0004cc0f30 genunix:kmem_alloc+4b ()
ffffff0004cc0f60 genunix:getrbuf+18 ()
ffffff0004cc1030 scsi:scsi_alloc_consistent_buf+191 ()
... <-- output omitted
ffffff0004cc1990 zfs:zio_execute+88 ()
ffffff0004cc19f0 zfs:zio_notify_parent+db ()
ffffff0004cc1a30 zfs:zio_ready+eb ()
ffffff0004cc1a70 zfs:zio_execute+88 ()
ffffff0004cc1b30 genunix:taskq_thread+2d0 ()
ffffff0004cc1b40 unix:thread_start+8 ()

syncing file systems...
 done
dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel
>

At this point, the buffer we are interested in (buffer=ffffff01513301c0) was being allocated by getrbuf(), called indirectly on behalf of ZFS I/O, when the deadbeef check fired, and the message tells us the most recent transaction on the buffer: zil_itxg_clean() freeing it. Let's use the auditing flag we turned on to see what allocates and frees this buffer. Note that this gave me roughly 8000 lines of output, so most of it has been omitted.

> ffffff01513301c0::kmalog

T-21.196116669 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_free_debug+0x10f
     kmem_cache_free+0x123
     kmem_free+0x4e
     zil_itxg_clean+0xab
     taskq_thread+0x2d0
     thread_start+8

T-23.383043939 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_alloc_debug+0x2e8
     kmem_cache_alloc+0xd4
     kmem_alloc+0x4b
     zil_itx_create+0x33
     zfs_log_write+0x109
     zfs_write+0xb0e
     fop_write+0x8b
     write+0x250
     write32+0x1e

T-31.201836987 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_free_debug+0x10f
     kmem_cache_free+0x123
     kmem_free+0x4e
     zil_itxg_clean+0xab
     taskq_thread+0x2d0
     thread_start+8

T-34.183272209 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_alloc_debug+0x2e8
     kmem_cache_alloc+0xd4
     kmem_alloc+0x4b
     zil_itx_create+0x33
     zfs_log_write+0x109
     zfs_write+0xb0e
     fop_write+0x8b
     write+0x250
     write32+0x1e

...
T-78.879496922 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_free_debug+0x10f
     kmem_cache_free+0x123
     kmem_free+0x4e
     kbad_bad+0x4b
     kbad_ioctl+0x2b
     cdev_ioctl+0x39
     spec_ioctl+0x60
     fop_ioctl+0x55
     ioctl+0x9b

T-78.879497144 addr=ffffff01513301c0 kmem_alloc_256
     kmem_cache_alloc_debug+0x2e8
     kmem_cache_alloc+0xd4
     kmem_alloc+0x4b
     kbad_bad+0x17
     kbad_ioctl+0x2b
     cdev_ioctl+0x39
     spec_ioctl+0x60
     fop_ioctl+0x55
     ioctl+0x9b
...

The above gives lots of output. Let's try looking at bufctl=ffffff01517780f0 (from the ::msgbuf output).

We'll get all transactions for the specific buffer and bufctl, and just get unique entries.

> ffffff01517780f0::bufctl -h -a ffffff01513301c0 ! cut -d " " -f 7 | sort -u

door_args+0x261
door_call+0x1b5
dv_clone_mknod+0x118
dv_clone_mknod+0x9f
freerbuf+0x25
getrbuf+0x18
kbad_bad+0x17
kbad_bad+0x4b
pageio_done+0x5b
pageio_setup+0x5d
resolve_pathname+0x29a
resolve_pathname+0xd4
segvn_free+0x64
segvn_vpage+0x5f
vdev_disk_io_intr+0x47
vdev_disk_io_start+0x53
zil_itx_create+0x33
zil_itxg_clean+0xab
>

If we look at all of the output (i.e., leave off the sort -u), most of these alloc/free sequences occur more than once. We'll start with the ones that occur only once, and examine the code to see whether the pointer to the allocated data is used after it is freed. For example, door_args+0x261 and door_call+0x1b5 occur multiple times, so we'll skip them. But dv_clone_mknod+0x118 and dv_clone_mknod+0x9f occur only once, so we could look at that code to see whether the address returned by the allocation at dv_clone_mknod+0x9f is used after it is freed. If we get through all of the alloc/free sequences that occur only once and still have not found the bug, we can then look at the ones that occur more than once. A "defensive" (and probably easier) tactic would be to go through the routines above that free the buffer and set the pointer to NULL after the free. Something like:

kmem_free(addr, size);
addr = 0;

This will cause a panic as soon as the pointer (addr) is used after the free, rather than silently corrupting reused memory.
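At user level, the same tactic is often wrapped in a macro (FREE_AND_CLEAR is my name for it, not a system interface); a kernel version would call kmem_free() with the buffer size instead of free():

```c
#include <stdlib.h>

/* free and clear in one step, so any stale pointer faults immediately */
#define	FREE_AND_CLEAR(p)	\
	do {			\
		free(p);	\
		(p) = NULL;	\
	} while (0)
```

The do/while wrapper lets the macro be used safely anywhere a single statement is expected.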

Rather than sift through all of the above examples, we'll look at one that seems obvious: kbad_bad. (Note that this is part of code I wrote specifically to cause the bug. Remember, there are no bugs in SmartOS :) ).

void
kbad_bad(void)
{
    char *p;
    int nticks;

    p = kmem_alloc(KBADSIZE, KM_SLEEP);
    if (!p) {
        cmn_err(CE_WARN, "kbad_bad: out of memory\n");
        return;
    }
    kmem_free(p, KBADSIZE);

    (void) timeout(kbad_timeout_handler, p, DELAYTIME);

    return;
}

So, the routine allocates memory, frees it, and sets a timer (see timeout(9f)). When the timer goes off, kbad_timeout_handler() gets called with a pointer to the space that was just freed up as an argument.

static void
kbad_timeout_handler(void *arg)
{
    int i;
    char *p = (char *)arg;
    for (i = 0; i < KBADSIZE; i++)
        *p++ = '?';
}

And there is the bug. The kbad module uses the pointer after it is freed.

There are several things to note about this example:

  • Most of the time, the freed data will not be overwritten with question marks, but with something that looks like a data structure or a pointer into who knows what. This may or may not make the problem easier to spot.
  • There are equivalent flags for the libumem(3LIB) library. See umem_debug(3MALLOC) for details. You can use LD_PRELOAD=libumem.so cmd args... to have the libumem versions of malloc()/free() used instead of the versions the code was built with. (This doesn't always work; some allocators provide malloc()-like functions with different signatures.)
  • It will usually not be this easy.
