Walking the Linux Kernel Stanislav Kozina Associate Manager April 2016 2 AGENDA Crash it with systemtap See what we can get from the crash dump Prepare the system for crashing Linux Kernel in general Debugging kernel issues without crash 3 Linux Kernel in general – boring ● Just software… ● Written in C & asm ● List of expected features ● Boot and initialization process ● Memory and process management ● Hardware abstraction ● Files, directories, sockets, … ● Resources abstraction ● CPU, memory, … ● POSIX 4 Linux Kernel in general – BUT! ● Quite big (20mil LOC) ● No libc (many other standards functions instead) ● Special environment ● Preemptive, shared memory space ● Early boot code is tricky ● No dynamic allocator, AP, scheduler, even locks! ● Special security requirements ● Kernel should not just die and/or leak anything 5 Linux Kernel in general – development system ● Open source ● List of maintainers in MAINTAINERS ● Patches posted via email ● Documentation/SubmittingPatches ● LKML ● Usually companies care about support of their stuff ● Hardware vendors… ● “We don't break userland” DIGGING IN 7 Observing kernel is not trivial ● Hard to get a consistent picture ● If we stop it, how we observe it? ● Using Vms? ● printk() ● Statistics, /proc, perf, strace, ftrace, ... 8 9 Debugging kernel issues ● Oops messages ● Message buffer, registers, stack/backtrace ● Oops leaves the system running, but unstable! ● Current task is killed ● Printk() ● Systemtap, ftrace ● crash ● Sysrq triggers 10 Kernel oops [ 32.580355] SysRq : Trigger a crash [ 32.581331] BUG: unable to handle kernel NULL pointer dereference at [ 32.582703] IP: [] sysrq_handle_crash+0x16/0x20 [ 32.583781] PGD 3b503067 PUD 3b502067 PMD 0 [ 32.584609] Oops: 0002 [#1] SMP [ 32.585210] Modules linked in: ip6t_rpfilter (... ) [ 32.598062] CPU: 1 PID: 2370 Comm: bash Not tainted 3.10.0- 327.el7.x86_64 #1 [ 32.598923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 32.600020] task: ffff88003b544500 ti: ffff88003b518000 task.ti: ffff88003b518000 [ 32.580355] SysRq : Trigger a crash [ 32.581331] BUG: unable to handle kernel NULL pointer dereference at [ 32.582703] IP: [] sysrq_handle_crash+0x16/0x20 [ 32.583781] PGD 3b503067 PUD 3b502067 PMD 0 [ 32.584609] Oops: 0002 [#1] SMP [ 32.585210] Modules linked in: ip6t_rpfilter (... ) [ 32.598062] CPU: 1 PID: 2370 Comm: bash Not tainted 3.10.0- 327.el7.x86_64 #1 [ 32.598923] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 32.600020] task: ffff88003b544500 ti: ffff88003b518000 task.ti: ffff88003b518000 11 Kernel oops 2 [ 32.600875] RIP: 0010:[] [] sysrq_handle_crash+0x16/0x20 [ 32.601864] RSP: 0018:ffff88003b51be80 EFLAGS: 00010046 [ 32.602469] RAX: 000000000000000f RBX: ffffffff81a06220 RCX: 0000000000000000 [ 32.603281] RDX: 0000000000000000 RSI: ffff88003fd0d6c8 RDI: 0000000000000063 [ 32.604092] RBP: ffff88003b51be80 R08: 0000000000000092 R09: 0000000000000268 [ 32.606518] FS: 00007fa978f3b740(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000 [ 32.607431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.600875] RIP: 0010:[] [] sysrq_handle_crash+0x16/0x20 [ 32.601864] RSP: 0018:ffff88003b51be80 EFLAGS: 00010046 [ 32.602469] RAX: 000000000000000f RBX: ffffffff81a06220 RCX: 0000000000000000 [ 32.603281] RDX: 0000000000000000 RSI: ffff88003fd0d6c8 RDI: 0000000000000063 [ 32.604092] RBP: ffff88003b51be80 R08: 0000000000000092 R09: 0000000000000268 [ 32.606518] FS: 00007fa978f3b740(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000 [ 32.607431] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 12 Kernel oops 3 [ 32.613426] Call Trace: [ 32.613724] [] __handle_sysrq+0xa2/0x170 [ 32.614359] [] write_sysrq_trigger+0x2f/0x40 [ 32.615049] [] proc_reg_write+0x3d/0x80 [ 32.615685] [] vfs_write+0xbd/0x1e0 [ 32.616286] [] ? trace_do_page_fault+0x43/0x110 [ 32.616999] [] SyS_write+0x7f/0xe0 [ 32.617581] [] system_call_fastpath+0x16/0x1b [ 32.613426] Call Trace: [ 32.613724] [] __handle_sysrq+0xa2/0x170 [ 32.614359] [] write_sysrq_trigger+0x2f/0x40 [ 32.615049] [] proc_reg_write+0x3d/0x80 [ 32.615685] [] vfs_write+0xbd/0x1e0 [ 32.616286] [] ? trace_do_page_fault+0x43/0x110 [ 32.616999] [] SyS_write+0x7f/0xe0 [ 32.617581] [] system_call_fastpath+0x16/0x1b 13 Kernel oops ● Good enough when ● The problem is in a direct callpath ● Not sufficient when ● We need to see more context ● Other tasks ● Structure content ● Be aware of -fomit-frame-pointer ● Reliable address only if %rsp == %rbp + sizeof (long) ● CONFIG_FRAME_POINTER 14 Getting crash dump ● Need to configure kdump ● Failsafe crash kernel loaded in a memory ● Switch to the new kernel via kexec ● Run kdump to save to content of /proc/vmcore # echo c > /proc/sysrq-trigger# echo c > /proc/sysrq-trigger 15 Opening a crash dump ● Dump of all kernel memory ● Zero pages, caches, buffers and userland are not dumped by default ● Can be opened in gdb ● But we need DWARF debugging symbols ● Kernel-debuginfo packages in Fedora/CentOS/RHEL ● Crash(8) tool ● Can run on live system through /proc/kcore as well DEMO 17 Using crash(8) ● mod -S ● set hex ● sys ● log ● mount ● swap ● mach [-m | -c] ● net 18 Linux tasks ● include/linux/sched.h: struct task_struct {} ● clone(2) ● CLONE_THREAD: ● “Since Linux 2.4, calls to getpid(2) return the TGID of the caller.” 19 task_struct struct task_struct { long state; void *stack; struct sched_entity se; struct mm_struct *mm; int exit_code; pid_t pid, tgid; struct task_struct *parent; const struct cred __rcu *cred; char comm[TASK_COMM_LEN]; struct files_struct *files; } struct task_struct { long state; void *stack; struct sched_entity se; struct mm_struct *mm; int exit_code; pid_t pid, tgid; struct task_struct *parent; const struct cred __rcu *cred; char comm[TASK_COMM_LEN]; struct files_struct *files; } 20 Process states in Linux ● ps ● Walk task_struct in the kernel ● -u | -k | -G -- filter only user|kernel|thread group leader tasks ● ps -p | -c -- walk parent/child relationship ● ps -S -- summary of task states D R uninterruptable sleep S woken / signal woken interruptable sleep Tsignal Z exit https://idea.popcount.org/2012-12-11-linux-process-states/ 21 Process tree ● ps -p crash> ps -p 6864 PID: 0 TASK: ffffffff81c124c0 CPU: 0 COMMAND: "swapper/0" PID: 1 TASK: ffff88007c888000 CPU: 0 COMMAND: "systemd" PID: 784 TASK: ffff8800369fbc00 CPU: 0 COMMAND: "login" PID: 2884 TASK: ffff880036a19e00 CPU: 0 COMMAND: "bash" PID: 6864 TASK: ffff880036850000 CPU: 0 COMMAND: "cat" crash> ps -p 6864 PID: 0 TASK: ffffffff81c124c0 CPU: 0 COMMAND: "swapper/0" PID: 1 TASK: ffff88007c888000 CPU: 0 COMMAND: "systemd" PID: 784 TASK: ffff8800369fbc00 CPU: 0 COMMAND: "login" PID: 2884 TASK: ffff880036a19e00 CPU: 0 COMMAND: "bash" PID: 6864 TASK: ffff880036850000 CPU: 0 COMMAND: "cat" DEMO 23 Backtrace ● bt -s #8 [ffff88007abefd88] ping_v4_seq_show+0x5 at ffffffff816f6a15 #9 [ffff88007abefd98] seq_read+0xec at ffffffff81246b7c #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 #12 [ffff88007abefed0] vfs_read+0x83 at ffffffff81223e43 #13 [ffff88007abeff08] sys_read+0x55 at ffffffff81224b95 #14 [ffff88007abeff50] entry_SYSCALL_64_fastpath+0x12 at ffffffff81781e2e #8 [ffff88007abefd88] ping_v4_seq_show+0x5 at ffffffff816f6a15 #9 [ffff88007abefd98] seq_read+0xec at ffffffff81246b7c #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 #12 [ffff88007abefed0] vfs_read+0x83 at ffffffff81223e43 #13 [ffff88007abeff08] sys_read+0x55 at ffffffff81224b95 #14 [ffff88007abeff50] entry_SYSCALL_64_fastpath+0x12 at ffffffff81781e2e 24 Full stack dump ● bt -fs ffff88007abefe10: 0000000000020000 0000000000020000 ffff88007abefe20: ffff88007abefe40 ffffffff81290132 #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 ffff88007abefe30: ffff88007c015b00 ffff88007abeff18 ffff88007abefe40: ffff88007abefec8 ffffffff81223387 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 ffff88007abefe50: 0000000000020000 ffff88007c015b00 ffff88007abefe60: ffff88007c015b10 ffff880036575148 ffff88007abefe10: 0000000000020000 0000000000020000 ffff88007abefe20: ffff88007abefe40 ffffffff81290132 #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 ffff88007abefe30: ffff88007c015b00 ffff88007abeff18 ffff88007abefe40: ffff88007abefec8 ffffffff81223387 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 ffff88007abefe50: 0000000000020000 ffff88007c015b00 ffff88007abefe60: ffff88007c015b10 ffff880036575148 25 Full stack dump ● bt -fs ffff88007abefe10: 0000000000020000 0000000000020000 ffff88007abefe20: ffff88007abefe40 ffffffff81290132 #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 ffff88007abefe30: ffff88007c015b00 ffff88007abeff18 ffff88007abefe40: ffff88007abefec8 ffffffff81223387 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 ffff88007abefe50: 0000000000020000 ffff88007c015b00 ffff88007abefe60: ffff88007c015b10 ffff880036575148 ffff88007abefe10: 0000000000020000 0000000000020000 ffff88007abefe20: ffff88007abefe40 ffffffff81290132 #10 [ffff88007abefe28] proc_reg_read+0x42 at ffffffff81290132 ffff88007abefe30: ffff88007c015b00 ffff88007abeff18 ffff88007abefe40: ffff88007abefec8 ffffffff81223387 #11 [ffff88007abefe48] __vfs_read+0x37 at ffffffff81223387 ffff88007abefe50: 0000000000020000 ffff88007c015b00 ffff88007abefe60: ffff88007c015b10 ffff880036575148 26 Disassemble crash> dis proc_reg_read 0xffffffff812900f0 : nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffff812900f5 : push %rbp 0xffffffff812900f6 : xor %r8d,%r8d 0xffffffff812900f9 : mov %rsp,%rbp 0xffffffff812900fc : push %r12 0xffffffff812900fe : push %rbx crash> dis proc_reg_read 0xffffffff812900f0 : nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffff812900f5 : push %rbp 0xffffffff812900f6 : xor %r8d,%r8d 0xffffffff812900f9 : mov %rsp,%rbp 0xffffffff812900fc : push %r12 0xffffffff812900fe : push %rbx DEMO 28 mm_struct ● task_struct -> mm struct mm_struct { struct vm_area_struct *mmap; /* list of VMAs */ struct rb_root mm_rb; pgd_t * pgd; unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; } struct mm_struct { struct vm_area_struct *mmap; /* list of VMAs */ struct rb_root mm_rb; pgd_t * pgd; unsigned long start_code, end_code, start_data, end_data; unsigned long start_brk, brk, start_stack; } 29 vm_area_struct ● mm_struct -> mmap struct vm_area_struct { unsigned long vm_start; unsigned long vm_end; struct file * vm_file; struct anon_vma * anon_vma_chain; } struct vm_area_struct { unsigned long vm_start; unsigned long vm_end; struct file * vm_file; struct anon_vma * anon_vma_chain; } 30 Walk the memory map of a process ● pmap(1) ● task_struct mm mmap vm_start→ → → $ pmap $(pgrep bash) crash> list -o vm_area_struct.vm_next $ pmap $(pgrep bash) crash> list -o vm_area_struct.vm_next DEMO 32 Page tables https://www.kernel.org/doc/gorman/html/understand/understand006.html 33 Page tables (2) ● mm_struct->pgd crash> rd -64 0xffff88007bed9000 1024 ffff88007bed9000: 0000000000000000 0000000000000000 ................ ffff88007bed9010: 0000000000000000 0000000000000000 ................ ffff88007bed9020: 0000000000000000 0000000000000000 ................ ffff88007bed9030: 0000000000000000 0000000000000000 ................ crash> rd -64 0xffff88007bed9000 1024 ffff88007bed9000: 0000000000000000 0000000000000000 ................ ffff88007bed9010: 0000000000000000 0000000000000000 ................ ffff88007bed9020: 0000000000000000 0000000000000000 ................ ffff88007bed9030: 0000000000000000 0000000000000000 ................ DEMO 35 kmem ● [-i] -- overview ● [-p] -- Page info, walks mem_mam ● [-s] -- Slab statistic crash> kmem -s ext4_inode_cache CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE ffff880079fa1000 ext4_inode_cache 1016 48728 48728 6091 8k crash> struct ext4_inode_info | tail -n 1 SIZE: 0x3f8 crash> kmem -s ext4_inode_cache CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE ffff880079fa1000 ext4_inode_cache 1016 48728 48728 6091 8k crash> struct ext4_inode_info | tail -n 1 SIZE: 0x3f8 36 Slab allocator ext4_inode_cachep = kmem_cache_create("ext4_inode_cache", sizeof(struct ext4_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), init_once); ext4_inode_cachep = kmem_cache_create("ext4_inode_cache", sizeof(struct ext4_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), init_once); crash> kmem -s ext4_inode_cache CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE ffff880079fa1000 ext4_inode_cache 1016 48728 48728 6091 8k crash> kmem -s ext4_inode_cache CACHE NAME OBJSIZE ALLOCATED TOTAL SLABS SSIZE ffff880079fa1000 ext4_inode_cache 1016 48728 48728 6091 8k ext4_inode_cachep = kmem_cache_create("ext4_inode_cache", sizeof(struct ext4_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), init_once); ext4_inode_cachep = kmem_cache_create("ext4_inode_cache", sizeof(struct ext4_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), init_once); DEMO 38 CPUs ● There's no “cpu_t” structure ● Instead: ● DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); ● O(1) scheduler replaced with CFS ● Uses RB tree 39 runq ● runq crash> runq CPU 0 RUNQUEUE: ffff88007fc16c80 CURRENT: PID: 6864 TASK: ffff880036850000 COMMAND: "cat" RT PRIO_ARRAY: ffff88007fc16e30 [no tasks queued] CFS RB_ROOT: ffff88007fc16d20 [120] PID: 6464 TASK: ffff880036868000 COMMAND: "kworker/0:1" crash> runq CPU 0 RUNQUEUE: ffff88007fc16c80 CURRENT: PID: 6864 TASK: ffff880036850000 COMMAND: "cat" RT PRIO_ARRAY: ffff88007fc16e30 [no tasks queued] CFS RB_ROOT: ffff88007fc16d20 [120] PID: 6464 TASK: ffff880036868000 COMMAND: "kworker/0:1" DEMO THANK YOU plus.google.com/+RedHat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHatNewslinkedin.com/company/red-hat