Bug #473: Kernel panic: ceph_pagelist_append
Status: Closed
Description
I was running simultaneous rsyncs of kernel.org, Debian, and Ubuntu when my client got a kernel panic.
The dmesg output:
Oct 9 13:26:01 client01 kernel: [366850.430253] ceph: mds0 reconnect start
Oct 9 13:26:01 client01 kernel: [366850.430715] general protection fault: 0000 [#1] SMP
Oct 9 13:26:01 client01 kernel: [366850.430817] last sysfs file: /sys/module/aes_generic/initstate
Oct 9 13:26:01 client01 kernel: [366850.430906] CPU 0
Oct 9 13:26:01 client01 kernel: [366850.430947] Modules linked in: cryptd aes_x86_64 aes_generic ceph libceph libcrc32c ip6table_filter ip6_tables crc32c ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm bridge stp bonding radeon ttm drm_kms_helper ppdev lp drm i2c_algo_bit parport_pc i3000_edac parport led_class shpchp edac_core serio_raw usb_storage floppy e1000e [last unloaded: libceph]
Oct 9 13:26:01 client01 kernel: [366850.432179]
Oct 9 13:26:01 client01 kernel: [366850.432202] Pid: 4, comm: kworker/0:0 Not tainted 2.6.36-rc5-rbd-20014-g53f0521 #3 PDSMi+/PDSMi
Oct 9 13:26:01 client01 kernel: [366850.432334] RIP: 0010:[<ffffffff812c387b>] [<ffffffff812c387b>] memcpy+0xb/0xb0
Oct 9 13:26:01 client01 kernel: [366850.432468] RSP: 0018:ffff880216cdb968 EFLAGS: 00010202
Oct 9 13:26:01 client01 kernel: [366850.432552] RAX: 49081a8490e48000 RBX: ffff8800cffb1fc0 RCX: 0000000000000004
Oct 9 13:26:01 client01 kernel: [366850.432658] RDX: 0000000000000007 RSI: ffff8800cffb1e55 RDI: 49081a8490e48000
Oct 9 13:26:01 client01 kernel: [366850.432763] RBP: ffff880216cdb9a0 R08: 000000000000003d R09: 0000000000000050
Oct 9 13:26:01 client01 kernel: [366850.432868] R10: 0000000000000000 R11: dead000000200200 R12: 0000000000001000
Oct 9 13:26:01 client01 kernel: [366850.432974] R13: 0000000000000027 R14: ffff8800cffb1e55 R15: 6db6db6db6db6db7
Oct 9 13:26:01 client01 kernel: [366850.433080] FS: 0000000000000000(0000) GS:ffff880001e00000(0000) knlGS:0000000000000000
Oct 9 13:26:01 client01 kernel: [366850.433205] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 9 13:26:01 client01 kernel: [366850.433290] CR2: 00007fcf741be000 CR3: 0000000214fcc000 CR4: 00000000000006f0
Oct 9 13:26:01 client01 kernel: [366850.433395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 9 13:26:01 client01 kernel: [366850.433501] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 9 13:26:01 client01 kernel: [366850.433606] Process kworker/0:0 (pid: 4, threadinfo ffff880216cda000, task ffff880216cc44a0)
Oct 9 13:26:01 client01 kernel: [366850.433726] Stack:
Oct 9 13:26:01 client01 kernel: [366850.433768] ffffffffa03d7f37 ffff880216cdb980 ffff8801a9e84d28 ffff8800cffb1fc0
Oct 9 13:26:01 client01 kernel: [366850.433929] <0> ffff880216cdba38 ffff8801a9dd1380 ffff880216cdbb60 ffff880216cdba80
Oct 9 13:26:01 client01 kernel: [366850.434118] <0> ffffffffa041010a 00000000ffffffff 000000000000003c ffff8801a9e848bc
Oct 9 13:26:01 client01 kernel: [366850.434311] Call Trace:
Oct 9 13:26:01 client01 kernel: [366850.434362] [<ffffffffa03d7f37>] ? ceph_pagelist_append+0x157/0x1d0 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.434477] [<ffffffffa041010a>] encode_caps_cb+0x2aa/0x3d0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.434585] [<ffffffffa040e73b>] iterate_session_caps+0xbb/0x1b0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.434692] [<ffffffff81100227>] ? __page_cache_alloc+0x87/0x90
Oct 9 13:26:01 client01 kernel: [366850.434790] [<ffffffffa040fe60>] ? encode_caps_cb+0x0/0x3d0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.434898] [<ffffffffa04115bb>] send_mds_reconnect+0x2fb/0x4b0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.435003] [<ffffffffa0411a5d>] ceph_mdsc_handle_map+0x29d/0x450 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.435114] [<ffffffffa03fbae9>] extra_mon_dispatch+0x29/0x30 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.435215] [<ffffffffa03d8ee0>] dispatch+0x50/0x590 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.435310] [<ffffffffa03d7621>] con_work+0x1501/0x1690 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.435413] [<ffffffff810591d1>] ? dequeue_entity+0x1b1/0x1f0
Oct 9 13:26:01 client01 kernel: [366850.435506] [<ffffffffa03d6120>] ? con_work+0x0/0x1690 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.435605] [<ffffffff81079b73>] process_one_work+0x123/0x440
Oct 9 13:26:01 client01 kernel: [366850.435695] [<ffffffff8107c0c0>] worker_thread+0x170/0x400
Oct 9 13:26:01 client01 kernel: [366850.440206] [<ffffffff8107bf50>] ? worker_thread+0x0/0x400
Oct 9 13:26:01 client01 kernel: [366850.440206] [<ffffffff81080486>] kthread+0x96/0xa0
Oct 9 13:26:01 client01 kernel: [366850.440206] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
Oct 9 13:26:01 client01 kernel: [366850.440206] [<ffffffff810803f0>] ? kthread+0x0/0xa0
Oct 9 13:26:01 client01 kernel: [366850.440206] [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
Oct 9 13:26:01 client01 kernel: [366850.440206] Code: 49 89 70 50 19 c0 49 89 70 58 41 c6 40 4c 04 83 e0 fc 83 c0 08 41 88 40 4d c9 c3 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 8b 1e 4c 8b 46 08 4c 89 1f 4c 89 47 08
Oct 9 13:26:01 client01 kernel: [366850.440206] RIP [<ffffffff812c387b>] memcpy+0xb/0xb0
Oct 9 13:26:01 client01 kernel: [366850.440206] RSP <ffff880216cdb968>
Oct 9 13:26:01 client01 kernel: [366850.508857] ---[ end trace 1e7299bcbc67da97 ]---
Oct 9 13:26:01 client01 kernel: [366850.514683] Kernel panic - not syncing: Fatal exception
Oct 9 13:26:01 client01 kernel: [366850.520220] Pid: 4, comm: kworker/0:0 Tainted: G D 2.6.36-rc5-rbd-20014-g53f0521 #3
Oct 9 13:26:01 client01 kernel: [366850.525839] Call Trace:
Oct 9 13:26:01 client01 kernel: [366850.531249] [<ffffffff815943f6>] panic+0x91/0x1a1
Oct 9 13:26:01 client01 kernel: [366850.536552] [<ffffffff810610e5>] ? kmsg_dump+0x145/0x160
Oct 9 13:26:01 client01 kernel: [366850.541804] [<ffffffff81598a8c>] oops_end+0xdc/0xf0
Oct 9 13:26:01 client01 kernel: [366850.546965] [<ffffffff8100ed4b>] die+0x5b/0x90
Oct 9 13:26:01 client01 kernel: [366850.552091] [<ffffffff815985c2>] do_general_protection+0x162/0x170
Oct 9 13:26:01 client01 kernel: [366850.557173] [<ffffffff81597d25>] general_protection+0x25/0x30
Oct 9 13:26:01 client01 kernel: [366850.562249] [<ffffffff812c387b>] ? memcpy+0xb/0xb0
Oct 9 13:26:01 client01 kernel: [366850.567245] [<ffffffffa03d7f37>] ? ceph_pagelist_append+0x157/0x1d0 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.572290] [<ffffffffa041010a>] encode_caps_cb+0x2aa/0x3d0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.577325] [<ffffffffa040e73b>] iterate_session_caps+0xbb/0x1b0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.582383] [<ffffffff81100227>] ? __page_cache_alloc+0x87/0x90
Oct 9 13:26:01 client01 kernel: [366850.587360] [<ffffffffa040fe60>] ? encode_caps_cb+0x0/0x3d0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.592337] [<ffffffffa04115bb>] send_mds_reconnect+0x2fb/0x4b0 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.597339] [<ffffffffa0411a5d>] ceph_mdsc_handle_map+0x29d/0x450 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.602385] [<ffffffffa03fbae9>] extra_mon_dispatch+0x29/0x30 [ceph]
Oct 9 13:26:01 client01 kernel: [366850.607380] [<ffffffffa03d8ee0>] dispatch+0x50/0x590 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.612317] [<ffffffffa03d7621>] con_work+0x1501/0x1690 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.617211] [<ffffffff810591d1>] ? dequeue_entity+0x1b1/0x1f0
Oct 9 13:26:01 client01 kernel: [366850.622055] [<ffffffffa03d6120>] ? con_work+0x0/0x1690 [libceph]
Oct 9 13:26:01 client01 kernel: [366850.626811] [<ffffffff81079b73>] process_one_work+0x123/0x440
Oct 9 13:26:01 client01 kernel: [366850.631525] [<ffffffff8107c0c0>] worker_thread+0x170/0x400
Oct 9 13:26:01 client01 kernel: [366850.636214] [<ffffffff8107bf50>] ? worker_thread+0x0/0x400
Oct 9 13:26:01 client01 kernel: [366850.640878] [<ffffffff81080486>] kthread+0x96/0xa0
Oct 9 13:26:01 client01 kernel: [366850.645498] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
Oct 9 13:26:01 client01 kernel: [366850.650123] [<ffffffff810803f0>] ? kthread+0x0/0xa0
This client is running the master branch from a few days ago.
It seems that one MDS crashed or restarted:
root@node13:~# ps aux|grep cmds
root 10659 9.3 23.6 1079792 958204 ? Ssl 13:01 7:28 /usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf
root 11023 0.0 0.0 7672 820 pts/0 S+ 14:21 0:00 grep --color=auto cmds
root@node13:~#

root@node14:~# ps aux|grep cmds
root 630 3.0 0.1 97136 8048 ? Ssl 13:28 1:38 /usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf
root 680 0.0 0.0 7672 820 pts/0 S+ 14:22 0:00 grep --color=auto cmds
root@node14:~#
It seems that mds.1 restarted after this crash; could that be involved?
I did not manually invoke the restart. Could the MDS have done this by itself?
Updated by Wido den Hollander over 13 years ago
I checked my MDS log, which shows:
2010-10-09 13:25:13.192992 7fa2fabc4710 mds0.2 ms_handle_connect on [2001:16f8:10:2::c3c3:2bfe]:6800/27355
2010-10-09 13:26:48.678418 7fa2fabc4710 mds0.2 ms_handle_connect on [2001:16f8:10:2::c3c3:8f6b]:6800/12577
2010-10-09 13:28:10.843957 7fa2f99c0710 mds0.2 beacon_kill last_acked_stamp 2010-10-09 13:27:55.843889, setting laggy flag.
2010-10-09 13:28:16.860165 7fa2f6fb2710 -- [2001:16f8:10:2::c3c3:2e5c]:6800/543 >> [2001:16f8:10:2::c3c3:6fd3]:0/464705235 pipe(0xeb6000 sd=-1 pgs=2 cs=1 l=0).fault initiating reconnect
2010-10-09 13:28:26.548477 7fa2f99c0710 mds0.2 beacon_kill last_acked_stamp 2010-10-09 13:27:59.843996, setting laggy flag.
2010-10-09 13:28:26.548659 7fa2fabc4710 mds-1.2 handle_mds_map i ([2001:16f8:10:2::c3c3:2e5c]:6800/543) dne in the mdsmap, respawning myself
2010-10-09 13:28:26.548681 7fa2fabc4710 mds-1.2 respawn
2010-10-09 13:28:26.548690 7fa2fabc4710 mds-1.2 e: '/usr/bin/cmds'
2010-10-09 13:28:26.548698 7fa2fabc4710 mds-1.2 0: '/usr/bin/cmds'
2010-10-09 13:28:26.548706 7fa2fabc4710 mds-1.2 1: '-i'
2010-10-09 13:28:26.548713 7fa2fabc4710 mds-1.2 2: '1'
2010-10-09 13:28:26.548721 7fa2fabc4710 mds-1.2 3: '-c'
2010-10-09 13:28:26.548729 7fa2fabc4710 mds-1.2 4: '/etc/ceph/ceph.conf'
2010-10-09 13:28:26.548756 7fa2fabc4710 mds-1.2 cwd /
2010-10-09 13:28:26.577964 --- 544 opened log /var/log/ceph/mds.1.log --
As you can see, the MDS respawned.
2001:16f8:10:2::c3c3:6fd3 is the IP of my client.
Updated by Sage Weil over 13 years ago
Do you know the commit id the client was running?
Updated by Wido den Hollander over 13 years ago
I'm not completely sure; my vmlinuz is dated 30-09-2010, so it's about 12 days old.
vmlinuz-2.6.36-rc5-rbd-20014-g53f0521
Updated by Sage Weil over 13 years ago
commit:299ef41b70e26e6725073c2d0f85e5da7aa547d0 touches similar code, although it's not clear to me that it could cause this crash...
Updated by Yehuda Sadeh over 13 years ago
It looks like it wasn't the master branch, but rather a checkout from the unstable branch, probably commit:53f05210b418eaa8ba747611d4f1506abfdef2e0.
Updated by Sage Weil over 13 years ago
- Status changed from New to Can't reproduce