Bug #473

Kernel panic: ceph_pagelist_append

Added by Wido den Hollander over 13 years ago. Updated over 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
Clustered MDS
Target version:
-
% Done:
0%

Description

I was running simultaneous rsyncs of the kernel.org, Debian, and Ubuntu mirrors when my client got a kernel panic.

The dmesg output:

Oct  9 13:26:01 client01 kernel: [366850.430253] ceph: mds0 reconnect start
Oct  9 13:26:01 client01 kernel: [366850.430715] general protection fault: 0000 [#1] SMP 
Oct  9 13:26:01 client01 kernel: [366850.430817] last sysfs file: /sys/module/aes_generic/initstate
Oct  9 13:26:01 client01 kernel: [366850.430906] CPU 0 
Oct  9 13:26:01 client01 kernel: [366850.430947] Modules linked in: cryptd aes_x86_64 aes_generic ceph libceph libcrc32c ip6table_filter ip6_tables crc32c ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables kvm_intel kvm bridge stp bonding radeon ttm drm_kms_helper ppdev lp drm i2c_algo_bit parport_pc i3000_edac parport led_class shpchp edac_core serio_raw usb_storage floppy e1000e [last unloaded: libceph]
Oct  9 13:26:01 client01 kernel: [366850.432179] 
Oct  9 13:26:01 client01 kernel: [366850.432202] Pid: 4, comm: kworker/0:0 Not tainted 2.6.36-rc5-rbd-20014-g53f0521 #3 PDSMi+/PDSMi
Oct  9 13:26:01 client01 kernel: [366850.432334] RIP: 0010:[<ffffffff812c387b>]  [<ffffffff812c387b>] memcpy+0xb/0xb0
Oct  9 13:26:01 client01 kernel: [366850.432468] RSP: 0018:ffff880216cdb968  EFLAGS: 00010202
Oct  9 13:26:01 client01 kernel: [366850.432552] RAX: 49081a8490e48000 RBX: ffff8800cffb1fc0 RCX: 0000000000000004
Oct  9 13:26:01 client01 kernel: [366850.432658] RDX: 0000000000000007 RSI: ffff8800cffb1e55 RDI: 49081a8490e48000
Oct  9 13:26:01 client01 kernel: [366850.432763] RBP: ffff880216cdb9a0 R08: 000000000000003d R09: 0000000000000050
Oct  9 13:26:01 client01 kernel: [366850.432868] R10: 0000000000000000 R11: dead000000200200 R12: 0000000000001000
Oct  9 13:26:01 client01 kernel: [366850.432974] R13: 0000000000000027 R14: ffff8800cffb1e55 R15: 6db6db6db6db6db7
Oct  9 13:26:01 client01 kernel: [366850.433080] FS:  0000000000000000(0000) GS:ffff880001e00000(0000) knlGS:0000000000000000
Oct  9 13:26:01 client01 kernel: [366850.433205] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  9 13:26:01 client01 kernel: [366850.433290] CR2: 00007fcf741be000 CR3: 0000000214fcc000 CR4: 00000000000006f0
Oct  9 13:26:01 client01 kernel: [366850.433395] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  9 13:26:01 client01 kernel: [366850.433501] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  9 13:26:01 client01 kernel: [366850.433606] Process kworker/0:0 (pid: 4, threadinfo ffff880216cda000, task ffff880216cc44a0)
Oct  9 13:26:01 client01 kernel: [366850.433726] Stack:
Oct  9 13:26:01 client01 kernel: [366850.433768]  ffffffffa03d7f37 ffff880216cdb980 ffff8801a9e84d28 ffff8800cffb1fc0
Oct  9 13:26:01 client01 kernel: [366850.433929] <0> ffff880216cdba38 ffff8801a9dd1380 ffff880216cdbb60 ffff880216cdba80
Oct  9 13:26:01 client01 kernel: [366850.434118] <0> ffffffffa041010a 00000000ffffffff 000000000000003c ffff8801a9e848bc
Oct  9 13:26:01 client01 kernel: [366850.434311] Call Trace:
Oct  9 13:26:01 client01 kernel: [366850.434362]  [<ffffffffa03d7f37>] ? ceph_pagelist_append+0x157/0x1d0 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.434477]  [<ffffffffa041010a>] encode_caps_cb+0x2aa/0x3d0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.434585]  [<ffffffffa040e73b>] iterate_session_caps+0xbb/0x1b0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.434692]  [<ffffffff81100227>] ? __page_cache_alloc+0x87/0x90
Oct  9 13:26:01 client01 kernel: [366850.434790]  [<ffffffffa040fe60>] ? encode_caps_cb+0x0/0x3d0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.434898]  [<ffffffffa04115bb>] send_mds_reconnect+0x2fb/0x4b0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.435003]  [<ffffffffa0411a5d>] ceph_mdsc_handle_map+0x29d/0x450 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.435114]  [<ffffffffa03fbae9>] extra_mon_dispatch+0x29/0x30 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.435215]  [<ffffffffa03d8ee0>] dispatch+0x50/0x590 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.435310]  [<ffffffffa03d7621>] con_work+0x1501/0x1690 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.435413]  [<ffffffff810591d1>] ? dequeue_entity+0x1b1/0x1f0
Oct  9 13:26:01 client01 kernel: [366850.435506]  [<ffffffffa03d6120>] ? con_work+0x0/0x1690 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.435605]  [<ffffffff81079b73>] process_one_work+0x123/0x440
Oct  9 13:26:01 client01 kernel: [366850.435695]  [<ffffffff8107c0c0>] worker_thread+0x170/0x400
Oct  9 13:26:01 client01 kernel: [366850.440206]  [<ffffffff8107bf50>] ? worker_thread+0x0/0x400
Oct  9 13:26:01 client01 kernel: [366850.440206]  [<ffffffff81080486>] kthread+0x96/0xa0
Oct  9 13:26:01 client01 kernel: [366850.440206]  [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
Oct  9 13:26:01 client01 kernel: [366850.440206]  [<ffffffff810803f0>] ? kthread+0x0/0xa0
Oct  9 13:26:01 client01 kernel: [366850.440206]  [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10
Oct  9 13:26:01 client01 kernel: [366850.440206] Code: 49 89 70 50 19 c0 49 89 70 58 41 c6 40 4c 04 83 e0 fc 83 c0 08 41 88 40 4d c9 c3 90 90 90 90 90 48 89 f8 89 d1 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 8b 1e 4c 8b 46 08 4c 89 1f 4c 89 47 08 
Oct  9 13:26:01 client01 kernel: [366850.440206] RIP  [<ffffffff812c387b>] memcpy+0xb/0xb0
Oct  9 13:26:01 client01 kernel: [366850.440206]  RSP <ffff880216cdb968>
Oct  9 13:26:01 client01 kernel: [366850.508857] ---[ end trace 1e7299bcbc67da97 ]---
Oct  9 13:26:01 client01 kernel: [366850.514683] Kernel panic - not syncing: Fatal exception
Oct  9 13:26:01 client01 kernel: [366850.520220] Pid: 4, comm: kworker/0:0 Tainted: G      D     2.6.36-rc5-rbd-20014-g53f0521 #3
Oct  9 13:26:01 client01 kernel: [366850.525839] Call Trace:
Oct  9 13:26:01 client01 kernel: [366850.531249]  [<ffffffff815943f6>] panic+0x91/0x1a1
Oct  9 13:26:01 client01 kernel: [366850.536552]  [<ffffffff810610e5>] ? kmsg_dump+0x145/0x160
Oct  9 13:26:01 client01 kernel: [366850.541804]  [<ffffffff81598a8c>] oops_end+0xdc/0xf0
Oct  9 13:26:01 client01 kernel: [366850.546965]  [<ffffffff8100ed4b>] die+0x5b/0x90
Oct  9 13:26:01 client01 kernel: [366850.552091]  [<ffffffff815985c2>] do_general_protection+0x162/0x170
Oct  9 13:26:01 client01 kernel: [366850.557173]  [<ffffffff81597d25>] general_protection+0x25/0x30
Oct  9 13:26:01 client01 kernel: [366850.562249]  [<ffffffff812c387b>] ? memcpy+0xb/0xb0
Oct  9 13:26:01 client01 kernel: [366850.567245]  [<ffffffffa03d7f37>] ? ceph_pagelist_append+0x157/0x1d0 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.572290]  [<ffffffffa041010a>] encode_caps_cb+0x2aa/0x3d0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.577325]  [<ffffffffa040e73b>] iterate_session_caps+0xbb/0x1b0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.582383]  [<ffffffff81100227>] ? __page_cache_alloc+0x87/0x90
Oct  9 13:26:01 client01 kernel: [366850.587360]  [<ffffffffa040fe60>] ? encode_caps_cb+0x0/0x3d0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.592337]  [<ffffffffa04115bb>] send_mds_reconnect+0x2fb/0x4b0 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.597339]  [<ffffffffa0411a5d>] ceph_mdsc_handle_map+0x29d/0x450 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.602385]  [<ffffffffa03fbae9>] extra_mon_dispatch+0x29/0x30 [ceph]
Oct  9 13:26:01 client01 kernel: [366850.607380]  [<ffffffffa03d8ee0>] dispatch+0x50/0x590 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.612317]  [<ffffffffa03d7621>] con_work+0x1501/0x1690 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.617211]  [<ffffffff810591d1>] ? dequeue_entity+0x1b1/0x1f0
Oct  9 13:26:01 client01 kernel: [366850.622055]  [<ffffffffa03d6120>] ? con_work+0x0/0x1690 [libceph]
Oct  9 13:26:01 client01 kernel: [366850.626811]  [<ffffffff81079b73>] process_one_work+0x123/0x440
Oct  9 13:26:01 client01 kernel: [366850.631525]  [<ffffffff8107c0c0>] worker_thread+0x170/0x400
Oct  9 13:26:01 client01 kernel: [366850.636214]  [<ffffffff8107bf50>] ? worker_thread+0x0/0x400
Oct  9 13:26:01 client01 kernel: [366850.640878]  [<ffffffff81080486>] kthread+0x96/0xa0
Oct  9 13:26:01 client01 kernel: [366850.645498]  [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
Oct  9 13:26:01 client01 kernel: [366850.650123]  [<ffffffff810803f0>] ? kthread+0x0/0xa0
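
The faulting frames point at memcpy() called from ceph_pagelist_append() in libceph, invoked from encode_caps_cb() while send_mds_reconnect() was encoding cap state for mds0. For context, a pagelist copies appended data into a chain of page-sized buffers, allocating a new page whenever the current one fills. Below is a minimal user-space sketch of that pattern (hypothetical names, not the actual kernel source); if the tail-page pointer were stale or poisoned (note the LIST_POISON2 value dead000000200200 in R11 above), the memcpy would land on a garbage address, consistent with the general protection fault:

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* One page-sized buffer in the chain. */
struct page_node {
    struct page_node *next;
    char data[PAGE_SIZE];
};

struct pagelist {
    struct page_node *head;
    struct page_node *tail;
    size_t room;               /* bytes left in the tail page */
};

/* Append len bytes, growing the chain page by page. The kernel's
 * equivalent memcpy into the tail page is where this report faults;
 * a stale or poisoned tail pointer would explain the GPF. */
static int pagelist_append(struct pagelist *pl, const void *buf, size_t len)
{
    while (len > 0) {
        if (pl->room == 0) {
            struct page_node *pg = calloc(1, sizeof(*pg));
            if (!pg)
                return -1;
            if (pl->tail)
                pl->tail->next = pg;
            else
                pl->head = pg;
            pl->tail = pg;
            pl->room = PAGE_SIZE;
        }
        size_t n = len < pl->room ? len : pl->room;
        memcpy(pl->tail->data + (PAGE_SIZE - pl->room), buf, n);
        pl->room -= n;
        buf = (const char *)buf + n;
        len -= n;
    }
    return 0;
}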

This client is running the master branch from a few days ago.

It looks like one MDS crashed and restarted:

root@node13:~# ps aux|grep cmds
root     10659  9.3 23.6 1079792 958204 ?      Ssl  13:01   7:28 /usr/bin/cmds -i 0 -c /etc/ceph/ceph.conf
root     11023  0.0  0.0   7672   820 pts/0    S+   14:21   0:00 grep --color=auto cmds
root@node13:~#
root@node14:~# ps aux|grep cmds
root       630  3.0  0.1  97136  8048 ?        Ssl  13:28   1:38 /usr/bin/cmds -i 1 -c /etc/ceph/ceph.conf
root       680  0.0  0.0   7672   820 pts/0    S+   14:22   0:00 grep --color=auto cmds
root@node14:~#

It seems that mds.1 restarted after this crash, so it might be involved.

I did not manually invoke the restart; could the MDS have done this by itself?

#1

Updated by Wido den Hollander over 13 years ago

I checked my MDS log; it shows:

2010-10-09 13:25:13.192992 7fa2fabc4710 mds0.2 ms_handle_connect on [2001:16f8:10:2::c3c3:2bfe]:6800/27355
2010-10-09 13:26:48.678418 7fa2fabc4710 mds0.2 ms_handle_connect on [2001:16f8:10:2::c3c3:8f6b]:6800/12577
2010-10-09 13:28:10.843957 7fa2f99c0710 mds0.2 beacon_kill last_acked_stamp 2010-10-09 13:27:55.843889, setting laggy flag.
2010-10-09 13:28:16.860165 7fa2f6fb2710 -- [2001:16f8:10:2::c3c3:2e5c]:6800/543 >> [2001:16f8:10:2::c3c3:6fd3]:0/464705235 pipe(0xeb6000 sd=-1 pgs=2 cs=1 l=0).fault initiating reconnect
2010-10-09 13:28:26.548477 7fa2f99c0710 mds0.2 beacon_kill last_acked_stamp 2010-10-09 13:27:59.843996, setting laggy flag.
2010-10-09 13:28:26.548659 7fa2fabc4710 mds-1.2 handle_mds_map i ([2001:16f8:10:2::c3c3:2e5c]:6800/543) dne in the mdsmap, respawning myself
2010-10-09 13:28:26.548681 7fa2fabc4710 mds-1.2 respawn
2010-10-09 13:28:26.548690 7fa2fabc4710 mds-1.2  e: '/usr/bin/cmds'
2010-10-09 13:28:26.548698 7fa2fabc4710 mds-1.2  0: '/usr/bin/cmds'
2010-10-09 13:28:26.548706 7fa2fabc4710 mds-1.2  1: '-i'
2010-10-09 13:28:26.548713 7fa2fabc4710 mds-1.2  2: '1'
2010-10-09 13:28:26.548721 7fa2fabc4710 mds-1.2  3: '-c'
2010-10-09 13:28:26.548729 7fa2fabc4710 mds-1.2  4: '/etc/ceph/ceph.conf'
2010-10-09 13:28:26.548756 7fa2fabc4710 mds-1.2  cwd /
2010-10-09 13:28:26.577964 --- 544 opened log /var/log/ceph/mds.1.log --

As you can see, the MDS respawned.

2001:16f8:10:2::c3c3:6fd3 is the IP of my client.
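
For context, the "respawning myself" lines above show the daemon re-executing its own binary: it logs its executable path, argument vector, and cwd, then replaces itself with a fresh instance. A minimal sketch of that pattern (hypothetical, assuming a POSIX environment; not the actual cmds source):

#include <stdio.h>
#include <unistd.h>

/* Re-execute the current binary with its original arguments,
 * mirroring the e:/0:/1:/... lines in the log above.
 * argv must be NULL-terminated; execv() only returns on failure. */
static void respawn(char *const argv[])
{
    fprintf(stderr, "respawn\n");
    fprintf(stderr, " e: '%s'\n", argv[0]);
    for (int i = 0; argv[i] != NULL; i++)
        fprintf(stderr, " %d: '%s'\n", i, argv[i]);

    execv(argv[0], argv);
    perror("execv");
}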

#2

Updated by Sage Weil over 13 years ago

Do you know the commit id the client was running?

#3

Updated by Wido den Hollander over 13 years ago

I'm not completely sure; my vmlinuz is dated 2010-09-30, so it's about 12 days old.

vmlinuz-2.6.36-rc5-rbd-20014-g53f0521

#4

Updated by Sage Weil over 13 years ago

commit:299ef41b70e26e6725073c2d0f85e5da7aa547d0 touches similar code, although it's not clear to me that it could cause this crash...

#5

Updated by Sage Weil over 13 years ago

  • Assignee set to Yehuda Sadeh
#6

Updated by Yehuda Sadeh over 13 years ago

It looks like it wasn't the master branch but a build from the unstable branch, probably commit:53f05210b418eaa8ba747611d4f1506abfdef2e0 (which matches the g53f0521 suffix in the client's kernel version string).

#7

Updated by Sage Weil over 13 years ago

  • Assignee deleted (Yehuda Sadeh)
#8

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce