Bug #1868
Ceph client crashed after shutting down one mds and osd
Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Description
Here is my cluster status before shutting down the Ceph components on n2cc (2.2.2.2).
# ceph -s -n mds.n2cc
2012-01-02 14:19:04.141557 pg v86136: 254 pgs: 254 active+clean; 143 MB data, 2205 MB used, 397 GB / 400 GB avail
2012-01-02 14:19:04.154427 mds e125: 1/1/1 up {0=n2cc=up:active}, 1 up:standby
2012-01-02 14:19:04.154603 osd e234: 2 osds: 2 up, 2 in
2012-01-02 14:19:04.168401 log 2012-01-02 13:09:21.170832 osd.0 1.1.1.1:6802/30288 1556 : [INF] 84.7 scrub ok
2012-01-02 14:19:04.168490 mon e1: 1 mons at {cc=1.1.1.1:6789/0}
The Ceph cluster should remain usable while n2cc is down. However, my client crashed (I am using rbd):
[ 2971.916917] libceph: osd0 1.1.1.1:6802 socket closed
[ 3872.208920] libceph: osd0 1.1.1.1:6802 socket closed
[ 3921.444546] libceph: osd1 2.2.2.2:6801 socket closed
[ 3952.154957] libceph: osd1 down
[ 3952.154996] ------------[ cut here ]------------
[ 3952.155000] kernel BUG at /build/buildd-linux-2.6_3.0.0-3-amd64-9ClimQ/linux-2.6-3.0.0/debian/build/source_amd64_none/net/ceph/messenger.c:2195!
[ 3952.155007] invalid opcode: 0000 [#1] SMP
[ 3952.155011] CPU 0
[ 3952.155013] Modules linked in: deflate zlib_deflate ctr camellia cast5 rmd160 sha1_generic hmac crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc sha256_generic sha512_generic des_generic xfrm_user ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 rng_core af_key ip6table_filter ip6_tables iptable_filter ip_tables x_tables xfs cryptd aes_x86_64 aes_generic cbc rbd libceph crc32c libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod ext2 fuse evdev snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
[ 3952.155106]
[ 3952.155109] Pid: 17, comm: kworker/0:1 Not tainted 3.0.0-1-amd64 #1
[ 3952.155115] RIP: e030:[<ffffffffa01510f9>] [<ffffffffa01510f9>] ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.155126] RSP: e02b:ffff880016083c20 EFLAGS: 00010287
[ 3952.155130] RAX: ffff8800158e61c8 RBX: ffff8800158e6030 RCX: ffff880016047250
[ 3952.155134] RDX: ffff8800158fcf78 RSI: 0000000000000003 RDI: ffff8800158e61a8
[ 3952.155138] RBP: ffff8800158e61a8 R08: 0000000000000001 R09: ffff880016098dc0
[ 3952.155143] R10: ffff880015814280 R11: 0000000000000003 R12: ffff8800158e6058
[ 3952.155147] R13: ffff8800158fcf00 R14: ffff880016047250 R15: ffff880016047260
[ 3952.155155] FS: 00007fe13e631720(0000) GS:ffff880017af1000(0000) knlGS:0000000000000000
[ 3952.155160] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3952.155163] CR2: 00007fe13f542f90 CR3: 0000000014db4000 CR4: 0000000000002660
[ 3952.155168] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3952.155173] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3952.155177] Process kworker/0:1 (pid: 17, threadinfo ffff880016082000, task ffff880016081610)
[ 3952.155182] Stack:
[ 3952.155184] ffff88001586be00 ffff88001586be00 ffff8800160471a8 ffff880016047230
[ 3952.155192] ffff880016047200 ffffffffa0155ffb ffffffff81006040 ffff8800160471a8
[ 3952.155199] ffff880014837933 ffff88001499a680 0000000000000000 ffff88001600dac0
[ 3952.155206] Call Trace:
[ 3952.155213] [<ffffffffa0155ffb>] ? send_queued+0xd2/0x10e [libceph]
[ 3952.155221] [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.155229] [<ffffffffa0157dfd>] ? ceph_osdc_handle_map+0x2ca/0x32d [libceph]
[ 3952.155237] [<ffffffffa0155172>] ? dispatch+0x416/0x472 [libceph]
[ 3952.155244] [<ffffffffa0152993>] ? con_work+0xf2b/0x1d65 [libceph]
[ 3952.155250] [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.155256] [<ffffffff813356f9>] ? schedule+0x5e5/0x5fc
[ 3952.155263] [<ffffffffa0151a68>] ? read_partial_message_section.clone.9+0x74/0x74 [libceph]
[ 3952.155270] [<ffffffff8105b943>] ? process_one_work+0x193/0x28f
[ 3952.155275] [<ffffffff8105cacf>] ? worker_thread+0xef/0x172
[ 3952.155280] [<ffffffff8105c9e0>] ? manage_workers.clone.17+0x15b/0x15b
[ 3952.155285] [<ffffffff8105fc0b>] ? kthread+0x7a/0x82
[ 3952.155290] [<ffffffff8133ce24>] ? kernel_thread_helper+0x4/0x10
[ 3952.155296] [<ffffffff8133bf33>] ? int_ret_from_sys_call+0x7/0x1b
[ 3952.155301] [<ffffffff81336e61>] ? retint_restore_args+0x5/0x6
[ 3952.155305] [<ffffffff8133ce20>] ? gs_change+0x13/0x13
[ 3952.155309] Code: 48 39 46 50 74 02 0f 0b 48 8d af 78 01 00 00 c6 86 b2 00 00 00 01 48 89 ef e8 63 4e 1e e1 49 8b 45 78 49 8d 55 78 48 39 d0 74 02 <0f> 0b 48 8b 93 a0 01 00 00 48 8d 8b 98 01 00 00 48 89 83 a0 01
[ 3952.155360] RIP [<ffffffffa01510f9>] ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.155367] RSP <ffff880016083c20>
[ 3952.155373] ---[ end trace 0885c879b6e74a54 ]---
[ 3952.155414] BUG: unable to handle kernel paging request at fffffffffffffff8
[ 3952.155420] IP: [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.155425] PGD 1605067 PUD 1606067 PMD 0
[ 3952.155431] Oops: 0000 [#2] SMP
[ 3952.155435] CPU 0
[ 3952.155437] Modules linked in: deflate zlib_deflate ctr camellia cast5 rmd160 sha1_generic hmac crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc sha256_generic sha512_generic des_generic xfrm_user ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 rng_core af_key ip6table_filter ip6_tables iptable_filter ip_tables x_tables xfs cryptd aes_x86_64 aes_generic cbc rbd libceph crc32c libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod ext2 fuse evdev snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
[ 3952.155525]
[ 3952.155528] Pid: 17, comm: kworker/0:1 Tainted: G D 3.0.0-1-amd64 #1
[ 3952.155534] RIP: e030:[<ffffffff8105fe38>] [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.155541] RSP: e02b:ffff880016083970 EFLAGS: 00010002
[ 3952.155544] RAX: 0000000000000000 RBX: ffff880017b03800 RCX: 0000000000000000
[ 3952.155548] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880016081610
[ 3952.155552] RBP: 0000000000000000 R08: 0000000000000400 R09: ffff8800175f7358
[ 3952.155557] R10: 0000000000000000 R11: 0720072007200720 R12: ffff880016083a58
[ 3952.155561] R13: ffff880017591510 R14: 0000000000000000 R15: ffff880016081908
[ 3952.155567] FS: 00007fe13e631720(0000) GS:ffff880017af1000(0000) knlGS:0000000000000000
[ 3952.155572] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3952.155575] CR2: fffffffffffffff8 CR3: 0000000014db4000 CR4: 0000000000002660
[ 3952.155580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3952.155584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3952.155589] Process kworker/0:1 (pid: 17, threadinfo ffff880016082000, task ffff880016081610)
[ 3952.155593] Stack:
[ 3952.155595] ffffffff8105cdfb ffff880017b03800 ffff880016081610 ffff880016083a58
[ 3952.155603] ffffffff81335268 ffffffff81006040 ffffffff810065c2 0720072007200720
[ 3952.155610] 0000000000012800 ffff880016083fd8 ffff880016083fd8 0000000000012800
[ 3952.155617] Call Trace:
[ 3952.155621] [<ffffffff8105cdfb>] ? wq_worker_sleeping+0xb/0x6e
[ 3952.158894] [<ffffffff81335268>] ? schedule+0x154/0x5fc
[ 3952.158894] [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.158894] [<ffffffff810065c2>] ? check_events+0x12/0x20
[ 3952.158894] [<ffffffff810eb9cf>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894] [<ffffffff810ed34e>] ? kmem_cache_free+0x2d/0x69
[ 3952.158894] [<ffffffff81095b03>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894] [<ffffffff81049ea0>] ? do_exit+0x73e/0x740
[ 3952.158894] [<ffffffff81071f0b>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894] [<ffffffff81337b7e>] ? oops_end+0xb1/0xb6
[ 3952.158894] [<ffffffff81009a21>] ? do_invalid_op+0x87/0x91
[ 3952.158894] [<ffffffffa01510f9>] ? ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.158894] [<ffffffffa0158425>] ? calc_pg_raw+0x178/0x190 [libceph]
[ 3952.158894] [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.158894] [<ffffffff8133cc9b>] ? invalid_op+0x1b/0x20
[ 3952.158894] [<ffffffffa01510f9>] ? ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.158894] [<ffffffffa0155ffb>] ? send_queued+0xd2/0x10e [libceph]
[ 3952.158894] [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.158894] [<ffffffffa0157dfd>] ? ceph_osdc_handle_map+0x2ca/0x32d [libceph]
[ 3952.158894] [<ffffffffa0155172>] ? dispatch+0x416/0x472 [libceph]
[ 3952.158894] [<ffffffffa0152993>] ? con_work+0xf2b/0x1d65 [libceph]
[ 3952.158894] [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.158894] [<ffffffff813356f9>] ? schedule+0x5e5/0x5fc
[ 3952.158894] [<ffffffffa0151a68>] ? read_partial_message_section.clone.9+0x74/0x74 [libceph]
[ 3952.158894] [<ffffffff8105b943>] ? process_one_work+0x193/0x28f
[ 3952.158894] [<ffffffff8105cacf>] ? worker_thread+0xef/0x172
[ 3952.158894] [<ffffffff8105c9e0>] ? manage_workers.clone.17+0x15b/0x15b
[ 3952.158894] [<ffffffff8105fc0b>] ? kthread+0x7a/0x82
[ 3952.158894] [<ffffffff8133ce24>] ? kernel_thread_helper+0x4/0x10
[ 3952.158894] [<ffffffff8133bf33>] ? int_ret_from_sys_call+0x7/0x1b
[ 3952.158894] [<ffffffff81336e61>] ? retint_restore_args+0x5/0x6
[ 3952.158894] [<ffffffff8133ce20>] ? gs_change+0x13/0x13
[ 3952.158894] Code: 37 20 fe ff 48 8b 7c 24 18 48 c7 c6 e0 5f 40 81 e8 51 25 fe ff 48 8b 44 24 18 48 81 c4 a8 00 00 00 5b 5d c3 48 8b 87 a0 02 00 00
[ 3952.158894] 8b 40 f8 c3 48 3b 3d 3c c5 71 00 75 08 0f bf 87 62 06 00 00
[ 3952.158894] RIP [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.158894] RSP <ffff880016083970>
[ 3952.158894] CR2: fffffffffffffff8
[ 3952.158894] ---[ end trace 0885c879b6e74a55 ]---
[ 3952.158894] Fixing recursive fault but reboot is needed!
The client is running Debian squeeze with this kernel:
# apt-cache policy linux-image-`uname -r`
linux-image-3.0.0-1-amd64:
  Installed: 3.0.0-3
History
#1 Updated by Sage Weil almost 12 years ago
- Status changed from New to Resolved
This bug was fixed by commit 935b639a049053d0ccbcf7422f2f9cd221642f58 in v3.1.
You should have better luck with the latest mainline kernel. We aren't at a point yet where we're backporting fixes to stable kernels.
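Since the fix is reported to have landed in v3.1 and is not being backported, a client can be checked by comparing its kernel release string against that version. The following is a minimal sketch, not an official tool: the names `FIXED_IN`, `kernel_version`, and `has_fix` are hypothetical, and the check looks only at the version number, so it cannot detect a distribution kernel that carries the patch separately.

```python
import re

# Assumption taken from the comment above: the fix first appears in v3.1.
FIXED_IN = (3, 1)


def kernel_version(release):
    """Return (major, minor) parsed from a release string like '3.0.0-1-amd64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def has_fix(release):
    """True if the release is at or after the version that contains the fix."""
    return kernel_version(release) >= FIXED_IN
```

On a running client, the release string can be obtained with `uname -r` or, in Python, `platform.release()`, and passed to `has_fix`.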