Bug #1868

Ceph client crashed after shutting down one mds and osd

Added by Maciej Galkiewicz about 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Here is my cluster configuration before shutting down the Ceph components (one mds and one osd) on n2cc (2.2.2.2).

# ceph -s -n mds.n2cc
2012-01-02 14:19:04.141557    pg v86136: 254 pgs: 254 active+clean; 143 MB data, 2205 MB used, 397 GB / 400 GB avail
2012-01-02 14:19:04.154427   mds e125: 1/1/1 up {0=n2cc=up:active}, 1 up:standby
2012-01-02 14:19:04.154603   osd e234: 2 osds: 2 up, 2 in
2012-01-02 14:19:04.168401   log 2012-01-02 13:09:21.170832 osd.0 1.1.1.1:6802/30288 1556 : [INF] 84.7 scrub ok
2012-01-02 14:19:04.168490   mon e1: 1 mons at {cc=1.1.1.1:6789/0}

The Ceph cluster should remain usable while n2cc is down. However, my client (which uses rbd) crashed:

[ 2971.916917] libceph: osd0 1.1.1.1:6802 socket closed
[ 3872.208920] libceph: osd0 1.1.1.1:6802 socket closed
[ 3921.444546] libceph: osd1 2.2.2.2:6801 socket closed
[ 3952.154957] libceph: osd1 down
[ 3952.154996] ------------[ cut here ]------------
[ 3952.155000] kernel BUG at /build/buildd-linux-2.6_3.0.0-3-amd64-9ClimQ/linux-2.6-3.0.0/debian/build/source_amd64_none/net/ceph/messenger.c:2195!
[ 3952.155007] invalid opcode: 0000 [#1] SMP
[ 3952.155011] CPU 0
[ 3952.155013] Modules linked in: deflate zlib_deflate ctr camellia cast5 rmd160 sha1_generic hmac crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc sha256_generic sha512_generic des_generic xfrm_user ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 rng_core af_key ip6table_filter ip6_tables iptable_filter ip_tables x_tables xfs cryptd aes_x86_64 aes_generic cbc rbd libceph crc32c libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod ext2 fuse evdev snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
[ 3952.155106]
[ 3952.155109] Pid: 17, comm: kworker/0:1 Not tainted 3.0.0-1-amd64 #1
[ 3952.155115] RIP: e030:[<ffffffffa01510f9>]  [<ffffffffa01510f9>] ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.155126] RSP: e02b:ffff880016083c20  EFLAGS: 00010287
[ 3952.155130] RAX: ffff8800158e61c8 RBX: ffff8800158e6030 RCX: ffff880016047250
[ 3952.155134] RDX: ffff8800158fcf78 RSI: 0000000000000003 RDI: ffff8800158e61a8
[ 3952.155138] RBP: ffff8800158e61a8 R08: 0000000000000001 R09: ffff880016098dc0
[ 3952.155143] R10: ffff880015814280 R11: 0000000000000003 R12: ffff8800158e6058
[ 3952.155147] R13: ffff8800158fcf00 R14: ffff880016047250 R15: ffff880016047260
[ 3952.155155] FS:  00007fe13e631720(0000) GS:ffff880017af1000(0000) knlGS:0000000000000000
[ 3952.155160] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3952.155163] CR2: 00007fe13f542f90 CR3: 0000000014db4000 CR4: 0000000000002660
[ 3952.155168] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3952.155173] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3952.155177] Process kworker/0:1 (pid: 17, threadinfo ffff880016082000, task ffff880016081610)
[ 3952.155182] Stack:
[ 3952.155184]  ffff88001586be00 ffff88001586be00 ffff8800160471a8 ffff880016047230
[ 3952.155192]  ffff880016047200 ffffffffa0155ffb ffffffff81006040 ffff8800160471a8
[ 3952.155199]  ffff880014837933 ffff88001499a680 0000000000000000 ffff88001600dac0
[ 3952.155206] Call Trace:  
[ 3952.155213]  [<ffffffffa0155ffb>] ? send_queued+0xd2/0x10e [libceph]
[ 3952.155221]  [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.155229]  [<ffffffffa0157dfd>] ? ceph_osdc_handle_map+0x2ca/0x32d [libceph]
[ 3952.155237]  [<ffffffffa0155172>] ? dispatch+0x416/0x472 [libceph]
[ 3952.155244]  [<ffffffffa0152993>] ? con_work+0xf2b/0x1d65 [libceph]
[ 3952.155250]  [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.155256]  [<ffffffff813356f9>] ? schedule+0x5e5/0x5fc
[ 3952.155263]  [<ffffffffa0151a68>] ? read_partial_message_section.clone.9+0x74/0x74 [libceph]
[ 3952.155270]  [<ffffffff8105b943>] ? process_one_work+0x193/0x28f
[ 3952.155275]  [<ffffffff8105cacf>] ? worker_thread+0xef/0x172
[ 3952.155280]  [<ffffffff8105c9e0>] ? manage_workers.clone.17+0x15b/0x15b
[ 3952.155285]  [<ffffffff8105fc0b>] ? kthread+0x7a/0x82
[ 3952.155290]  [<ffffffff8133ce24>] ? kernel_thread_helper+0x4/0x10
[ 3952.155296]  [<ffffffff8133bf33>] ? int_ret_from_sys_call+0x7/0x1b
[ 3952.155301]  [<ffffffff81336e61>] ? retint_restore_args+0x5/0x6
[ 3952.155305]  [<ffffffff8133ce20>] ? gs_change+0x13/0x13
[ 3952.155309] Code: 48 39 46 50 74 02 0f 0b 48 8d af 78 01 00 00 c6 86 b2 00 00 00 01 48 89 ef e8 63 4e 1e e1 49 8b 45 78 49 8d 55 78 48 39 d0 74 02 <0f> 0b 48 8b 93 a0 01 00 00 48 8d 8b 98 01 00 00 48 89 83 a0 01
[ 3952.155360] RIP  [<ffffffffa01510f9>] ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.155367]  RSP <ffff880016083c20>
[ 3952.155373] ---[ end trace 0885c879b6e74a54 ]---
[ 3952.155414] BUG: unable to handle kernel paging request at fffffffffffffff8
[ 3952.155420] IP: [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.155425] PGD 1605067 PUD 1606067 PMD 0
[ 3952.155431] Oops: 0000 [#2] SMP
[ 3952.155435] CPU 0
[ 3952.155437] Modules linked in: deflate zlib_deflate ctr camellia cast5 rmd160 sha1_generic hmac crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc sha256_generic sha512_generic des_generic xfrm_user ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 rng_core af_key ip6table_filter ip6_tables iptable_filter ip_tables x_tables xfs cryptd aes_x86_64 aes_generic cbc rbd libceph crc32c libcrc32c ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_mod ext2 fuse evdev snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
[ 3952.155525]
[ 3952.155528] Pid: 17, comm: kworker/0:1 Tainted: G      D      3.0.0-1-amd64 #1
[ 3952.155534] RIP: e030:[<ffffffff8105fe38>]  [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.155541] RSP: e02b:ffff880016083970  EFLAGS: 00010002
[ 3952.155544] RAX: 0000000000000000 RBX: ffff880017b03800 RCX: 0000000000000000
[ 3952.155548] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880016081610
[ 3952.155552] RBP: 0000000000000000 R08: 0000000000000400 R09: ffff8800175f7358
[ 3952.155557] R10: 0000000000000000 R11: 0720072007200720 R12: ffff880016083a58
[ 3952.155561] R13: ffff880017591510 R14: 0000000000000000 R15: ffff880016081908
[ 3952.155567] FS:  00007fe13e631720(0000) GS:ffff880017af1000(0000) knlGS:0000000000000000
[ 3952.155572] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3952.155575] CR2: fffffffffffffff8 CR3: 0000000014db4000 CR4: 0000000000002660
[ 3952.155580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3952.155584] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3952.155589] Process kworker/0:1 (pid: 17, threadinfo ffff880016082000, task ffff880016081610)
[ 3952.155593] Stack:
[ 3952.155595]  ffffffff8105cdfb ffff880017b03800 ffff880016081610 ffff880016083a58
[ 3952.155603]  ffffffff81335268 ffffffff81006040 ffffffff810065c2 0720072007200720
[ 3952.155610]  0000000000012800 ffff880016083fd8 ffff880016083fd8 0000000000012800
[ 3952.155617] Call Trace:  
[ 3952.155621]  [<ffffffff8105cdfb>] ? wq_worker_sleeping+0xb/0x6e
[ 3952.158894]  [<ffffffff81335268>] ? schedule+0x154/0x5fc
[ 3952.158894]  [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.158894]  [<ffffffff810065c2>] ? check_events+0x12/0x20
[ 3952.158894]  [<ffffffff810eb9cf>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894]  [<ffffffff810ed34e>] ? kmem_cache_free+0x2d/0x69
[ 3952.158894]  [<ffffffff81095b03>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894]  [<ffffffff81049ea0>] ? do_exit+0x73e/0x740
[ 3952.158894]  [<ffffffff81071f0b>] ? arch_local_irq_restore+0x7/0x8
[ 3952.158894]  [<ffffffff81337b7e>] ? oops_end+0xb1/0xb6
[ 3952.158894]  [<ffffffff81009a21>] ? do_invalid_op+0x87/0x91
[ 3952.158894]  [<ffffffffa01510f9>] ? ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.158894]  [<ffffffffa0158425>] ? calc_pg_raw+0x178/0x190 [libceph]
[ 3952.158894]  [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.158894]  [<ffffffff8133cc9b>] ? invalid_op+0x1b/0x20
[ 3952.158894]  [<ffffffffa01510f9>] ? ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.158894]  [<ffffffffa0155ffb>] ? send_queued+0xd2/0x10e [libceph]
[ 3952.158894]  [<ffffffff81006040>] ? xen_force_evtchn_callback+0x9/0xa
[ 3952.158894]  [<ffffffffa0157dfd>] ? ceph_osdc_handle_map+0x2ca/0x32d [libceph]
[ 3952.158894]  [<ffffffffa0155172>] ? dispatch+0x416/0x472 [libceph]
[ 3952.158894]  [<ffffffffa0152993>] ? con_work+0xf2b/0x1d65 [libceph]
[ 3952.158894]  [<ffffffff810383fc>] ? need_resched+0x1a/0x23
[ 3952.158894]  [<ffffffff813356f9>] ? schedule+0x5e5/0x5fc
[ 3952.158894]  [<ffffffffa0151a68>] ? read_partial_message_section.clone.9+0x74/0x74 [libceph]
[ 3952.158894]  [<ffffffff8105b943>] ? process_one_work+0x193/0x28f
[ 3952.158894]  [<ffffffff8105cacf>] ? worker_thread+0xef/0x172
[ 3952.158894]  [<ffffffff8105c9e0>] ? manage_workers.clone.17+0x15b/0x15b
[ 3952.158894]  [<ffffffff8105fc0b>] ? kthread+0x7a/0x82
[ 3952.158894]  [<ffffffff8133ce24>] ? kernel_thread_helper+0x4/0x10
[ 3952.158894]  [<ffffffff8133bf33>] ? int_ret_from_sys_call+0x7/0x1b
[ 3952.158894]  [<ffffffff81336e61>] ? retint_restore_args+0x5/0x6
[ 3952.158894]  [<ffffffff8133ce20>] ? gs_change+0x13/0x13
[ 3952.158894] Code: 37 20 fe ff 48 8b 7c 24 18 48 c7 c6 e0 5f 40 81 e8 51 25 fe ff 48 8b 44 24 18 48 81 c4 a8 00 00 00 5b 5d c3 48 8b 87 a0 02 00 00
[ 3952.158894]  8b 40 f8 c3 48 3b 3d 3c c5 71 00 75 08 0f bf 87 62 06 00 00
[ 3952.158894] RIP  [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.158894]  RSP <ffff880016083970>
[ 3952.158894] CR2: fffffffffffffff8
[ 3952.158894] ---[ end trace 0885c879b6e74a55 ]---
[ 3952.158894] Fixing recursive fault but reboot is needed!
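
For triage, the key fields of an oops like the one above (the BUG location and the faulting symbols) can be pulled out of raw dmesg text with a short script. The following is only a minimal sketch: `summarize_oops` is a hypothetical helper, not part of any Ceph or kernel tooling, and the regular expressions assume the log layout shown in this report.

```python
import re

# Hypothetical helper, not part of any Ceph or kernel tooling.
# The patterns assume the dmesg layout shown in this report.
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
# Matches the summary "RIP  [<addr>] symbol+0xoff/0xlen [module]" lines;
# the trailing "[module]" part is optional (absent for core kernel symbols).
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<module>\w+)\])?"
)


def summarize_oops(dmesg_text):
    """Return the BUG file name, BUG line number and faulting symbols."""
    bug = BUG_RE.search(dmesg_text)
    rips = [(m.group("symbol"), m.group("module"))
            for m in RIP_RE.finditer(dmesg_text)]
    return {
        "bug_file": bug.group("path").rsplit("/", 1)[-1] if bug else None,
        "bug_line": int(bug.group("line")) if bug else None,
        "rips": rips,
    }


# Lines copied from the trace in this report.
sample = """\
[ 3952.155000] kernel BUG at /build/buildd-linux-2.6_3.0.0-3-amd64-9ClimQ/linux-2.6-3.0.0/debian/build/source_amd64_none/net/ceph/messenger.c:2195!
[ 3952.155360] RIP  [<ffffffffa01510f9>] ceph_con_send+0x6e/0xe7 [libceph]
[ 3952.155367]  RSP <ffff880016083c20>
[ 3952.158894] RIP  [<ffffffff8105fe38>] kthread_data+0x7/0xc
[ 3952.158894]  RSP <ffff880016083970>
"""

summary = summarize_oops(sample)
print(summary["bug_file"], summary["bug_line"])
print(summary["rips"])
```

On the lines above, this reports the first fault in `ceph_con_send` (module `libceph`) at `messenger.c:2195`, followed by the secondary fault in `kthread_data`.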

The client is running Debian squeeze with the following kernel:

# apt-cache policy linux-image-`uname -r`
linux-image-3.0.0-1-amd64:
  Installed: 3.0.0-3

History

#1 Updated by Sage Weil about 12 years ago

  • Status changed from New to Resolved

This bug was fixed by commit:935b639a049053d0ccbcf7422f2f9cd221642f58 in v3.1.

You should have better luck with the latest mainline kernel. We aren't at a point yet where we're backporting fixes to stable kernels.
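
A rough way to check whether a client kernel is at least v3.1 is to compare the leading version numbers of its release string (the output of `uname -r`). This is only a sketch: `kernel_at_least` is a hypothetical helper, and the version number says nothing about whether a distribution kernel carries additional patches.

```python
import re


def kernel_at_least(release, minimum=(3, 1)):
    """Compare the major.minor prefix of a kernel release string
    (as printed by `uname -r`) against a minimum version tuple."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return (int(match.group(1)), int(match.group(2))) >= minimum


# The kernel from this report predates the fix; a v3.1 kernel does not.
print(kernel_at_least("3.0.0-1-amd64"))  # False
print(kernel_at_least("3.1.0"))          # True
```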
