Bug #64

crash in handle_mds_map (corrupt s_waiting list?)

Added by Sage Weil almost 14 years ago. Updated over 13 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Seen on the unstable branch.

ceph: mds0 caps stale
ceph: mds0 caps stale
ceph: mds0 hung
ceph: mds0 came back
ceph: mds0 caps renewed
ceph: mds0 reconnect completed
ceph: mds0 caps stale
ceph: mds0 hung
ceph: mds0 came back
ceph: mds0 caps went stale, renewing
ceph: mds0 hung
ceph: reconnect to recovering mds0
ceph: mds0 10.178.28.98:6800 socket closed
ceph: mds0 10.178.28.98:6800 connection failed
ceph: mds0 caps stale
ceph: reconnect to recovering mds0
ceph: mon0 10.178.28.97:6789 socket closed
ceph: mon0 10.178.28.97:6789 session lost, hunting for new mon
ceph: mon1 10.178.28.98:6789 session established
device fsid 9745a95d7084fe84-3f12a96d503909a6 devid 1 transid 2365 /dev/sdb1
cmon2605: segfault at 8 ip 000000312120b0d2 sp 0000000041b696a8 error 6 in libpthread-2.5.so[3121200000+16000]
ceph: mds0 caps stale
ceph: mds0 caps stale
------------[ cut here ]------------
WARNING: at lib/list_debug.c:26 __list_add+0x39/0x7d()
Hardware name: PowerEdge 2950
list_add corruption. next->prev should be prev (ffffffff817406c0), but was (null). (next=ffff88043c494db8).
Modules linked in: nfsd exportfs loop ceph nfs lockd fscache nfs_acl auth_rpcgss sunrpc autofs4 fuse rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core dm_mirror dm_multipath scsi_dh video output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport joydev sg bnx2 dcdbas sr_mod button cdrom tpm_tis serio_raw myri10ge tpm rtc_cmos rtc_core rtc_lib tpm_bios pcspkr i5000_edac edac_core btrfs zlib_deflate dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]
Pid: 3029, comm: mount Tainted: G W 2.6.34-rc5 #4
Call Trace:
[<ffffffff81170318>] ? __list_add+0x39/0x7d
[<ffffffff81037ee1>] ? warn_slowpath_common+0x77/0x8e
[<ffffffff81037f54>] ? warn_slowpath_fmt+0x51/0x59
[<ffffffff810c0e67>] ? pcpu_alloc+0x70b/0x7ec
[<ffffffff81170318>] ? __list_add+0x39/0x7d
[<ffffffff8117362e>] ? __percpu_counter_init+0x4e/0x60
[<ffffffff810a1c9d>] ? bdi_init+0x108/0x168
[<ffffffffa04e9bfc>] ? ceph_get_sb+0x4b2/0x97a [ceph]
[<ffffffff810a0882>] ? kstrdup+0x25/0xb2
[<ffffffff810c48ea>] ? vfs_kern_mount+0xa9/0x159
[<ffffffff810c49ed>] ? do_kern_mount+0x43/0xe3
[<ffffffff810d7bdf>] ? do_mount+0x6db/0x77b
[<ffffffff810d7cff>] ? sys_mount+0x80/0xba
[<ffffffff810028ab>] ? system_call_fastpath+0x16/0x1b
---[ end trace 4c26dd56ae7be0d2 ]---
ceph: mon1 10.178.28.98:6789 socket closed
ceph: mon1 10.178.28.98:6789 session lost, hunting for new mon
ceph: mon1 10.178.28.98:6789 connection failed
ceph: mds0 10.178.28.98:6800 socket closed
ceph: mds0 10.178.28.98:6800 connection failed
ceph: mds0 10.178.28.98:6800 connection failed
ceph: mds0 10.178.28.98:6800 connection failed
ceph: mon0 10.178.28.97:6789 connection failed
ceph: mon1 10.178.28.98:6789 connection failed
device fsid 9745a95d7084fe84-3f12a96d503909a6 devid 1 transid 2368 /dev/sdb1
ceph: mon0 10.178.28.97:6789 session established
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [<ffffffffa04fc960>] __wake_requests+0x2b/0x64 [ceph]
PGD 34f23f067 PUD 358c33067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 5
Modules linked in: nfsd exportfs loop ceph nfs lockd fscache nfs_acl auth_rpcgss sunrpc autofs4 fuse rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_sa ipv6 ib_uverbs ib_umad mlx4_ib mlx4_core ib_mthca ib_mad ib_core dm_mirror dm_multipath scsi_dh video output sbs sbshc battery acpi_memhotplug ac parport_pc lp parport joydev sg bnx2 dcdbas sr_mod button cdrom tpm_tis serio_raw myri10ge tpm rtc_cmos rtc_core rtc_lib tpm_bios pcspkr i5000_edac edac_core btrfs zlib_deflate dm_region_hash dm_log dm_mod ata_piix libata shpchp megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: freq_table]

Pid: 3016, comm: ceph-msgr/5 Tainted: G W 2.6.34-rc5 #4 0H603H/PowerEdge 2950
RIP: 0010:[<ffffffffa04fc960>] [<ffffffffa04fc960>] __wake_requests+0x2b/0x64 [ceph]
RSP: 0018:ffff88042ba47ce0 EFLAGS: 00010212
RAX: 0000000000000000 RBX: fffffffffffffde0 RCX: ffff88034a6b1a20
RDX: 0000000000000000 RSI: ffff88034a6b1800 RDI: ffff880287928978
RBP: ffff8802879289e8 R08: 0000000000000001 R09: ffffffffa04fecda
R10: ffffffff8175dde0 R11: ffff880436ee8e48 R12: ffff880287928978
R13: ffff880356627e1e R14: ffff88035622b800 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880001f40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000008 CR3: 000000012d1ac000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ceph-msgr/5 (pid: 3016, threadinfo ffff88042ba46000, task ffff88043cc9b080)
Stack:
ffff880287928980 ffff880287928978 0000000000000078 ffffffffa04fcf67
<0> ffff8801922a9ac0 ffffffff8126316a 0000000000000000 0000000100000009
<0> 0000000000000001 00000000000000a0 000000010000021e ffff880358ce0bc0
Call Trace:
[<ffffffffa04fcf67>] ? ceph_mdsc_handle_map+0x5ce/0x6c0 [ceph]
[<ffffffff8126316a>] ? kernel_sendmsg+0x32/0x3e
[<ffffffff812631ab>] ? kernel_recvmsg+0x35/0x42
[<ffffffff812631ab>] ? kernel_recvmsg+0x35/0x42
[<ffffffffa04ff79b>] ? dispatch+0x323/0x37f [ceph]
[<ffffffffa04fa10b>] ? con_work+0xbdf/0xf08 [ceph]
[<ffffffffa04f952c>] ? con_work+0x0/0xf08 [ceph]
[<ffffffff8104c328>] ? worker_thread+0x146/0x1e0
[<ffffffff8104ecff>] ? autoremove_wake_function+0x0/0x2e
[<ffffffff8104c1e2>] ? worker_thread+0x0/0x1e0
[<ffffffff8104ea0b>] ? kthread+0x79/0x81
[<ffffffff81003654>] ? kernel_thread_helper+0x4/0x10
[<ffffffff8104e992>] ? kthread+0x0/0x81
[<ffffffff81003650>] ? kernel_thread_helper+0x0/0x10
Code: 41 54 49 89 fc 55 48 89 f5 53 48 8b 36 48 81 ee 20 02 00 00 48 8b 9e 20 02 00 00 eb 2f 48 8b 96 20 02 00 00 48 8b 41 08 4c 89 e7 <48> 89 42 08 48 89 10 48 89 49 08 48 89 8e 20 02 00 00 e8 fe f8
RIP [<ffffffffa04fc960>] __wake_requests+0x2b/0x64 [ceph]
RSP <ffff88042ba47ce0>
CR2: 0000000000000008
---[ end trace 4c26dd56ae7be0d3 ]---
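
A note on reading the oops above, offered as an interpretation rather than a confirmed root cause. Decoding the Code: bytes by hand, the faulting instruction (marked <48> 89 42 08) appears to be a store of RAX to offset 8 from RDX. RDX is 0 and CR2 is 0x8, and the stores that follow look like an inlined list_del_init()-style unlink. That would mean the entry being removed had a NULL next pointer, which fits the "corrupt s_waiting list?" guess in the title. The sketch below is a userspace model in Python, not kernel code: ListHead, Node, PREV_OFFSET and first_store_address are illustrative names, and it assumes the usual layout of struct list_head (next first, then prev).

```python
import ctypes


class ListHead(ctypes.Structure):
    """Userspace stand-in for the kernel's struct list_head."""


# Same field order as the kernel structure: next first, then prev.
ListHead._fields_ = [
    ("next", ctypes.POINTER(ListHead)),
    ("prev", ctypes.POINTER(ListHead)),
]

# Byte offset of 'prev' inside the structure (one pointer width).
PREV_OFFSET = ListHead.prev.offset


def first_store_address(next_ptr: int) -> int:
    """Address written by the unlink's first store, next->prev = prev."""
    return next_ptr + PREV_OFFSET


class Node:
    """Toy list node; a fresh node points at itself, like INIT_LIST_HEAD."""

    def __init__(self):
        self.next = self
        self.prev = self


def list_del_init(entry: Node) -> None:
    """Unlink entry and reinitialise it, in the same store order as the kernel helper."""
    entry.next.prev = entry.prev  # fails here if entry.next is None
    entry.prev.next = entry.next
    entry.next = entry
    entry.prev = entry


if __name__ == "__main__":
    # Address the first store would hit when next is NULL.
    print(hex(first_store_address(0)))
    broken = Node()
    broken.next = None  # simulate the corrupted entry
    try:
        list_del_init(broken)
    except AttributeError as exc:
        print("unlink failed:", exc)
```

On a machine with 8-byte pointers PREV_OFFSET is 8, so a NULL next makes the first store land at address 0x8, the same value reported in CR2. The model says nothing about how the entry came to have a NULL next pointer; that is what the fix referenced in the history below addresses.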

History

#1 Updated by Sage Weil almost 14 years ago

  • Target version set to v2.6.35

#2 Updated by Sage Weil almost 14 years ago

  • Status changed from New to Resolved

Fixed by 'ceph: fix locking, error paths when waking reconnect requests'.

#3 Updated by Sage Weil almost 14 years ago

Fixed by commit:1c0806d2caacc683c56a587eaf1502769a7c0698
