Bug #10793

kernel BUG at net/ceph/messenger.c:2954 (while copying on cephfs)

Added by Nikola Ciprich about 9 years ago. Updated about 9 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: libceph
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

ceph-0.87, kernel 3.18.6 (x86_64), CentOS 6, three-node Ceph cluster (3 mons, 1 mds, 18 osds)

While copying a fairly small amount of data to cephfs storage, the node on which I was copying the data crashed.
The backtrace follows:

Feb 8 15:32:06 10.76.13.12 [21326.746313] kernel BUG at net/ceph/messenger.c:2954!
Feb 8 15:32:06 10.76.13.12 [21326.746316] invalid opcode: 0000 [#1] PREEMPT SMP
Feb 8 15:32:06 10.76.13.12 [21326.746364] Modules linked in: cbc ceph libceph fscache drbd lru_cache dlm sctp crc32c_generic libcrc32c configfs netconsole ipmi_devintf rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace sunrpc bridge stp llc 8021q bonding ipv6 ext4 jbd2 crc16 vhost_net macvtap macvlan vhost tun kvm_intel kvm ppdev iTCO_wdt parport_pc parport rtc_cmos ipmi_si ipmi_msghandler pcspkr i2c_i801 i2c_core 8139too 8139cp mii sg joydev rng_core lpc_ich mfd_core ehci_pci ehci_hcd ioatdma dca i5k_amb i5000_edac edac_core e1000e ptp pps_core acpi_cpufreq processor thermal_sys hwmon ext3 jbd raid1 usbhid sd_mod aic94xx libsas scsi_transport_sas ahci libahci pata_acpi ata_generic ata_piix libata scsi_mod uhci_hcd button dm_mirror dm_region_hash dm_log dm_mod
Feb 8 15:32:06 10.76.13.12 [21326.746366] CPU: 7 PID: 780 Comm: kworker/7:1 Not tainted 3.18.6lb6.00_01_PRE01 #1
Feb 8 15:32:06 10.76.13.12 [21326.746367] Hardware name: Supermicro X7DB8/X7DB8, BIOS 2.1a 12/20/2008
Feb 8 15:32:06 10.76.13.12 [21326.746379] Workqueue: ceph-msgr con_work [libceph]
Feb 8 15:32:06 10.76.13.12 [21326.746380] task: ffff8806d595e240 ti: ffff8804a0d54000 task.ti: ffff8804a0d54000
Feb 8 15:32:06 10.76.13.12 [21326.746386] RIP: 0010:[<ffffffffa08310e6>] [<ffffffffa08310e6>] ceph_con_send+0x136/0x150 [libceph]
Feb 8 15:32:06 10.76.13.12 [21326.746387] RSP: 0018:ffff8804a0d57b28 EFLAGS: 00010246
Feb 8 15:32:06 10.76.13.12 [21326.746388] RAX: 0000000000000000 RBX: ffff88080126d030 RCX: ffff880426693020
Feb 8 15:32:06 10.76.13.12 [21326.746389] RDX: ffff88080126d000 RSI: 0000000000000000 RDI: 0000000000000000
Feb 8 15:32:06 10.76.13.12 [21326.746389] RBP: ffff8804a0d57b48 R08: 0000000000000000 R09: 0000000000000000
Feb 8 15:32:06 10.76.13.12 [21326.746390] R10: 0000000000000000 R11: 0000000000000002 R12: ffff880613145a28
Feb 8 15:32:06 10.76.13.12 [21326.746391] R13: ffff88080126d1b8 R14: ffff880802a018b8 R15: ffff880426693000
Feb 8 15:32:06 10.76.13.12 [21326.746392] FS: 0000000000000000(0000) GS:ffff88082fdc0000(0000) knlGS:0000000000000000
Feb 8 15:32:06 10.76.13.12 [21326.746394] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 8 15:32:06 10.76.13.12 [21326.746394] CR2: ffffffffff600400 CR3: 00000008016b0000 CR4: 00000000000027e0

If I can provide any further information, please let me know.

History

#1 Updated by Greg Farnum about 9 years ago

  • Project changed from Ceph to Linux kernel client
  • Category changed from common to libceph
  • Source changed from other to Community (user)

Ilya or Zheng, any ideas?

#2 Updated by Ilya Dryomov about 9 years ago

  • Status changed from New to Need More Info
  • Assignee set to Ilya Dryomov

BUG_ON(msg->con != NULL);
msg->con = con->ops->get(con);
BUG_ON(msg->con == NULL); <-- !!!

The marked BUG_ON is the one that fired, but the actual backtrace is missing. Nikola, is that all there is in syslog? Can you send the entire log for that boot?

This is likely a bad osd refcount, so it could be related to #8087. There is a major bug in that area that I'm working on: apart from the fact that we don't really respect the refcount, I have a log from somebody with ceph_osd::o_ref < 0.
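
To illustrate how a bad refcount could lead to the marked BUG_ON, here is a minimal userspace sketch, not the libceph code. It assumes the connection's get callback refuses to take a reference once the owner's count has dropped to zero or below and returns NULL in that case. All sketch_* names are hypothetical, and the counter is a plain int rather than an atomic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the object that owns a connection. */
struct sketch_osd {
    int ref; /* models ceph_osd::o_ref; <= 0 means no valid reference is left */
};

/* Hypothetical stand-in for a connection embedded in its owner. */
struct sketch_con {
    struct sketch_osd *owner;
};

/* Hypothetical stand-in for a message about to be queued. */
struct sketch_msg {
    struct sketch_con *con;
};

/*
 * Assumed behaviour of con->ops->get(): take a reference only if the
 * count is still positive, otherwise return NULL. Not atomic here.
 */
static struct sketch_con *sketch_get(struct sketch_con *con)
{
    if (con->owner->ref <= 0)
        return NULL;
    con->owner->ref++;
    return con;
}

/*
 * Mirrors the second check in the quoted snippet: returns true where
 * the kernel would hit BUG_ON(msg->con == NULL).
 */
static bool sketch_send_would_bug(struct sketch_con *con, struct sketch_msg *msg)
{
    msg->con = sketch_get(con);
    return msg->con == NULL;
}
```

Under that assumption, a send on a connection whose owner has already dropped to a zero (or negative) count leaves msg->con NULL, which is the condition the marked assertion checks.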

#3 Updated by Nikola Ciprich about 9 years ago

Hello Ilya, sorry, unfortunately that's all that got logged (it's a netconsole log). I'll try to reproduce the problem and get more info.