Bug #42661

kernel panic not syncing Fatal exception

Added by Hughen X 4 months ago. Updated 4 months ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: fs/ceph
Target version:
% Done: 0%
Source:
Tags: kernel panic, reboot
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature:

Description

Linux Kernel Version: 4.14.74-coreos
Ceph Version: 14.2.3

The libceph client crashes and the system restarts automatically.
Before the crash, the kernel log printed the following stack information:

<1>[42564.946115] BUG: unable to handle kernel NULL pointer dereference at 0000000000000350
<1>[42564.954042] IP: __ceph_remove_cap+0x20/0x210 [ceph]
<6>[42564.959006] PGD 0 P4D 0 
<4>[42564.961625] Oops: 0000 [#1] SMP PTI
<4>[42564.965195] Modules linked in: xfs cbc ceph fscache binfmt_misc xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 veth xt_comment xt_mark nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_conntrack br_netfilter bridge stp llc ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nvidia_uvm(POE) iptable_filter vxlan ip6_udp_tunnel udp_tunnel nvidia(POE) nvidia_drm(POE) overlay nls_ascii nls_cp437 vfat fat coretemp hwmon x86_pkg_temp_thermal kvm_intel ipmi_ssif kvm iTCO_wdt iTCO_vendor_support i2c_i801 mei_me irqbypass evdev mousedev i2c_core mei pcc_cpufreq ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel rbd libceph libcrc32c ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid
<4>[42565.036614]  dm_verity dm_bufio sd_mod crc32c_intel aesni_intel aes_x86_64 xhci_pci i40e crypto_simd ahci cryptd nvme ptp xhci_hcd libahci glue_helper pps_core nvme_core libata usbcore scsi_mod usb_common dm_mirror dm_region_hash dm_log dm_mod dax
<4>[42565.059801] CPU: 2 PID: 46802 Comm: kworker/2:2 Tainted: P           OE   4.14.74-coreos #1
<4>[42565.068819] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 2.0b 02/13/2018
<4>[42565.077675] Workqueue: ceph-msgr ceph_msg_new [libceph]
<4>[42565.083275] task: ffff95d147ee0000 task.stack: ffffbc09e79dc000
<4>[42565.089569] RIP: 0010:__ceph_remove_cap+0x20/0x210 [ceph]
<4>[42565.095340] RSP: 0018:ffffbc09e79dfc10 EFLAGS: 00010282
<4>[42565.100940] RAX: 00000000010dc200 RBX: ffff95e0fd7ca5a0 RCX: 0000000000000000
<4>[42565.108452] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff95e0fd7ca5a0
<4>[42565.115968] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
<4>[42565.123491] R10: 0000000000000d55 R11: ffffffffffffffff R12: ffff95da60e18800
<4>[42565.131029] R13: ffff95e10f8d6918 R14: 0000000000000400 R15: 0000000000000001
<4>[42565.138588] FS:  0000000000000000(0000) GS:ffff95daffa80000(0000) knlGS:0000000000000000
<4>[42565.147358] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[42565.153485] CR2: 0000000000000350 CR3: 0000001a7c20a006 CR4: 00000000007606e0
<4>[42565.160998] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[42565.168514] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[42565.176026] PKRU: 55555554
<4>[42565.179104] Call Trace:
<4>[42565.181933]  ceph_security_xattr_deadlock+0xaff/0x3260 [ceph]
<4>[42565.188058]  ceph_security_xattr_deadlock+0x548/0x3260 [ceph]
<4>[42565.194187]  ? ceph_security_xattr_deadlock+0x890/0x3260 [ceph]
<4>[42565.200496]  ceph_mdsc_handle_mdsmap+0x892/0x1d10 [ceph]
<4>[42565.206215]  ceph_msg_new+0x7d3/0x24d0 [libceph]
<4>[42565.211237]  ? __switch_to_asm+0x40/0x70
<4>[42565.215557]  ? __switch_to_asm+0x34/0x70
<4>[42565.219878]  ? __switch_to_asm+0x40/0x70
<4>[42565.224204]  ? __switch_to_asm+0x34/0x70
<4>[42565.228512]  ? __switch_to_asm+0x40/0x70
<4>[42565.232837]  ? __switch_to_asm+0x34/0x70
<4>[42565.237137]  ? __switch_to+0xa2/0x450
<4>[42565.241174]  ? __switch_to_asm+0x40/0x70
<4>[42565.245473]  ? __switch_to_asm+0x34/0x70
<4>[42565.249771]  ? __switch_to_asm+0x40/0x70
<4>[42565.254072]  process_one_work+0x1da/0x3d0
<4>[42565.258456]  worker_thread+0x2b/0x3f0
<4>[42565.262493]  ? process_one_work+0x3d0/0x3d0
<4>[42565.267066]  kthread+0x11a/0x130
<4>[42565.270669]  ? kthread_create_on_node+0x70/0x70
<4>[42565.275570]  ret_from_fork+0x35/0x40
<4>[42565.279517] Code: 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56 41 89 f7 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 2f 4c 8b 67 20 <48> 8b 85 50 03 00 00 48 8b 80 08 04 00 00 48 8b 40 28 48 89 04 
<1>[42565.299403] RIP: __ceph_remove_cap+0x20/0x210 [ceph] RSP: ffffbc09e79dfc10
<4>[42565.306658] CR2: 0000000000000350
<4>[42565.310386] ---[ end trace bb8167e8a61f6544 ]---
<0>[42565.863215] Kernel panic - not syncing: Fatal exception
<0>[42565.904530] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

History

#1 Updated by Hughen X 4 months ago

I upgraded the client kernel to 4.19.78 and encountered a new error stack.
In addition, I don't quite understand how the frames in this call stack relate to each other. What happened during ceph_msg_new?

<1>[126678.211422] BUG: unable to handle kernel NULL pointer dereference at 0000000000000368
<6>[126678.219604] PGD 0 P4D 0
<4>[126678.222331] Oops: 0000 [#1] SMP PTI
<4>[126678.226045] CPU: 44 PID: 2068 Comm: kworker/44:1 Tainted: P           OE     4.19.78-coreos #1
<4>[126678.235032] Hardware name: Supermicro SYS-4028GR-TRT2/X10DRG-OT+-CPU, BIOS 2.0c 07/21/2017
<4>[126678.243663] Workqueue: ceph-msgr ceph_msg_new [libceph]
<4>[126678.252844] RIP: 0010:__ceph_remove_cap+0x20/0x200 [ceph]
<4>[126678.262341] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 89 f7 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 2f 4c 8b 67 20 <48> 8b 85 68 03 00 00 48 8b 80 08 04 00 00 48 8b 40 30 48 89 04 24
<4>[126678.290008] RSP: 0000:ffff9aedbbde7bd0 EFLAGS: 00010282
<4>[126678.299725] RAX: 0000000004e702a2 RBX: ffff90208d5b1168 RCX: 0000000000000000
<4>[126678.311372] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff90208d5b1168
<4>[126678.323013] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff98262000
<4>[126678.334575] R10: 0000000000000d55 R11: 0000000000000001 R12: ffff900c3825b000
<4>[126678.346092] R13: ffff9020f65590c8 R14: 0000000000000000 R15: 0000000000000001
<4>[126678.357603] FS:  0000000000000000(0000) GS:ffff902c7f800000(0000) knlGS:0000000000000000
<4>[126678.370188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[126678.380421] CR2: 0000000000000368 CR3: 00000009e120a001 CR4: 00000000003606e0
<4>[126678.392278] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[126678.404064] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[126678.415860] Call Trace:
<4>[126678.423095]  ceph_quota_update_statfs+0x1ce3/0x3530 [ceph]
<4>[126678.433507]  ceph_quota_update_statfs+0x138b/0x3530 [ceph]
<4>[126678.443928]  ? ceph_quota_update_statfs+0x1c20/0x3530 [ceph]
<4>[126678.454706]  ceph_trim_caps+0x73/0x2b0 [ceph]
<4>[126678.464202]  ceph_mdsc_handle_mdsmap+0xd08/0x1b60 [ceph]
<4>[126678.474760]  ? ceph_unarmor+0x38b/0x1650 [libceph]
<4>[126678.484913]  ceph_msg_new+0xdf3/0x2bb0 [libceph]
<4>[126678.494917]  ? __switch_to_asm+0x41/0x70
<4>[126678.504292]  ? __switch_to_asm+0x41/0x70
<4>[126678.513560]  ? __switch_to_asm+0x41/0x70
<4>[126678.522917]  ? __switch_to_asm+0x35/0x70
<4>[126678.532267]  ? __switch_to_asm+0x41/0x70
<4>[126678.541548]  ? __switch_to_asm+0x41/0x70
<4>[126678.550833]  ? __switch_to+0x8c/0x440
<4>[126678.559948]  ? __switch_to_asm+0x35/0x70
<4>[126678.569309]  process_one_work+0x206/0x400
<4>[126678.578884]  worker_thread+0x2d/0x3e0
<4>[126678.588141]  ? process_one_work+0x400/0x400
<4>[126678.598054]  kthread+0x112/0x130
<4>[126678.606982]  ? kthread_bind+0x30/0x30
<4>[126678.616375]  ret_from_fork+0x35/0x40
<4>[126678.625760] Modules linked in: xfs cbc ceph fscache binfmt_misc xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 veth xt_comment xt_mark nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype xt_conntrack br_netfilter bridge stp llc ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter vxlan ip6_udp_tunnel udp_tunnel nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) overlay nvidia_drm(POE) nls_ascii nls_cp437 vfat fat sb_edac edac_core coretemp ipmi_ssif x86_pkg_temp_thermal kvm_intel efi_pstore i2c_i801 kvm mei_me efivars irqbypass mousedev evdev i2c_core mei pcc_cpufreq ipmi_si ipmi_devintf ipmi_msghandler button sch_fq_codel rbd libceph libcrc32c ext4 crc32c_generic crc16 mbcache jbd2 fscrypto dm_verity dm_bufio hid_generic usbhid hid
<4>[126678.745940]  crc32c_intel ahci aesni_intel xhci_pci libahci aes_x86_64 ehci_pci ixgbe crypto_simd xhci_hcd nvme ehci_hcd cryptd hwmon libata glue_helper nvme_core mdio usbcore scsi_mod usb_common dm_mirror dm_region_hash dm_log dm_mod
<4>[126678.783070] CR2: 0000000000000368
<4>[126678.794501] ---[ end trace 050a976ecec5f20e ]---
<4>[126680.237466] RIP: 0010:__ceph_remove_cap+0x20/0x200 [ceph]
<4>[126680.250652] Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 89 f7 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 2f 4c 8b 67 20 <48> 8b 85 68 03 00 00 48 8b 80 08 04 00 00 48 8b 40 30 48 89 04 24
<4>[126680.285224] RSP: 0000:ffff9aedbbde7bd0 EFLAGS: 00010282
<4>[126680.298430] RAX: 0000000004e702a2 RBX: ffff90208d5b1168 RCX: 0000000000000000
<4>[126680.313580] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff90208d5b1168
<4>[126680.328686] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff98262000
<4>[126680.343800] R10: 0000000000000d55 R11: 0000000000000001 R12: ffff900c3825b000
<4>[126680.358949] R13: ffff9020f65590c8 R14: 0000000000000000 R15: 0000000000000001
<4>[126680.374144] FS:  0000000000000000(0000) GS:ffff902c7f800000(0000) knlGS:0000000000000000
<4>[126680.390356] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[126680.404112] CR2: 0000000000000368 CR3: 00000009e120a001 CR4: 00000000003606e0
<4>[126680.419220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[126680.434246] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<0>[126680.449242] Kernel panic - not syncing: Fatal exception
<0>[126680.546283] Kernel Offset: 0x17000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

#2 Updated by Ilya Dryomov 4 months ago

  • Category changed from libceph to fs/ceph
  • Assignee set to Jeff Layton

Hughen X wrote:

I upgraded the client kernel to 4.19.78 and encountered a new error stack.
In addition, I don't quite understand how the frames in this call stack relate to each other. What happened during ceph_msg_new?

[...]

These weird stacks appear to be a "feature" of CoreOS kernels, see https://tracker.ceph.com/issues/23706#note-4. Given that the crash is in the filesystem, this is unlikely to be a libceph issue.

#3 Updated by Jeff Layton 4 months ago

Agreed. I can't make much sense of the stack trace either. The offset into __ceph_remove_cap is pretty low, which implies that the crash happened early in that function. That said, I'm not sure how much we can trust that given the other weirdness in there.

If you're able (and have the debuginfo), it might be nice to load the kmod into gdb and see what line it oopsed on:

$ gdb /path/to/ceph.ko
[...]
(gdb) list *(__ceph_remove_cap+0x20)
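
Even without debuginfo, the "Code:" line of the oops gives a rough cross-check on the low offset. The bytes at the `<48>` marker are `48 8b 85 50 03 00 00`, which under standard x86-64 encoding is a 64-bit load from `[rbp + 0x350]`. RBP is 0 in the register dump, so the load address matches CR2. The instruction two steps earlier (`48 8b 2f`) loads RBP from `(%rdi)`, so, assuming RDI still holds the first argument at that point, the first pointer-sized field of whatever that argument points to appears to have been NULL. The Python sketch below only illustrates that arithmetic. The helper names are hypothetical, and it decodes just this one encoding rather than acting as a general disassembler:

```python
# "Code:" line copied from the 4.14.74 oops in the description.
CODE_LINE = (
    "90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 41 56 41 89 f7 "
    "41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 2f 4c 8b 67 20 <48> 8b 85 "
    "50 03 00 00 48 8b 80 08 04 00 00 48 8b 40 28 48 89 04"
)


def faulting_bytes(code_line):
    """Return the bytes starting at the <xx> marker, i.e. at the faulting RIP."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:])


def decode_mov_rbp_disp32(insn):
    """Decode only REX.W 8B /r with mod=10 and rm=rbp (mov r64, [rbp+disp32]).

    Returns the signed 32-bit displacement, or None for anything else.
    """
    if len(insn) < 7 or insn[0] != 0x48 or insn[1] != 0x8B:
        return None
    modrm = insn[2]
    mod, rm = modrm >> 6, modrm & 0x7
    if mod != 0b10 or rm != 0b101:
        return None
    return int.from_bytes(insn[3:7], "little", signed=True)


disp = decode_mov_rbp_disp32(faulting_bytes(CODE_LINE))
rbp = 0x0    # RBP from the register dump
cr2 = 0x350  # CR2 from the register dump
print(hex(disp), rbp + disp == cr2)
```

The same check on the 4.19.78 oops gives a displacement of 0x368, again equal to CR2 with RBP at 0, which suggests both crashes are the same NULL dereference with a slightly different structure layout. This does not replace the gdb lookup above, which is still needed to name the source line.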

#4 Updated by Jeff Layton 4 months ago

  • Status changed from New to Need More Info
