Bug #57817 (closed)

Parent issues: Bug #61200: ceph: corrupt snap message from mds1 » Bug #57686: general protection fault and CephFS kernel client hangs after MDS failover

general protection fault and CephFS kernel client hangs after MDS failover

Added by Andreas Teuchert over 1 year ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

I believe that this is the same bug as https://tracker.ceph.com/issues/57686, but in case I'm wrong, I'm opening this separate report.

We have a four-node Ceph cluster (Ceph 17.2.1, Ubuntu 20.04, kernel 5.15.0-48-generic #54~20.04.1-Ubuntu), managed by cephadm, that contains two CephFS file systems. On those nodes, one of the file systems is mounted via the kernel client. There are eight active MDSs and four standby MDSs for that FS.

In one instance of an MDS failure (three MDSs failing concurrently), the following was logged to dmesg; afterwards, accesses to the mounted directory just hang and the node has to be rebooted (reset, actually, because the directory can no longer be unmounted):

[Wed Oct  5 15:21:48 2022] ceph: mds0 reconnect start
[Wed Oct  5 15:21:55 2022] ceph: mds1 reconnect start
[Wed Oct  5 15:21:55 2022] ceph: mds4 reconnect start
[Wed Oct  5 15:21:55 2022] ceph: mds4 reconnect success
[Wed Oct  5 15:21:55 2022] ceph: mds0 reconnect success
[Wed Oct  5 15:21:55 2022] ceph: mds1 reconnect success
[Wed Oct  5 15:22:05 2022] ceph: update_snap_trace error -5
[Wed Oct  5 15:22:05 2022] ceph: update_snap_trace error -5
[Wed Oct  5 15:22:07 2022] ceph: mds4 recovery completed
[Wed Oct  5 15:22:07 2022] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP NOPTI
[Wed Oct  5 15:22:07 2022] CPU: 22 PID: 137753 Comm: nfsd Not tainted 5.15.0-48-generic #54~20.04.1-Ubuntu
[Wed Oct  5 15:22:07 2022] Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
[Wed Oct  5 15:22:07 2022] RIP: 0010:ceph_get_snap_realm+0x5e/0x90 [ceph]
[Wed Oct  5 15:22:07 2022] Code: 89 e7 e8 55 62 a1 d6 b8 01 00 00 00 f0 0f c1 43 10 85 c0 75 2a 48 8b 8b 98 00 00 00 48 8b 93 a0 00 00 00 48 8d 83 98 00 00 00 <48> 89 51 08 48 89 0a 48 89 83 98 00 00 00 48 89 83 a0 00 00 00 4c
[Wed Oct  5 15:22:07 2022] RSP: 0018:ffffaeab4193fae0 EFLAGS: 00010246
[Wed Oct  5 15:22:07 2022] RAX: ffff8b9027cf1698 RBX: ffff8b9027cf1600 RCX: dead000000000100
[Wed Oct  5 15:22:07 2022] RDX: dead000000000122 RSI: ffff8b9027cf1600 RDI: ffff8b8a64ae2114
[Wed Oct  5 15:22:07 2022] RBP: ffffaeab4193faf0 R08: ffff8b8a64ae2000 R09: ffff8b8ab66d0120
[Wed Oct  5 15:22:07 2022] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b8a64ae2114
[Wed Oct  5 15:22:07 2022] R13: ffffaeab4193fc48 R14: ffff8b8a64ae2000 R15: ffff8b98a6bc60d8
[Wed Oct  5 15:22:07 2022] FS:  0000000000000000(0000) GS:ffff8be741180000(0000) knlGS:0000000000000000
[Wed Oct  5 15:22:07 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Oct  5 15:22:07 2022] CR2: 000055caf6ce3100 CR3: 000000010d2d8004 CR4: 00000000007706e0
[Wed Oct  5 15:22:07 2022] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Oct  5 15:22:07 2022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Wed Oct  5 15:22:07 2022] PKRU: 55555554
[Wed Oct  5 15:22:07 2022] Call Trace:
[Wed Oct  5 15:22:07 2022]  <TASK>
[Wed Oct  5 15:22:07 2022]  check_quota_exceeded+0x80/0x230 [ceph]
[Wed Oct  5 15:22:07 2022]  ? __cond_resched+0x19/0x40
[Wed Oct  5 15:22:07 2022]  ceph_quota_is_max_bytes_exceeded+0x5d/0x70 [ceph]
[Wed Oct  5 15:22:07 2022]  ceph_write_iter+0x1a3/0x7b0 [ceph]
[Wed Oct  5 15:22:07 2022]  do_iter_readv_writev+0x152/0x1c0
[Wed Oct  5 15:22:07 2022]  do_iter_write+0x8c/0x1d0
[Wed Oct  5 15:22:07 2022]  vfs_iter_write+0x19/0x30
[Wed Oct  5 15:22:07 2022]  nfsd_vfs_write+0x149/0x610 [nfsd]
[Wed Oct  5 15:22:07 2022]  ? nfs4_put_stid+0xfa/0x110 [nfsd]
[Wed Oct  5 15:22:07 2022]  nfsd4_write+0x130/0x1b0 [nfsd]
[Wed Oct  5 15:22:07 2022]  nfsd4_proc_compound+0x3a0/0x770 [nfsd]
[Wed Oct  5 15:22:07 2022]  nfsd_dispatch+0x160/0x260 [nfsd]
[Wed Oct  5 15:22:07 2022]  svc_process_common+0x3d5/0x720 [sunrpc]
[Wed Oct  5 15:22:07 2022]  ? svc_sock_secure_port+0x16/0x40 [sunrpc]
[Wed Oct  5 15:22:07 2022]  ? nfsd_svc+0x390/0x390 [nfsd]
[Wed Oct  5 15:22:07 2022]  svc_process+0xbc/0x100 [sunrpc]
[Wed Oct  5 15:22:07 2022]  nfsd+0xed/0x150 [nfsd]
[Wed Oct  5 15:22:07 2022]  ? nfsd_shutdown_threads+0x90/0x90 [nfsd]
[Wed Oct  5 15:22:07 2022]  kthread+0x127/0x150
[Wed Oct  5 15:22:07 2022]  ? set_kthread_struct+0x50/0x50
[Wed Oct  5 15:22:07 2022]  ret_from_fork+0x1f/0x30
[Wed Oct  5 15:22:07 2022]  </TASK>
[Wed Oct  5 15:22:07 2022] Modules linked in: rpcsec_gss_krb5 tcp_diag udp_diag inet_diag binfmt_misc ceph libceph fscache netfs overlay 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_hl ip6table_filter ip6_tables xt_LOG nf_log_syslog ipt_REJECT nf_reject_ipv4 xt_multiport xt_comment xt_state iptable_filter bpfilter sch_fq_codel intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ast kvm drm_vram_helper crct10dif_pclmul ghash_clmulni_intel drm_ttm_helper ttm aesni_intel drm_kms_helper crypto_simd cec cryptd rc_core i2c_algo_bit rapl irdma ice ib_uverbs intel_cstate ib_core ipmi_ssif fb_sys_fops syscopyarea mei_me ioatdma sysfillrect joydev input_leds sysimgblt mei intel_pch_thermal dca acpi_power_meter acpi_pad mac_hid acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler bonding tls xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 drm nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 raid10 raid456
[Wed Oct  5 15:22:07 2022]  async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear ses enclosure hid_generic raid1 usbhid hid mpt3sas i2c_i801 ahci xhci_pci raid_class crc32_pclmul i40e scsi_transport_sas i2c_smbus lpc_ich libahci xhci_pci_renesas wmi
[Wed Oct  5 15:22:07 2022] ---[ end trace b4a7efa5d0c82dd5 ]---
[Wed Oct  5 15:22:07 2022] RIP: 0010:ceph_get_snap_realm+0x5e/0x90 [ceph]
[Wed Oct  5 15:22:07 2022] Code: 89 e7 e8 55 62 a1 d6 b8 01 00 00 00 f0 0f c1 43 10 85 c0 75 2a 48 8b 8b 98 00 00 00 48 8b 93 a0 00 00 00 48 8d 83 98 00 00 00 <48> 89 51 08 48 89 0a 48 89 83 98 00 00 00 48 89 83 a0 00 00 00 4c
[Wed Oct  5 15:22:07 2022] RSP: 0018:ffffaeab4193fae0 EFLAGS: 00010246
[Wed Oct  5 15:22:07 2022] RAX: ffff8b9027cf1698 RBX: ffff8b9027cf1600 RCX: dead000000000100
[Wed Oct  5 15:22:07 2022] RDX: dead000000000122 RSI: ffff8b9027cf1600 RDI: ffff8b8a64ae2114
[Wed Oct  5 15:22:07 2022] RBP: ffffaeab4193faf0 R08: ffff8b8a64ae2000 R09: ffff8b8ab66d0120
[Wed Oct  5 15:22:07 2022] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8b8a64ae2114
[Wed Oct  5 15:22:07 2022] R13: ffffaeab4193fc48 R14: ffff8b8a64ae2000 R15: ffff8b98a6bc60d8
[Wed Oct  5 15:22:07 2022] FS:  0000000000000000(0000) GS:ffff8be741180000(0000) knlGS:0000000000000000
[Wed Oct  5 15:22:07 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Oct  5 15:22:07 2022] CR2: 000055caf6ce3100 CR3: 000000010d2d8004 CR4: 00000000007706e0
[Wed Oct  5 15:22:07 2022] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[Wed Oct  5 15:22:07 2022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[Wed Oct  5 15:22:07 2022] PKRU: 55555554
[Wed Oct  5 15:22:09 2022] ceph: mds1 recovery completed
[Wed Oct  5 15:22:13 2022] ceph: mds0 recovery completed
[Wed Oct  5 15:23:22 2022] libceph: mds3 (1)10.1.3.141:6803 socket closed (con state OPEN)
[Wed Oct  5 15:23:22 2022] libceph: mds7 (1)10.1.3.141:6801 socket closed (con state OPEN)
[Wed Oct  5 15:23:22 2022] libceph: mds5 (1)10.1.3.140:6805 socket closed (con state OPEN)
[Wed Oct  5 15:23:22 2022] libceph: mds6 (1)10.1.3.140:6807 socket closed (con state OPEN)
[Wed Oct  5 15:23:22 2022] libceph: mds2 (1)10.1.3.140:6801 socket closed (con state OPEN)
[Wed Oct  5 15:23:22 2022] libceph: mds2 (1)10.1.3.140:6801 session reset
[Wed Oct  5 15:23:22 2022] ceph: mds2 closed our session
[Wed Oct  5 15:23:22 2022] ceph: mds2 reconnect start
[Wed Oct  5 15:46:24 2022] ceph: mds6 reconnect start
[Wed Oct  5 15:46:54 2022] libceph: mon1 (1)10.1.3.140:6789 session lost, hunting for new mon
[Wed Oct  5 16:01:24 2022] libceph: mds6 (1)10.1.3.142:6805 socket closed (con state OPEN)

The code path is different from the one in #57686, but in the end it is also a general protection fault in ceph_get_snap_realm (the same one, I believe).

In my experience this variant is harder to reproduce. I assume that a write to a file needs to happen at the "wrong" moment to trigger the bug.
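
For reference, my own reading of the oops (not part of the original report): the non-canonical address 0xdead000000000108 is the kernel's LIST_POISON1 constant plus the offset of the prev pointer in a struct list_head, which is the address list_del() writes to when it is handed an entry that has already been removed and poisoned (typical of a use-after-free or double removal). The register dump fits this reading: RCX and RDX hold LIST_POISON1 and LIST_POISON2, and the faulting instruction looks like the list unlink on the realm's empty-list entry inside ceph_get_snap_realm. A minimal userspace sketch of the arithmetic, using the x86_64 poison constants:

/*
 * Minimal userspace sketch (annotation, not from the report) of why the oops
 * above faults at 0xdead000000000108: LIST_POISON1 plus the offset of ->prev
 * in struct list_head, i.e. the first address list_del() writes to when given
 * an entry that was already deleted and poisoned.
 */
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* x86_64 kernel poison constants (include/linux/poison.h). */
#define LIST_POISON1 ((void *)0xdead000000000100ULL) /* stored in ->next on deletion */
#define LIST_POISON2 ((void *)0xdead000000000122ULL) /* stored in ->prev on deletion */

struct list_head { struct list_head *next, *prev; };

int main(void)
{
    /* Pretend this entry was already removed from its list once. */
    struct list_head entry = { .next = LIST_POISON1, .prev = LIST_POISON2 };

    /* A second list_del() starts with entry.next->prev = entry.prev, so its
     * first write goes to LIST_POISON1 + offsetof(struct list_head, prev). */
    uintptr_t fault = (uintptr_t)entry.next + offsetof(struct list_head, prev);
    printf("second list_del() would write to %#lx\n", (unsigned long)fault);
    /* Prints 0xdead000000000108 -- the GPF address in the trace above. */
    return 0;
}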

#1 - Updated by Greg Farnum over 1 year ago

  • Assignee set to Xiubo Li
#2 - Updated by Xiubo Li over 1 year ago

  • Status changed from New to Duplicate
  • Parent task set to #57686

This is exactly the same issue as tracker #57686.
