Bug #49210
BUG after network outage then failure to reconnect
Description
Several of our clients failed to reconnect after a network outage on the client side.
[8285696.158169] ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm ...
Some details about a stuck client are here: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/DETOXXXRM3BRN4UNFUNNA7X4O7A4QZU7/
(Clients are running 3.10.0-1127.19.1.el7.x86_64, servers are running v14.2.11)
Also some of the clients crashed like this:
[8288926.622491] kernel BUG at fs/ceph/mds_client.c:600!
[8288926.622736] invalid opcode: 0000 [#1] SMP
[8288926.622972] Modules linked in: tcp_diag inet_diag ceph libceph dns_resolver ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi ib_cm nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter mlx4_ib ip6_tables ib_uverbs nf_log_ipv4 ib_core nf_log_common xt_LOG xt_limit xt_pkttype ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack nf_conntrack libcrc32c iptable_filter iTCO_wdt iTCO_vendor_support intel_wmi_thunderbolt sb_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev mei_me
[8288926.625029] i2c_i801 sg mei lpc_ich wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter tty_kraven(OE) openafs(POE) netlog(OE) execlog(OE) secure_log(OE) binfmt_misc ip_tables ext4 mbcache jbd2 mlx4_en raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx4_core ahci drm ixgbe libahci libata crct10dif_pclmul crct10dif_common crc32c_intel mdio ptp devlink pps_core drm_panel_orientation_quirks dca
[8288926.627343] CPU: 5 PID: 271979 Comm: kworker/5:2 Kdump: loaded Tainted: P OE ------------ 3.10.0-1127.19.1.el7.x86_64 #1
[8288926.628378] Hardware name: Quanta Computer Inc QuantaPlex T41S-2U/S2S-MB, BIOS S2S_3A19 12/09/2015
[8288926.628946] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[8288926.629510] task: ffff92b7238f0000 ti: ffff92bb12d80000 task.ti: ffff92bb12d80000
[8288926.630091] RIP: 0010:[<ffffffffc0e4936a>] [<ffffffffc0e4936a>] __unregister_request+0x1da/0x1e0 [ceph]
[8288926.630707] RSP: 0018:ffff92bb12d83b58 EFLAGS: 00010246
[8288926.631315] RAX: 0000000000000000 RBX: ffff92bbc6db0800 RCX: dead000000000200
[8288926.631939] RDX: ffff92bbc6db0b58 RSI: ffff92c3dd6bc500 RDI: ffff92bbc6db0b58
[8288926.632589] RBP: ffff92bb12d83b70 R08: ffff92bbc6db0b58 R09: ffff92bbdc14ec90
[8288926.633243] R10: 00000000000033c9 R11: ffff92b716086300 R12: ffff92bbc6db0808
[8288926.633887] R13: ffff92c3dd6bc400 R14: ffff92c3dd6bc400 R15: 0000000000000000
[8288926.634540] FS: 0000000000000000(0000) GS:ffff92bbdfd40000(0000) knlGS:0000000000000000
[8288926.635207] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[8288926.635894] CR2: 0000555ec1996c10 CR3: 0000000b2e010000 CR4: 00000000001607e0
[8288926.636583] Call Trace:
[8288926.637278] [<ffffffffc0e4b3ac>] __do_request+0xac/0x430 [ceph]
[8288926.637986] [<ffffffffc0e4b7ba>] __wake_requests+0x8a/0xe0 [ceph]
[8288926.638699] [<ffffffffc0e4bcca>] send_mds_reconnect+0x4ba/0x650 [ceph]
[8288926.639419] [<ffffffffc0e4be8e>] peer_reset+0x2e/0x40 [ceph]
[8288926.640184] [<ffffffffc0d642d9>] try_read+0x829/0x1300 [libceph]
[8288926.640956] [<ffffffff9c8e1bfe>] ? account_entity_dequeue+0xae/0xd0
[8288926.641720] [<ffffffff9c8e539c>] ? dequeue_entity+0x11c/0x5c0
[8288926.642509] [<ffffffff9ce33417>] ? kernel_sendmsg+0x37/0x50
[8288926.643321] [<ffffffffc0d64fb4>] ceph_con_workfn+0xe4/0x1530 [libceph]
[8288926.644117] [<ffffffff9cf85942>] ? __schedule+0x402/0x840
[8288926.644919] [<ffffffff9c8be6bf>] process_one_work+0x17f/0x440
[8288926.645727] [<ffffffff9c8bf7d6>] worker_thread+0x126/0x3c0
[8288926.646522] [<ffffffff9c8bf6b0>] ? manage_workers.isra.26+0x2a0/0x2a0
[8288926.647326] [<ffffffff9c8c6691>] kthread+0xd1/0xe0
[8288926.648146] [<ffffffff9c8c65c0>] ? insert_kthread_work+0x40/0x40
[8288926.648980] [<ffffffff9cf92d37>] ret_from_fork_nospec_begin+0x21/0x21
[8288926.649819] [<ffffffff9c8c65c0>] ? insert_kthread_work+0x40/0x40
[8288926.650631] Code: 89 85 f8 00 00 00 e9 98 fe ff ff 48 8b 0e 48 89 f2 48 c7 c7 50 d1 e6 c0 48 c7 c6 b8 f1 e5 c0 31 c0 e8 5b 8a d6 db e9 47 fe ff ff <0f> 0b 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55
[8288926.652407] RIP [<ffffffffc0e4936a>] __unregister_request+0x1da/0x1e0 [ceph]
[8288926.653315] RSP <ffff92bb12d83b58>
The vmcore dmesg is attached. (The network outage starts at dmesg timestamp [8285294].)
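To relate the log lines quoted in this report to the start of the outage, the bracketed dmesg timestamps (seconds since boot) can be subtracted from the outage start given above. The following is only a minimal sketch: the helper name `seconds_since` and the regular expression are illustrative assumptions, not part of any Ceph or kernel tooling, and the two sample lines are copied from this report.

```python
import re

# dmesg prefixes each line with seconds since boot, e.g. "[8288926.622491]".
# Requiring digits right after "[" skips markers such as "[#1]" or "[ceph]".
TIMESTAMP = re.compile(r"\[\s*(\d+(?:\.\d+)?)\]")


def seconds_since(reference, line):
    """Return the seconds between `reference` and the first timestamp in `line`.

    Hypothetical helper written for this illustration only.
    """
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no dmesg timestamp found")
    return float(match.group(1)) - reference


# From the report: "network outage starts at [8285294]".
OUTAGE_START = 8285294.0

# Lines quoted in the report (the first one is truncated there as well).
warning = "[8285696.158169] ceph: get_quota_realm: ino (10004fe5035.fffffffffffffffe) null i_snap_realm ..."
crash = "[8288926.622491] kernel BUG at fs/ceph/mds_client.c:600!"

for label, line in (("warning", warning), ("crash", crash)):
    offset = seconds_since(OUTAGE_START, line)
    print(f"{label}: {offset:.0f} s after outage start")
```

With the values quoted here, the get_quota_realm warning falls about 402 s after the outage start and the kernel BUG about 3633 s (roughly an hour) after it.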
We also have the vmcore itself if that is useful.
History
#1 Updated by Dan van der Ster about 2 years ago
Maybe these are old bugs just not fixed in el7 kernels?
https://tracker.ceph.com/issues/40339
https://tracker.ceph.com/issues/40340
https://tracker.ceph.com/issues/41551
#2 Updated by Jeff Layton almost 2 years ago
- Status changed from New to Rejected
Yep. RHEL7 lacks some of these fixes. Please feel free to open a bug at bugzilla.redhat.com for this, though be forewarned that RHEL7 is entering maintenance mode and these may not make the cut.
I'll go ahead and close this for now, since this looks like something that is probably already fixed in mainline.