Bug #54106
closedLinux kernel client - Bug #55857: strange auth cap state leads to potential deadlock
kclient: hang during workunit cleanup
0%
Description
2022-01-29T02:25:02.917120+00:00 smithi084 kernel: ceph: device name is missing path (no : separator in 0@0e36bce6-80a9-11ec-8c35-001a4aab830c.cephfs=/) 2022-01-29T02:25:02.918680+00:00 smithi084 kernel: libceph: mon0 (2)172.21.15.84:3300 session established 2022-01-29T02:25:02.926179+00:00 smithi084 kernel: libceph: mon0 (2)172.21.15.84:3300 socket closed (con state OPEN) 2022-01-29T02:25:02.941222+00:00 smithi084 kernel: libceph: mon0 (2)172.21.15.84:3300 session lost, hunting for new mon 2022-01-29T02:25:02.949137+00:00 smithi084 kernel: libceph: mon1 (2)172.21.15.112:3300 session established 2022-01-29T02:25:02.949852+00:00 smithi084 kernel: libceph: client15195 fsid 0e36bce6-80a9-11ec-8c35-001a4aab830c 2022-01-29T02:32:50.009527+00:00 smithi084 kernel: WARNING: CPU: 2 PID: 92503 at fs/ceph/caps.c:759 ceph_add_cap+0x638/0x650 [ceph] 2022-01-29T02:32:50.009670+00:00 smithi084 kernel: Modules linked in: ceph libceph dns_resolver veth nft_chain_nat ipt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nft_counter nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_wdt iTCO_vendor_support ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate intel_uncore acpi_ipmi pcspkr i2c_i801 lpc_ich ipmi_si wmi mei_me ipmi_devintf joydev mei ioatdma ipmi_msghandler sunrpc acpi_power_meter acpi_pad fuse dm_mod ip_tables ext4 mbcache jbd2 sd_mod sg ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper nvme ttm igb ixgbe nvme_core drm ahci libahci mdio crc32c_intel i2c_algo_bit libata dca t10_pi 2022-01-29T02:32:50.009729+00:00 smithi084 kernel: CPU: 2 PID: 92503 Comm: kworker/2:1 Kdump: loaded Not tainted 4.18.0-305.el8.x86_64 #1 2022-01-29T02:32:50.009760+00:00 smithi084 kernel: Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 2022-01-29T02:32:50.009792+00:00 smithi084 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph] 2022-01-29T02:32:50.009828+00:00 smithi084 kernel: RIP: 0010:ceph_add_cap+0x638/0x650 [ceph] 2022-01-29T02:32:50.009858+00:00 smithi084 kernel: Code: 00 00 00 ff 8b 54 24 04 0f 85 80 fd ff ff 89 54 24 04 bd 01 00 00 00 e9 97 fb ff ff 89 54 24 04 bd 01 00 00 00 e9 7e fd ff ff <0f> 0b e9 dd fb ff ff 0f 0b e9 19 fd ff ff 0f 0b e9 21 fd ff ff 0f 2022-01-29T02:32:50.009890+00:00 smithi084 kernel: RSP: 0018:ffff98dc0a8a79e0 EFLAGS: 00010246 2022-01-29T02:32:50.009923+00:00 smithi084 kernel: RAX: ffff88b436294690 RBX: ffff88b538467c18 RCX: 0000000000000000 2022-01-29T02:32:50.009957+00:00 smithi084 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88b5384678a0 2022-01-29T02:32:50.009988+00:00 smithi084 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 2022-01-29T02:32:50.010020+00:00 smithi084 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000d55 2022-01-29T02:32:50.010056+00:00 smithi084 kernel: R13: ffff88b436294690 R14: ffff88b5384678a0 R15: 0000000000000000 2022-01-29T02:32:50.010093+00:00 smithi084 kernel: FS: 0000000000000000(0000) GS:ffff88b69fc80000(0000) knlGS:0000000000000000 2022-01-29T02:32:50.010155+00:00 smithi084 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 2022-01-29T02:32:50.010189+00:00 smithi084 kernel: CR2: 000055b769b3bbc0 CR3: 00000002e7e10002 CR4: 00000000003706e0 2022-01-29T02:32:50.010222+00:00 smithi084 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 2022-01-29T02:32:50.010300+00:00 smithi084 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 2022-01-29T02:32:50.010341+00:00 smithi084 kernel: Call Trace: 2022-01-29T02:32:50.010405+00:00 smithi084 kernel: ? __cap_is_valid+0x1c/0xa0 [ceph] 2022-01-29T02:32:50.010449+00:00 smithi084 kernel: ceph_fill_inode+0xc97/0xec0 [ceph] 2022-01-29T02:32:50.010491+00:00 smithi084 kernel: ? ceph_force_reconnect+0x90/0x90 [ceph] 2022-01-29T02:32:50.010538+00:00 smithi084 kernel: ? bit_wait_io+0x50/0x50 2022-01-29T02:32:50.010587+00:00 smithi084 kernel: ceph_fill_trace+0x14a/0xa80 [ceph] 2022-01-29T02:32:50.010644+00:00 smithi084 kernel: ? ceph_update_snap_trace+0x4c5/0x510 [ceph] 2022-01-29T02:32:50.010687+00:00 smithi084 kernel: handle_reply+0x243/0x7b0 [ceph] 2022-01-29T02:32:50.010727+00:00 smithi084 kernel: mds_dispatch+0xa6/0xbe0 [ceph] 2022-01-29T02:32:50.010758+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60 2022-01-29T02:32:50.010794+00:00 smithi084 kernel: ceph_con_process_message+0x79/0x140 [libceph] 2022-01-29T02:32:50.010842+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60 2022-01-29T02:32:50.010876+00:00 smithi084 kernel: process_message+0xe/0x60 [libceph] 2022-01-29T02:32:50.010906+00:00 smithi084 kernel: ceph_con_v2_try_read+0xc3f/0x1010 [libceph] 2022-01-29T02:32:50.010953+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:32:50.010987+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:32:50.011018+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:32:50.011052+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:32:50.011104+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:32:50.011145+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:32:50.011180+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:32:50.011209+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:32:50.011254+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:32:50.011283+00:00 smithi084 kernel: ceph_con_workfn+0x18f/0x690 [libceph] 2022-01-29T02:32:50.011318+00:00 smithi084 kernel: process_one_work+0x1a7/0x360 2022-01-29T02:32:50.011365+00:00 smithi084 kernel: ? create_worker+0x1a0/0x1a0 2022-01-29T02:32:50.011399+00:00 smithi084 kernel: worker_thread+0x30/0x390 2022-01-29T02:32:50.011430+00:00 smithi084 kernel: ? create_worker+0x1a0/0x1a0 2022-01-29T02:32:50.011489+00:00 smithi084 kernel: kthread+0x116/0x130 2022-01-29T02:32:50.011520+00:00 smithi084 kernel: ? kthread_flush_work_fn+0x10/0x10 2022-01-29T02:32:50.011556+00:00 smithi084 kernel: ret_from_fork+0x35/0x40 2022-01-29T02:32:50.011585+00:00 smithi084 kernel: ---[ end trace 21c87db889c142a1 ]---
From: /ceph/teuthology-archive/pdonnell-2022-01-29_01:47:41-fs:workload-wip-pdonnell-testing-20220127.171526-distro-default-smithi/6648141/remote/smithi084/syslog/kern.log.gz
Updated by Patrick Donnelly about 2 years ago
- Related to Bug #54107: kclient: hang during umount added
Updated by Patrick Donnelly about 2 years ago
- Related to Bug #44100: cephfs rsync kworker high load. added
Updated by Patrick Donnelly about 2 years ago
May be related to #44100, per Jeff.
Updated by Jeff Layton about 2 years ago
First, note that this is an 8.4 kernel which is about a year old. Some of this code has seen s
So, 7 mins between mount and when the warning popped:
2022-01-29T02:25:02.949137+00:00 smithi084 kernel: libceph: mon1 (2)172.21.15.112:3300 session established 2022-01-29T02:25:02.949852+00:00 smithi084 kernel: libceph: client15195 fsid 0e36bce6-80a9-11ec-8c35-001a4aab830c 2022-01-29T02:32:50.009527+00:00 smithi084 kernel: WARNING: CPU: 2 PID: 92503 at fs/ceph/caps.c:759 ceph_add_cap+0x638/0x650 [ceph]
The warning is this in ceph_add_cap:
if (flags & CEPH_CAP_FLAG_AUTH) { ... } else { WARN_ON(ci->i_auth_cap == cap); }
So, while processing a trace, we got issued a cap -- probably one we already had? That new cap didn't have the AUTH flag set, but the ci->i_auth_cap pointer still pointed at it so it must have been the auth mds at one time.
There's this in the log too, which is a similar issue re the auth flag changing.
2022-01-29T02:32:50.961980+00:00 smithi084 kernel: WARNING: CPU: 6 PID: 67864 at fs/ceph/caps.c:3315 handle_cap_grant+0xb9c/0xbb0 [ceph]
After that are a bunch of hung task warnings, mostly from stuff stuck waiting for the snap_rwsem:
2022-01-29T02:36:24.325780+00:00 smithi084 kernel: INFO: task kworker/1:5:33032 blocked for more than 120 seconds. 2022-01-29T02:36:24.325882+00:00 smithi084 kernel: Tainted: G W --------- - - 4.18.0-305.el8.x86_64 #1 2022-01-29T02:36:24.325912+00:00 smithi084 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 2022-01-29T02:36:24.333977+00:00 smithi084 kernel: kworker/1:5 D 0 33032 2 0x80004080 2022-01-29T02:36:24.339831+00:00 smithi084 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph] 2022-01-29T02:36:24.345703+00:00 smithi084 kernel: Call Trace: 2022-01-29T02:36:24.348537+00:00 smithi084 kernel: __schedule+0x2c4/0x700 2022-01-29T02:36:24.352404+00:00 smithi084 kernel: schedule+0x38/0xa0 2022-01-29T02:36:24.355890+00:00 smithi084 kernel: rwsem_down_read_slowpath+0x360/0x3d0 2022-01-29T02:36:24.361008+00:00 smithi084 kernel: ceph_check_caps+0x38a/0xa60 [ceph] 2022-01-29T02:36:24.365886+00:00 smithi084 kernel: ? ktime_get_with_offset+0x4d/0xc0 2022-01-29T02:36:24.370679+00:00 smithi084 kernel: ? sg_init_table+0x11/0x30 2022-01-29T02:36:24.374768+00:00 smithi084 kernel: ? scatterwalk_map_and_copy+0x4e/0x70 2022-01-29T02:36:24.379798+00:00 smithi084 kernel: ? __cap_is_valid+0x1c/0xa0 [ceph] 2022-01-29T02:36:24.384578+00:00 smithi084 kernel: ? __ceph_caps_issued+0x6f/0xd0 [ceph] 2022-01-29T02:36:24.389711+00:00 smithi084 kernel: handle_cap_grant+0x977/0xbb0 [ceph] 2022-01-29T02:36:24.394670+00:00 smithi084 kernel: ceph_handle_caps+0xce4/0x1820 [ceph] 2022-01-29T02:36:24.399708+00:00 smithi084 kernel: mds_dispatch+0x136/0xbe0 [ceph] 2022-01-29T02:36:24.404306+00:00 smithi084 kernel: ? setup_message_sgs+0x208/0x360 [libceph] 2022-01-29T02:36:24.409775+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60 2022-01-29T02:36:24.413910+00:00 smithi084 kernel: ceph_con_process_message+0x79/0x140 [libceph] 2022-01-29T02:36:24.419719+00:00 smithi084 kernel: ? sg_miter_start+0x60/0x60 2022-01-29T02:36:24.423900+00:00 smithi084 kernel: process_message+0xe/0x60 [libceph] 2022-01-29T02:36:24.428755+00:00 smithi084 kernel: ceph_con_v2_try_read+0xc3f/0x1010 [libceph] 2022-01-29T02:36:24.434400+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:36:24.438657+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:36:24.442894+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:36:24.447128+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:36:24.451305+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:36:24.455550+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:36:24.459746+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:36:24.463939+00:00 smithi084 kernel: ? __switch_to_asm+0x41/0x70 2022-01-29T02:36:24.468137+00:00 smithi084 kernel: ? __switch_to_asm+0x35/0x70 2022-01-29T02:36:24.472350+00:00 smithi084 kernel: ceph_con_workfn+0x18f/0x690 [libceph] 2022-01-29T02:36:24.477440+00:00 smithi084 kernel: process_one_work+0x1a7/0x360 2022-01-29T02:36:24.481737+00:00 smithi084 kernel: worker_thread+0x30/0x390 2022-01-29T02:36:24.485684+00:00 smithi084 kernel: ? create_worker+0x1a0/0x1a0 2022-01-29T02:36:24.489895+00:00 smithi084 kernel: kthread+0x116/0x130 2022-01-29T02:36:24.493449+00:00 smithi084 kernel: ? kthread_flush_work_fn+0x10/0x10 2022-01-29T02:36:24.498187+00:00 smithi084 kernel: ret_from_fork+0x35/0x40
Note that these are read rwsem attempts, so that implies that something was holding it for write and not releasing it.
I'll also note that this is in there too:
2022-01-29T02:37:48.218123+00:00 smithi084 kernel: ceph: mds0 hung
...but that gets set under the session->s_mutex so it's likely that that's just fallout from all of the hung tasks.
Beyond that, I can't tell much. I don't suppose we have a vmcore from this do we?
Updated by Jeff Layton almost 2 years ago
- Assignee changed from Jeff Layton to Xiubo Li
Updated by Xiubo Li almost 2 years ago
- Status changed from New to Duplicate
- Parent task set to #55857
This is duplicated to https://tracker.ceph.com/issues/55857.