Project

General

Profile

Bug #55823

NETDEV WATCHDOG: enp3s0f1 (ixgbe): transmit queue 3 timed out

Added by Xiubo Li 6 months ago. Updated 6 months ago.

Status:
New
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
kcephfs
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/vshankar-2022-05-13_17:09:16-fs-wip-vshankar-testing-20220513-120051-testing-default-smithi/6832877/console_logs/smithi117.log

http://qa-proxy.ceph.com/teuthology/vshankar-2022-05-31_02:47:51-fs-wip-vshankar-fscrypt-20220530-091336-testing-default-smithi/6853733/console_logs/smithi025.log

The teuthology test failed without any logs in remote/ direcotry and could see the following call traces:

[ 1081.462059] ------------[ cut here ]------------
[ 1081.466857] NETDEV WATCHDOG: enp3s0f1 (ixgbe): transmit queue 3 timed out
[ 1081.473785] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:530 dev_watchdog+0x20f/0x250
[ 1081.482199] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt ipmi_ssif iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel i2c_i801 lpc_ich wmi i2c_smbus mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mei_me mei ioatdma ixgbe igb mdio i2c_algo_bit crc32c_intel ptp nvme pps_core dca nvme_core fuse
[ 1081.537034] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G S                5.18.0-rc6-ceph-gb850759e370d #1
[ 1081.546475] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[ 1081.553993] RIP: 0010:dev_watchdog+0x20f/0x250
[ 1081.558537] Code: 00 e9 4d ff ff ff 48 89 df c6 05 31 96 f6 00 01 e8 c6 fb f9 ff 44 89 e9 48 89 de 48 c7 c7 70 e2 41 82 48 89 c2 e8 8b 19 19 00 <0f> 0b e9 2f ff ff ff e8 f5 05 1e 00 85 c0 74 b5 80 3d f8 8e f6 00
[ 1081.577478] RSP: 0018:ffffc900001b4e80 EFLAGS: 00010282
[ 1081.582807] RAX: 0000000000000000 RBX: ffff88817b2c0000 RCX: 0000000000000103
[ 1081.590042] RDX: 0000000000000103 RSI: ffffffff823936a5 RDI: 00000000ffffffff
[ 1081.597290] RBP: ffff88817b2c0508 R08: 0000000000000000 R09: c0000000ffffdfff
[ 1081.604539] R10: 0000000000000001 R11: ffffc900001b4ca0 R12: ffff88817b2c0420
[ 1081.611786] R13: 0000000000000003 R14: ffff88817b2c0508 R15: ffff88885fcdd480
[ 1081.619027] FS:  0000000000000000(0000) GS:ffff88885fcc0000(0000) knlGS:0000000000000000
[ 1081.627257] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1081.633108] CR2: 000055ebf3d75000 CR3: 0000000002612004 CR4: 00000000003706e0
[ 1081.640347] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1081.647590] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1081.654828] Call Trace:
[ 1081.657366]  <IRQ>
[ 1081.659474]  ? mq_change_real_num_tx+0xd0/0xd0
[ 1081.664022]  call_timer_fn+0xa1/0x2c0
[ 1081.667794]  ? mq_change_real_num_tx+0xd0/0xd0
[ 1081.672345]  run_timer_softirq+0x50d/0x530
[ 1081.676547]  ? lock_is_held_type+0xe3/0x140
[ 1081.680842]  __do_softirq+0xdc/0x483
[ 1081.684522]  irq_exit_rcu+0xdc/0x120
[ 1081.688197]  sysvec_apic_timer_interrupt+0x9e/0xc0
[ 1081.693093]  </IRQ>
[ 1081.695281]  <TASK>
[ 1081.697476]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1081.702714] RIP: 0010:cpuidle_enter_state+0x10a/0x4b0
[ 1081.707868] Code: 6c dc 8f ff 45 84 ff 74 16 9c 58 0f 1f 40 00 f6 c4 02 0f 85 43 03 00 00 31 ff e8 c1 51 96 ff e8 5c f5 9d ff fb 0f 1f 44 00 00 <45> 85 f6 0f 88 12 01 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d
[ 1081.726813] RSP: 0018:ffffc9000010be88 EFLAGS: 00000206
[ 1081.732137] RAX: 000000000073c7f9 RBX: 0000000000000004 RCX: 0000000000000000
[ 1081.739373] RDX: 0000000000000000 RSI: ffffffff8240e25f RDI: ffffffff823a6b0f
[ 1081.746609] RBP: ffffe8ffffcc3a00 R08: 0000000000000001 R09: 0000000000000001
[ 1081.753847] R10: 000004ae9b8e4927 R11: ffff88885fcec1c4 R12: 000000fbcc2958a1
[ 1081.761084] R13: ffffffff827ed4c0 R14: 0000000000000004 R15: 0000000000000000
[ 1081.768339]  ? cpuidle_enter_state+0x104/0x4b0
[ 1081.772890]  cpuidle_enter+0x29/0x40
[ 1081.776568]  do_idle+0x245/0x260
[ 1081.787555]  cpu_startup_entry+0x19/0x20
[ 1081.791579]  start_secondary+0x106/0x130
[ 1081.795599]  secondary_startup_64_no_verify+0xd5/0xdb
[ 1081.800769]  </TASK>
[ 1081.803045] irq event stamp: 7588692
[ 1081.806724] hardirqs last  enabled at (7588702): [<ffffffff811607de>] __up_console_sem+0x5e/0x70
[ 1081.815643] hardirqs last disabled at (7588711): [<ffffffff811607c3>] __up_console_sem+0x43/0x70
[ 1081.824571] softirqs last  enabled at (7587812): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1081.833317] softirqs last disabled at (7587835): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[ 1081.841969] ---[ end trace 0000000000000000 ]---
[ 1081.846718] ixgbe 0000:03:00.1 enp3s0f1: initiating reset due to tx timeout
[ 1081.854216] ixgbe 0000:03:00.1 enp3s0f1: Reset adapter
[ 1085.558440] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1085.822639] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1086.334273] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1087.414306] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1089.398379] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1093.430734] libceph: mon1 (1)172.21.15.117:6789 socket error on write
[ 1093.954303] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [fsstress:89394]
[ 1093.959305] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [fsstress:89395]
[ 1093.960304] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [fsstress:89392]
[ 1093.960308] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt ipmi_ssif iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel i2c_i801 lpc_ich wmi i2c_smbus mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mei_me mei ioatdma ixgbe igb mdio i2c_algo_bit crc32c_intel ptp nvme pps_core dca nvme_core fuse
[ 1093.960358] irq event stamp: 50920
[ 1093.960359] hardirqs last  enabled at (50919): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1093.960364] hardirqs last disabled at (50920): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[ 1093.960367] softirqs last  enabled at (1840): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1093.960370] softirqs last disabled at (1807): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[ 1093.960372] CPU: 2 PID: 89392 Comm: fsstress Tainted: G S      W         5.18.0-rc6-ceph-gb850759e370d #1
[ 1093.960375] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[ 1093.960376] RIP: 0010:queued_spin_lock_slowpath+0x1b3/0x1e0
[ 1093.960381] Code: 83 e0 03 83 e9 01 48 c1 e0 04 48 63 c9 48 05 40 e5 02 00 48 03 04 cd 20 0b 43 82 48 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 <85> c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 e9 63 ff
[ 1093.960382] RSP: 0018:ffffc900017f3e38 EFLAGS: 00000246
[ 1093.960384] RAX: 0000000000000000 RBX: ffff88817bbfc4b0 RCX: 0000000000000007
[ 1093.960385] RDX: ffff88885fcae540 RSI: ffffffff823936a5 RDI: ffffffff823a6afe
[ 1093.960386] RBP: ffff88885fcae540 R08: 00000000000052ef R09: 0000000000000000
[ 1093.960387] R10: 0000000000000001 R11: c359abbe014cc9bb R12: 00000000000c0000
[ 1093.960388] R13: ffff88817b7d3000 R14: ffff88817b7d3998 R15: ffffffffa089c2f3
[ 1093.960390] FS:  00007f4103475500(0000) GS:ffff88885fc80000(0000) knlGS:0000000000000000
[ 1093.960391] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1093.960392] CR2: 000055ebf31cd300 CR3: 00000001d35fc002 CR4: 00000000003706e0
[ 1093.960393] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1093.960394] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1093.960396] Call Trace:
[ 1093.960397]  <TASK>
[ 1093.960400]  do_raw_spin_lock+0xb1/0xc0
[ 1093.960403]  flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1093.960422]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[ 1093.960434]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[ 1093.960449]  ? __x64_sys_tee+0xc0/0xc0
[ 1093.960453]  ceph_sync_fs+0x9b/0x100 [ceph]
[ 1093.960462]  iterate_supers+0x93/0xe0
[ 1093.960468]  ksys_sync+0x4f/0xa0
[ 1093.960471]  __do_sys_sync+0xa/0x10
[ 1093.960474]  do_syscall_64+0x34/0x80
[ 1093.960477]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1093.960481] RIP: 0033:0x7f4102ef44fb
[ 1093.960483] Code: c3 48 8b 0d 8f a9 2c 00 f7 d8 64 89 01 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d a9 2c 00 f7 d8 64 89 01 48
[ 1093.960485] RSP: 002b:00007ffc90169858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1093.960487] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007f4102ef44fb
[ 1093.960488] RDX: 000000007f123e51 RSI: 000000007f123e51 RDI: 000000000000000f
[ 1093.960489] RBP: 00007ffc90169870 R08: 00007f41031bf200 R09: 00007f41031bf260
[ 1093.960490] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000401760
[ 1093.960491] R13: 00007ffc90169bb0 R14: 0000000000000000 R15: 0000000000000000
[ 1093.960502]  </TASK>
[ 1093.961671] Modules linked in: ceph
[ 1093.962303] watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [fsstress:89393]
[ 1093.962306] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt ipmi_ssif iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel i2c_i801 lpc_ich wmi i2c_smbus mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mei_me mei ioatdma ixgbe igb mdio i2c_algo_bit crc32c_intel ptp nvme pps_core dca nvme_core fuse
[ 1093.962374] irq event stamp: 54074
[ 1093.962375] hardirqs last  enabled at (54073): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1093.962379] hardirqs last disabled at (54074): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[ 1093.962383] softirqs last  enabled at (5108): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1093.962386] softirqs last disabled at (5089): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[ 1093.962389] CPU: 4 PID: 89393 Comm: fsstress Tainted: G S      W    L    5.18.0-rc6-ceph-gb850759e370d #1
[ 1093.962392] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[ 1093.962394] RIP: 0010:queued_spin_lock_slowpath+0x1b3/0x1e0
[ 1093.962398] Code: 83 e0 03 83 e9 01 48 c1 e0 04 48 63 c9 48 05 40 e5 02 00 48 03 04 cd 20 0b 43 82 48 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 <85> c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 e9 63 ff
[ 1093.962400] RSP: 0018:ffffc900016b3e38 EFLAGS: 00000246
[ 1093.962403] RAX: 0000000000000000 RBX: ffff88817bbfc4b0 RCX: 0000000000000001
[ 1093.962405] RDX: ffff88885fd2e540 RSI: ffffffff823936a5 RDI: ffffffff823a6afe
[ 1093.962406] RBP: ffff88885fd2e540 R08: 00000000000052ef R09: 0000000000000000
[ 1093.962408] R10: 0000000000000001 R11: c359abbe014cc9bb R12: 0000000000140000
[ 1093.962410] R13: ffff88817b7d3000 R14: ffff88817b7d3998 R15: ffffffffa089c2f3
[ 1093.962412] FS:  00007f4103475500(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
[ 1093.962414] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1093.962416] CR2: 000055efcbcd2000 CR3: 000000011a0b8005 CR4: 00000000003706e0
[ 1093.962418] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1093.962420] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1093.962421] Call Trace:
[ 1093.962423]  <TASK>
[ 1093.962427]  do_raw_spin_lock+0xb1/0xc0
[ 1093.962432]  flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1093.962453]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[ 1093.962467]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[ 1093.962487]  ? __x64_sys_tee+0xc0/0xc0
[ 1093.962490]  ceph_sync_fs+0x9b/0x100 [ceph]
[ 1093.962502]  iterate_supers+0x93/0xe0
[ 1093.962510]  ksys_sync+0x4f/0xa0
[ 1093.962514]  __do_sys_sync+0xa/0x10
[ 1093.962517]  do_syscall_64+0x34/0x80
[ 1093.962521]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1093.962525] RIP: 0033:0x7f4102ef44fb
[ 1093.962528] Code: c3 48 8b 0d 8f a9 2c 00 f7 d8 64 89 01 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d a9 2c 00 f7 d8 64 89 01 48
[ 1093.962530] RSP: 002b:00007ffc90169858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1093.962533] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007f4102ef44fb
[ 1093.962535] RDX: 000000000d0cb5a8 RSI: 000000000d0cb5a8 RDI: 000000000000000b
[ 1093.962537] RBP: 00007ffc90169870 R08: 00007f41031bf1e4 R09: 00007f41031bf260
[ 1093.962539] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000401760
[ 1093.962540] R13: 00007ffc90169bb0 R14: 0000000000000000 R15: 0000000000000000
[ 1093.962560]  </TASK>
[ 1093.965303] watchdog: BUG: soft lockup - CPU#5 stuck for 26s! [fsstress:89400]
[ 1093.965306] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt ipmi_ssif iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel i2c_i801 lpc_ich wmi i2c_smbus mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mei_me mei ioatdma ixgbe igb mdio i2c_algo_bit crc32c_intel ptp nvme pps_core dca nvme_core fuse
[ 1093.965362] irq event stamp: 51000
[ 1093.965363] hardirqs last  enabled at (50999): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1093.965367] hardirqs last disabled at (51000): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[ 1093.965370] softirqs last  enabled at (1608): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1093.965372] softirqs last disabled at (1589): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[ 1093.965375] CPU: 5 PID: 89400 Comm: fsstress Tainted: G S      W    L    5.18.0-rc6-ceph-gb850759e370d #1
[ 1093.965377] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[ 1093.965378] RIP: 0010:queued_spin_lock_slowpath+0x1b0/0x1e0
[ 1093.965383] Code: c1 e9 12 83 e0 03 83 e9 01 48 c1 e0 04 48 63 c9 48 05 40 e5 02 00 48 03 04 cd 20 0b 43 82 48 89 10 8b 42 08 85 c0 75 09 f3 90 <8b> 42 08 85 c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09
[ 1093.965385] RSP: 0018:ffffc90001893e38 EFLAGS: 00000246
[ 1093.965387] RAX: 0000000000000000 RBX: ffff88817bbfc4b0 RCX: 0000000000000000
[ 1093.965389] RDX: ffff88885fd6e540 RSI: ffffffff823936a5 RDI: ffffffff823a6afe
[ 1093.965390] RBP: ffff88885fd6e540 R08: 00000000000052ef R09: 0000000000000000
[ 1093.965392] R10: 0000000000000001 R11: c359abbe014cc9bb R12: 0000000000180000
[ 1093.965393] R13: ffff88817b7d3000 R14: ffff88817b7d3998 R15: ffffffffa089c2f3
[ 1093.965395] FS:  00007f4103475500(0000) GS:ffff88885fd40000(0000) knlGS:0000000000000000
[ 1093.965397] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1093.965398] CR2: 000055a818759000 CR3: 00000001bb392004 CR4: 00000000003706e0
[ 1093.965400] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1093.965402] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1093.965403] Call Trace:
[ 1093.965404]  <TASK>
[ 1093.965409]  do_raw_spin_lock+0xb1/0xc0
[ 1093.965413]  flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1093.965431]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[ 1093.965443]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[ 1093.965461]  ? __x64_sys_tee+0xc0/0xc0
[ 1093.965464]  ceph_sync_fs+0x9b/0x100 [ceph]
[ 1093.965475]  iterate_supers+0x93/0xe0
[ 1093.965482]  ksys_sync+0x4f/0xa0
[ 1093.965486]  __do_sys_sync+0xa/0x10
[ 1093.965488]  do_syscall_64+0x34/0x80
[ 1093.965492]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1093.965495] RIP: 0033:0x7f4102ef44fb
[ 1093.965498] Code: c3 48 8b 0d 8f a9 2c 00 f7 d8 64 89 01 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d a9 2c 00 f7 d8 64 89 01 48
[ 1093.965500] RSP: 002b:00007ffc90169858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1093.965503] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007f4102ef44fb
[ 1093.965504] RDX: 0000000030ea7f1b RSI: 0000000030ea7f1b RDI: 000000000000000d
[ 1093.965505] RBP: 00007ffc90169870 R08: 00007f41031bf1f4 R09: 00007f41031bf260
[ 1093.965507] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000401760
[ 1093.965508] R13: 00007ffc90169bb0 R14: 0000000000000000 R15: 0000000000000000
[ 1093.965524]  </TASK>
[ 1093.967304] watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [fsstress:89397]
[ 1093.967308] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt ipmi_ssif iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul joydev ghash_clmulni_intel i2c_i801 lpc_ich wmi i2c_smbus mfd_core acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mei_me mei ioatdma ixgbe igb mdio i2c_algo_bit crc32c_intel ptp nvme pps_core dca nvme_core fuse
[ 1093.967383] irq event stamp: 56042
[ 1093.967384] hardirqs last  enabled at (56041): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 1093.967388] hardirqs last disabled at (56042): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[ 1093.967392] softirqs last  enabled at (55992): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[ 1093.967395] softirqs last disabled at (55987): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[ 1093.967398] CPU: 7 PID: 89397 Comm: fsstress Tainted: G S      W    L    5.18.0-rc6-ceph-gb850759e370d #1
[ 1093.967401] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
[ 1093.967403] RIP: 0010:queued_spin_lock_slowpath+0x1b3/0x1e0
[ 1093.967408] Code: 83 e0 03 83 e9 01 48 c1 e0 04 48 63 c9 48 05 40 e5 02 00 48 03 04 cd 20 0b 43 82 48 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 <85> c0 74 f7 48 8b 0a 48 85 c9 0f 84 6b ff ff ff 0f 0d 09 e9 63 ff
[ 1093.967410] RSP: 0018:ffffc9000187be38 EFLAGS: 00000246
[ 1093.967414] RAX: 0000000000000000 RBX: ffff88817bbfc4b0 RCX: 0000000000000005
[ 1093.967415] RDX: ffff88885fdee540 RSI: ffffffff823936a5 RDI: ffffffff823a6afe
[ 1093.967417] RBP: ffff88885fdee540 R08: 00000000000052ef R09: 0000000000000000
[ 1093.967419] R10: 0000000000000001 R11: c359abbe014cc9bb R12: 0000000000200000
[ 1093.967421] R13: ffff88817b7d3000 R14: ffff88817b7d3998 R15: ffffffffa089c2f3
[ 1093.967423] FS:  00007f4103475500(0000) GS:ffff88885fdc0000(0000) knlGS:0000000000000000
[ 1093.967426] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1093.967428] CR2: 0000560366ba9320 CR3: 00000001bcad2004 CR4: 00000000003706e0
[ 1093.967430] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1093.967432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1093.967434] Call Trace:
[ 1093.967435]  <TASK>
[ 1093.967440]  do_raw_spin_lock+0xb1/0xc0
[ 1093.967446]  flush_dirty_session_caps+0x82/0x150 [ceph]
[ 1093.967469]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[ 1093.967484]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[ 1093.967506]  ? __x64_sys_tee+0xc0/0xc0
[ 1093.967509]  ceph_sync_fs+0x9b/0x100 [ceph]
[ 1093.967524]  iterate_supers+0x93/0xe0
[ 1093.967532]  ksys_sync+0x4f/0xa0
[ 1093.967537]  __do_sys_sync+0xa/0x10
[ 1093.967541]  do_syscall_64+0x34/0x80
[ 1093.967545]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1093.967549] RIP: 0033:0x7f4102ef44fb
[ 1093.967553] Code: c3 48 8b 0d 8f a9 2c 00 f7 d8 64 89 01 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 a2 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d a9 2c 00 f7 d8 64 89 01 48
[ 1093.967555] RSP: 002b:00007ffc90169858 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1093.967558] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007f4102ef44fb
[ 1093.967560] RDX: 000000003783312f RSI: 000000003783312f RDI: 000000000000000f
[ 1093.967562] RBP: 00007ffc90169870 R08: 00007f41031bf204 R09: 00007f41031bf260
[ 1093.967564] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000401760
[ 1093.967566] R13: 00007ffc90169bb0 R14: 0000000000000000 R15: 0000000000000000
[ 1093.967587]  </TASK>

[13220.949256] watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [umount:203080]
[13220.956508] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13221.012949] irq event stamp: 62390
[13221.016438] hardirqs last  enabled at (62389): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13221.026466] hardirqs last disabled at (62390): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13221.036057] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13221.044600] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13221.053062] CPU: 0 PID: 203080 Comm: umount Tainted: G S                5.18.0-ceph-g342bda2e5287 #1
[13221.062302] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13221.069890] RIP: 0010:do_raw_spin_lock+0x3e/0xc0
[13221.074595] Code: 8b 53 10 65 48 8b 04 25 00 be 01 00 48 39 c2 74 4e 8b 53 08 65 8b 05 f1 ef eb 7e 39 c2 74 5a 31 c0 ba 01 00 00 00 f0 0f b1 13 <75> 67 65 8b 05 d9 ef eb 7e 89 43 08 65 48 8b 04 25 00 be 01 00 48
[13221.093501] RSP: 0018:ffffc90001b77c30 EFLAGS: 00000246
[13221.098810] RAX: 0000000000000000 RBX: ffff88817f1b51a0 RCX: 0000000000000000
[13221.106033] RDX: 0000000000000001 RSI: ffffffff8233333d RDI: ffff88817f1b51a0
[13221.113252] RBP: 0000000000000002 R08: 00000000000037ab R09: 0000000000000000
[13221.120474] R10: 0000000000000001 R11: 917bf5a7cab73d6b R12: ffff88810bfb24b0
[13221.127692] R13: ffff888235830000 R14: ffff888235830998 R15: ffffffffa0982353
[13221.134914] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13221.143107] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13221.148939] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13221.156162] Call Trace:
[13221.158689]  <TASK>
[13221.160875]  ceph_check_caps+0x68/0x9e0 [ceph]
[13221.165427]  ? __lock_acquire+0xa25/0x1d80
[13221.169611]  ? lock_is_held_type+0xe3/0x140
[13221.173879]  ? find_held_lock+0x2d/0x90
[13221.177804]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13221.183294]  ? lock_release+0x13d/0x2b0
[13221.187220]  flush_dirty_session_caps+0x72/0x150 [ceph]
[13221.192544]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13221.197689]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13221.203100]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13221.208074]  ceph_kill_sb+0x1c/0x90 [ceph]
[13221.212263]  deactivate_locked_super+0x29/0x60
[13221.216797]  cleanup_mnt+0xb8/0x140
[13221.220375]  task_work_run+0x6d/0xb0
[13221.224042]  exit_to_user_mode_prepare+0x226/0x230
[13221.228922]  syscall_exit_to_user_mode+0x25/0x60
[13221.233627]  do_syscall_64+0x40/0x80
[13221.237292]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13221.242435] RIP: 0033:0x7f11c7cfcdfb
[13221.246096] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13221.264998] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13221.272672] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13221.279897] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13221.287117] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13221.294336] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13221.301554] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13221.308785]  </TASK>
[13248.949436] watchdog: BUG: soft lockup - CPU#0 stuck for 53s! [umount:203080]
[13248.956669] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13249.013104] irq event stamp: 118000
[13249.016674] hardirqs last  enabled at (117999): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13249.026788] hardirqs last disabled at (118000): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13249.036460] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13249.045001] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13249.053465] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13249.062704] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13249.070290] RIP: 0010:rcu_is_watching+0x1/0x40
[13249.074824] Code: 92 00 e9 0d ff ff ff 66 90 0f 1f 44 00 00 48 c7 c7 45 e3 40 82 e8 9f ca 96 00 65 8a 05 7c 7d eb 7e 0f be c0 c3 0f 1f 40 00 53 <65> ff 05 98 50 ea 7e e8 63 ca 96 00 48 c7 c3 d0 e6 02 00 89 c0 48
[13249.093730] RSP: 0018:ffffc90001b77d20 EFLAGS: 00000202
[13249.099038] RAX: 0000000000000001 RBX: ffffc90001b77d47 RCX: 0000000000000000
[13249.106264] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90001b77d47
[13249.113484] RBP: ffff88810bfb24c8 R08: 0000000000000001 R09: 0000000000000000
[13249.120701] R10: 0000000000000001 R11: 917bf5a7cab73d6b R12: 0000000000000000
[13249.127921] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[13249.135140] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13249.143333] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13249.149167] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13249.156390] Call Trace:
[13249.158920]  <TASK>
[13249.161104]  rcu_read_lock_held_common+0x21/0x40
[13249.165810]  rcu_read_lock_sched_held+0x23/0x90
[13249.170427]  lock_acquire+0x19c/0x2f0
[13249.174176]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13249.179676]  ? lock_release+0x13d/0x2b0
[13249.183600]  _raw_spin_lock+0x2f/0x40
[13249.187348]  ? flush_dirty_session_caps+0x82/0x150 [ceph]
[13249.192841]  flush_dirty_session_caps+0x82/0x150 [ceph]
[13249.198167]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13249.203319]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13249.208728]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13249.213701]  ceph_kill_sb+0x1c/0x90 [ceph]
[13249.217891]  deactivate_locked_super+0x29/0x60
[13249.222418]  cleanup_mnt+0xb8/0x140
[13249.225995]  task_work_run+0x6d/0xb0
[13249.229663]  exit_to_user_mode_prepare+0x226/0x230
[13249.234544]  syscall_exit_to_user_mode+0x25/0x60
[13249.239249]  do_syscall_64+0x40/0x80
[13249.242914]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13249.248054] RIP: 0033:0x7f11c7cfcdfb
[13249.251718] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13249.270615] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13249.278291] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13249.285517] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13249.292736] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13249.299955] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13249.307175] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13249.314403]  </TASK>
[13260.774515] rcu: INFO: rcu_preempt self-detected stall on CPU
[13260.780352] rcu:     0-....: (64270 ticks this GP) idle=75d/1/0x4000000000000000 softirq=685569/685569 fqs=16199 
[13260.790851]     (t=65015 jiffies g=1444649 q=2645)
[13260.795471] NMI backtrace for cpu 0
[13260.799047] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13260.808286] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13260.823595] Call Trace:
[13260.826131]  <IRQ>
[13260.828224]  dump_stack_lvl+0x55/0x6d
[13260.831974]  nmi_cpu_backtrace.cold.4+0x30/0x7c
[13260.836588]  ? lapic_can_unplug_cpu+0x70/0x70
[13260.841032]  nmi_trigger_cpumask_backtrace+0xc9/0xe0
[13260.846089]  rcu_dump_cpu_stacks+0xce/0x150
[13260.850360]  rcu_sched_clock_irq.cold.70+0x207/0x419
[13260.855415]  ? lock_is_held_type+0xe3/0x140
[13260.859683]  ? lock_is_held_type+0xe3/0x140
[13260.863960]  ? tick_sched_do_timer+0xa0/0xa0
[13260.868313]  update_process_times+0x93/0xc0
[13260.872586]  tick_sched_handle.isra.14+0x1f/0x60
[13260.877286]  tick_sched_timer+0x6f/0x80
[13260.881210]  __hrtimer_run_queues+0x1bb/0x490
[13260.885662]  hrtimer_interrupt+0x10a/0x220
[13260.889842]  ? lock_is_held_type+0xe3/0x140
[13260.894110]  __sysvec_apic_timer_interrupt+0x8a/0x250
[13260.899247]  sysvec_apic_timer_interrupt+0x99/0xc0
[13260.904126]  </IRQ>
[13260.906311]  <TASK>
[13260.908497]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[13260.913723] RIP: 0010:debug_smp_processor_id+0xe/0x20
[13260.918861] Code: 00 01 0f 85 66 ff ff ff e9 3a ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 c7 c6 b5 37 39 82 48 c7 c7 0e 6c 3a 82 <e9> fd fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 fe 48
[13260.937769] RSP: 0018:ffffc90001b77d48 EFLAGS: 00000206
[13260.943083] RAX: 0000000000000001 RBX: ffffc90001b77d77 RCX: 0000000000000000
[13260.950308] RDX: 0000000000000001 RSI: ffffffff823937b5 RDI: ffffffff823a6c0e
[13260.957528] RBP: ffff88810bfb24c8 R08: 0000000000000622 R09: 0000000000000000
[13260.964747] R10: 0000000000000001 R11: 3fd23774e533f8b1 R12: ffffffffa0961443
[13260.971967] R13: ffff888235830000 R14: ffff888235830998 R15: ffffffffa0982353
[13260.979187]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13260.984694]  rcu_is_watching+0xd/0x40
[13260.988441]  rcu_read_lock_held_common+0x21/0x40
[13260.993151]  rcu_read_lock_sched_held+0x23/0x90
[13260.997770]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13261.003268]  lock_release+0x1f9/0x2b0
[13261.007018]  _raw_spin_unlock+0x17/0x40
[13261.010941]  flush_dirty_session_caps+0x63/0x150 [ceph]
[13261.016261]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13261.021404]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13261.026816]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13261.031790]  ceph_kill_sb+0x1c/0x90 [ceph]
[13261.035982]  deactivate_locked_super+0x29/0x60
[13261.040516]  cleanup_mnt+0xb8/0x140
[13261.044093]  task_work_run+0x6d/0xb0
[13261.047759]  exit_to_user_mode_prepare+0x226/0x230
[13261.052639]  syscall_exit_to_user_mode+0x25/0x60
[13261.057344]  do_syscall_64+0x40/0x80
[13261.061010]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13261.066151] RIP: 0033:0x7f11c7cfcdfb
[13261.069814] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13261.088716] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13261.096389] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13261.103614] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13261.110834] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13261.118051] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13261.125274] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13261.132509]  </TASK>
[13284.949670] watchdog: BUG: soft lockup - CPU#0 stuck for 86s! [umount:203080]
[13284.956902] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13285.013335] irq event stamp: 188948
[13285.016914] hardirqs last  enabled at (188947): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13285.027034] hardirqs last disabled at (188948): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13285.036708] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13285.045252] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13285.053712] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13285.062952] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13285.070538] RIP: 0010:ceph_check_caps+0x0/0x9e0 [ceph]
[13285.075774] Code: 97 50 08 00 00 b8 01 00 00 00 48 39 ca 76 1a 48 8b b7 58 08 00 00 31 c0 48 39 f2 76 0c 48 01 c9 48 01 f2 48 39 d1 0f 93 c0 c3 <0f> 1f 44 00 00 41 57 41 56 41 55 41 54 55 89 f5 53 48 89 fb 48 81
[13285.094680] RSP: 0018:ffffc90001b77dd8 EFLAGS: 00000246
[13285.099989] RAX: 0000000000000000 RBX: ffff88817f1b4d00 RCX: 0000000000000001
[13285.107212] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88817f1b4d00
[13285.114432] RBP: ffff88817f1b53b8 R08: 0000000000000622 R09: 0000000000000000
[13285.121652] R10: 0000000000000001 R11: 3fd23774e533f8b1 R12: ffff88810bfb24b0
[13285.128870] R13: ffff888235830000 R14: ffff888235830998 R15: ffffffffa0982353
[13285.136091] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13285.144285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13285.150118] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13285.157343] Call Trace:
[13285.159869]  <TASK>
[13285.162053]  flush_dirty_session_caps+0x72/0x150 [ceph]
[13285.167375]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13285.172519]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13285.177934]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13285.182906]  ceph_kill_sb+0x1c/0x90 [ceph]
[13285.187098]  deactivate_locked_super+0x29/0x60
[13285.191631]  cleanup_mnt+0xb8/0x140
[13285.195208]  task_work_run+0x6d/0xb0
[13285.198876]  exit_to_user_mode_prepare+0x226/0x230
[13285.203757]  syscall_exit_to_user_mode+0x25/0x60
[13285.208460]  do_syscall_64+0x40/0x80
[13285.212124]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13285.217267] RIP: 0033:0x7f11c7cfcdfb
[13285.220931] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13285.239838] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13285.247513] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13285.254744] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13285.261966] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13285.269186] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13285.276404] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13285.283633]  </TASK>
[13312.949851] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [umount:203080]
[13312.957189] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13313.013624] irq event stamp: 244580
[13313.017199] hardirqs last  enabled at (244579): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13313.027315] hardirqs last disabled at (244580): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13313.036994] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13313.045535] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13313.053996] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13313.063235] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13313.070825] RIP: 0010:do_raw_spin_unlock+0x41/0xb0
[13313.075704] Code: 48 65 48 8b 04 25 00 be 01 00 48 39 43 10 75 53 65 8b 05 f2 ee eb 7e 39 43 08 75 61 48 c7 c0 ff ff ff ff 48 89 43 10 89 43 08 <c6> 03 00 5b c3 e8 75 cb 3f 00 85 c0 74 c3 48 c7 c6 a2 0b 35 82 48
[13313.094610] RSP: 0018:ffffc90001b77dc0 EFLAGS: 00000246
[13313.099919] RAX: ffffffffffffffff RBX: ffff88810bfb24b0 RCX: ffffc90001b77d94
[13313.107143] RDX: 0000000000000001 RSI: ffffffff8233333d RDI: ffff88810bfb24b0
[13313.114363] RBP: ffff88817f1b53b8 R08: 0000000000000622 R09: 0000000000000000
[13313.121581] R10: 0000000000000001 R11: 3fd23774e533f8b1 R12: ffff88810bfb24b0
[13313.128800] R13: ffff888235830000 R14: ffff888235830998 R15: ffffffffa0982353
[13313.136021] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13313.144213] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13313.150041] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13313.157263] Call Trace:
[13313.159791]  <TASK>
[13313.161975]  _raw_spin_unlock+0x1f/0x40
[13313.165898]  flush_dirty_session_caps+0x63/0x150 [ceph]
[13313.171224]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13313.176369]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13313.181780]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13313.186753]  ceph_kill_sb+0x1c/0x90 [ceph]
[13313.190946]  deactivate_locked_super+0x29/0x60
[13313.195477]  cleanup_mnt+0xb8/0x140
[13313.199056]  task_work_run+0x6d/0xb0
[13313.202722]  exit_to_user_mode_prepare+0x226/0x230
[13313.207605]  syscall_exit_to_user_mode+0x25/0x60
[13313.212308]  do_syscall_64+0x40/0x80
[13313.215973]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13313.221115] RIP: 0033:0x7f11c7cfcdfb
[13313.232528] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13313.251435] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13313.259109] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13313.266333] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13313.273553] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13313.280772] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13313.287991] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13313.295221]  </TASK>
[13340.950034] watchdog: BUG: soft lockup - CPU#0 stuck for 138s! [umount:203080]
[13340.957368] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13341.013803] irq event stamp: 300200
[13341.017371] hardirqs last  enabled at (300199): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13341.027486] hardirqs last disabled at (300200): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13341.037165] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13341.045709] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13341.054167] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13341.063406] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13341.070997] RIP: 0010:rcu_lockdep_current_cpu_online+0xc/0x80
[13341.076831] Code: 01 65 ff 0d 76 50 ea 7e 74 02 5b c3 0f 1f 44 00 00 5b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 bd 01 00 00 00 53 <65> 8b 05 4d 50 ea 7e a9 00 00 f0 00 75 0a 8b 05 1c c2 72 01 85 c0
[13341.095734] RSP: 0018:ffffc90001b77d48 EFLAGS: 00000202
[13341.101043] RAX: 0000000000000001 RBX: ffffc90001b77d77 RCX: 0000000000000000
[13341.108268] RDX: 0000000000000001 RSI: ffffffff823937b5 RDI: ffffffff823a6c0e
[13341.115487] RBP: 0000000000000001 R08: 0000000000000622 R09: 0000000000000000
[13341.122706] R10: 0000000000000001 R11: 3fd23774e533f8b1 R12: ffffffffa0961443
[13341.129926] R13: ffff888235830000 R14: ffff888235830998 R15: ffffffffa0982353
[13341.137146] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13341.145338] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13341.151164] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13341.158387] Call Trace:
[13341.160916]  <TASK>
[13341.163100]  rcu_read_lock_held_common+0x2a/0x40
[13341.167806]  rcu_read_lock_sched_held+0x23/0x90
[13341.172422]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13341.177918]  lock_release+0x1f9/0x2b0
[13341.181669]  _raw_spin_unlock+0x17/0x40
[13341.185596]  flush_dirty_session_caps+0x63/0x150 [ceph]
[13341.190920]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13341.196066]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13341.201478]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13341.206460]  ceph_kill_sb+0x1c/0x90 [ceph]
[13341.210650]  deactivate_locked_super+0x29/0x60
[13341.215182]  cleanup_mnt+0xb8/0x140
[13341.218762]  task_work_run+0x6d/0xb0
[13341.222428]  exit_to_user_mode_prepare+0x226/0x230
[13341.227310]  syscall_exit_to_user_mode+0x25/0x60
[13341.232014]  do_syscall_64+0x40/0x80
[13341.235677]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13341.240811] RIP: 0033:0x7f11c7cfcdfb
[13341.244474] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13341.263375] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13341.271048] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13341.278264] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13341.285486] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13341.292704] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13341.299924] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13341.307153]  </TASK>
[13368.950214] watchdog: BUG: soft lockup - CPU#0 stuck for 164s! [umount:203080]
[13368.957547] Modules linked in: ceph libceph dns_resolver fscache netfs veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink bridge stp llc binfmt_misc overlay xfs libcrc32c sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev wmi mei_me lpc_ich i2c_i801 mei mfd_core i2c_smbus acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ioatdma acpi_pad acpi_power_meter ixgbe crc32c_intel igb mdio ptp nvme i2c_algo_bit pps_core nvme_core dca fuse
[13369.013981] irq event stamp: 355794
[13369.017551] hardirqs last  enabled at (355793): [<ffffffff81c00d42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[13369.027666] hardirqs last disabled at (355794): [<ffffffff81ae1e4a>] sysvec_apic_timer_interrupt+0xa/0xc0
[13369.037349] softirqs last  enabled at (11798): [<ffffffff81e00321>] __do_softirq+0x321/0x483
[13369.045896] softirqs last disabled at (11741): [<ffffffff810e4a0c>] irq_exit_rcu+0xdc/0x120
[13369.054356] CPU: 0 PID: 203080 Comm: umount Tainted: G S           L    5.18.0-ceph-g342bda2e5287 #1
[13369.063594] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0b 04/10/2017
[13369.071183] RIP: 0010:lock_acquire+0x107/0x2f0
[13369.075717] Code: c4 20 e8 ec f3 98 00 b8 ff ff ff ff 65 0f c1 05 d7 84 ec 7e 83 f8 01 0f 85 d3 01 00 00 48 83 3c 24 00 74 06 fb 0f 1f 44 00 00 <48> 8b 44 24 30 65 48 33 04 25 28 00 00 00 0f 85 d0 01 00 00 48 83
[13369.094623] RSP: 0018:ffffc90001b77bb8 EFLAGS: 00000206
[13369.099930] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[13369.107155] RDX: 917bf5a7cab73d6b RSI: ffffffff8233333d RDI: ffffffff823a6c1f
[13369.114375] RBP: ffff88817f1b51b8 R08: 00000000000037ab R09: 0000000000000000
[13369.121593] R10: 0000000000000001 R11: 917bf5a7cab73d6b R12: 0000000000000000
[13369.128813] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[13369.136034] FS:  00007f11c96bd080(0000) GS:ffff88885fc00000(0000) knlGS:0000000000000000
[13369.144228] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13369.150061] CR2: 00007f11c7c501f0 CR3: 000000017a60e001 CR4: 00000000001706f0
[13369.157286] Call Trace:
[13369.159822]  <TASK>
[13369.162005]  ? ceph_check_caps+0x828/0x9e0 [ceph]
[13369.166804]  ? lock_release+0x13d/0x2b0
[13369.170730]  _raw_spin_lock+0x2f/0x40
[13369.174480]  ? ceph_check_caps+0x68/0x9e0 [ceph]
[13369.179188]  ceph_check_caps+0x68/0x9e0 [ceph]
[13369.183734]  ? __lock_acquire+0xa25/0x1d80
[13369.187920]  ? lock_is_held_type+0xe3/0x140
[13369.192186]  ? find_held_lock+0x2d/0x90
[13369.196112]  ? flush_dirty_session_caps+0x63/0x150 [ceph]
[13369.201600]  ? lock_release+0x13d/0x2b0
[13369.205526]  flush_dirty_session_caps+0x72/0x150 [ceph]
[13369.210850]  ? __ceph_put_cap_refs+0x440/0x440 [ceph]
[13369.215997]  ceph_mdsc_iterate_sessions+0x65/0xa0 [ceph]
[13369.221409]  ceph_mdsc_pre_umount+0x52/0x230 [ceph]
[13369.226381]  ceph_kill_sb+0x1c/0x90 [ceph]
[13369.230573]  deactivate_locked_super+0x29/0x60
[13369.235104]  cleanup_mnt+0xb8/0x140
[13369.238675]  task_work_run+0x6d/0xb0
[13369.242342]  exit_to_user_mode_prepare+0x226/0x230
[13369.247224]  syscall_exit_to_user_mode+0x25/0x60
[13369.251929]  do_syscall_64+0x40/0x80
[13369.255590]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[13369.260726] RIP: 0033:0x7f11c7cfcdfb
[13369.264387] Code: 20 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 20 2c 00 f7 d8 64 89 01 48
[13369.283288] RSP: 002b:00007ffe5921e968 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[13369.290961] RAX: 0000000000000000 RBX: 000055d1a94d15d0 RCX: 00007f11c7cfcdfb
[13369.298188] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055d1a94d17b0
[13369.305406] RBP: 0000000000000000 R08: 000055d1a94d17e0 R09: 00007f11c7d80620
[13369.312624] R10: 0000000000000000 R11: 0000000000000246 R12: 000055d1a94d17b0
[13369.319846] R13: 00007f11c9258184 R14: 0000000000000000 R15: 00000000ffffffff
[13369.327073]  </TASK>

History

#2 Updated by Xiubo Li 6 months ago

  • Priority changed from Normal to Urgent

#3 Updated by Jeff Layton 6 months ago

2 of the 3 failures on this test run were due to this problem:

https://pulpito.ceph.com/jlayton-2022-06-02_19:51:44-fs-wip-jlayton-qa-custom-kernel-wip-fscrypt-default-smithi/

The symptoms seem to be pretty consistent. First, the tx queue times out, and the adapter is reset. Then we see socket errors on write, and then soon afterward we start seeing softlockups.

One hypothesis is that the adapter reset doesn't actually clear the problem. We end up not being able to transmit the cap flushes and we end up spinning in this function.

I think we probably need to engage the upstream ixgbe developers and see whether they've seen this issue and can offer any advice.

While this case seems to involve some sort of problem transmitting requests on the socket, flush_dirty_session_caps looks problematic since it appears to involve an unbounded set of work. Can we end up dirtying caps faster than we can clean them and never fully empty the list? I think we may need some way to bound what gets done in that function.

#4 Updated by Xiubo Li 6 months ago

Jeff Layton wrote:

2 of the 3 failures on this test run were due to this problem:

https://pulpito.ceph.com/jlayton-2022-06-02_19:51:44-fs-wip-jlayton-qa-custom-kernel-wip-fscrypt-default-smithi/

The symptoms seem to be pretty consistent. First, the tx queue times out, and the adapter is reset. Then we see socket errors on write, and then soon afterward we start seeing softlockups.

One hypothesis is that the adapter reset doesn't actually clear the problem. We end up not being able to transmit the cap flushes and we end up spinning in this function.

I think we probably need to engage the upstream ixgbe developers and see whether they've seen this issue and can offer any advice.

While this case seems to involve some sort of problem transmitting requests on the socket, flush_dirty_session_caps looks problematic since it appears to involve an unbounded set of work. Can we end up dirtying caps faster than we can clean them and never fully empty the list? I think we may need some way to bound what gets done in that function.

Couldn't get where the problematic is in flush_dirty_session_caps().

4378 static void flush_dirty_session_caps(struct ceph_mds_session *s)
4379 { 
4380         struct ceph_mds_client *mdsc = s->s_mdsc;
4381         struct ceph_inode_info *ci;
4382         struct inode *inode;  
4383   
4384         dout("flush_dirty_caps\n");
4385         spin_lock(&mdsc->cap_dirty_lock);   
4386         while (!list_empty(&s->s_cap_dirty)) {
4387                 ci = list_first_entry(&s->s_cap_dirty, struct ceph_inode_info,
4388                                       i_dirty_item);                      
4389                 inode = &ci->vfs_inode;             
4390                 ihold(inode); 
4391                 dout("flush_dirty_caps %llx.%llx\n", ceph_vinop(inode));
4392                 spin_unlock(&mdsc->cap_dirty_lock); 
4393                 ceph_check_caps(ci, CHECK_CAPS_FLUSH, NULL);                                                                                      
4394                 iput(inode);  
4395                 spin_lock(&mdsc->cap_dirty_lock);   
4396         }
4397         spin_unlock(&mdsc->cap_dirty_lock); 
4398         dout("flush_dirty_caps done\n");    
4399 }

From the code before each inode it will release the mdsc->cap_dirty_lock. And for each inode it shouldn't have too many caps and shouldn't take too long unless it's buggy and keeps retrying in ceph_check_caps() by holding the ci->i_ceph_lock ?

#5 Updated by Jeff Layton 6 months ago

That function takes the first entry of the s_cap_dirty list and then calls ceph_check_caps on it. It's implicitly expecting that the entry will be dequeued from s_cap_dirty, but I don't see where that's guaranteed to occur. There are a lot of special cases in that function that can result in no cap being sent and the thing not being marked flushing.

For instance, we have this in ceph_check_caps:

        spin_lock(&ci->i_ceph_lock);
        if (ci->i_ceph_flags & CEPH_I_ASYNC_CREATE) {
                /* Don't send messages until we get async create reply */
                spin_unlock(&ci->i_ceph_lock);
                ceph_put_mds_session(session);
                return;
        }

If an async create reply gets delayed for a long time (as we might see when there are network issues), then this could cause it to continually loop. There are other potential things that could cause it to remain on s_cap_dirty as well.

What might be best at this point is to add some extra logging in ceph_check_caps when CEPH_CAPS_FLUSH is set, and it returns without dequeueing the cap. That may help us to confirm the problem we're seeing (and may eventually suggest a fix).

A more long term solution probably involves a redesign of how caps are managed in the client. The current code for this is completely ad-hoc and rife with weird emergent behaviors.

#6 Updated by Jeff Layton 6 months ago

I'm testing this patch now which I think might be at least part of the solution here. If it's spinning due to waiting on async create replies, then this prevent that:

From 445e9c77524879060120b53c84b129da34405762 Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@kernel.org>
Date: Mon, 6 Jun 2022 11:07:59 -0400
Subject: [PATCH] ceph: wait on async create before checking caps

Currently, we'll call ceph_check_caps, but if we're still waiting on the
reply, we'll end up spinning around on the same inode. Wait for the
async create reply before we try to flush caps.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/ceph/caps.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 0a48bf829671..5ecfff4b37c9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -4389,6 +4389,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
         ihold(inode);
         dout("flush_dirty_caps %llx.%llx\n", ceph_vinop(inode));
         spin_unlock(&mdsc->cap_dirty_lock);
+        ceph_wait_on_async_create(inode);
         ceph_check_caps(ci, CHECK_CAPS_FLUSH, NULL);
         iput(inode);
         spin_lock(&mdsc->cap_dirty_lock);
-- 
2.36.1

I'll post it upstream if it looks ok. My first run with it ran into teuthology infrastructure issues (a lot of dead jobs).

Also available in: Atom PDF