Project

General

Profile

Actions

Bug #62499

closed

testing (?): deadlock ffsb task

Added by Patrick Donnelly 8 months ago. Updated 8 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

During mon thrashing, the ffsb task hangs/timesout:

2023-08-17T18:44:16.911 DEBUG:teuthology.orchestra.run:got remote process result: 124
2023-08-17T18:44:16.912 INFO:tasks.workunit:Stopping ['suites/ffsb.sh'] on client.0...

/teuthology/yuriw-2023-08-17_14:36:27-fs-wip-yuri7-testing-2023-08-16-1309-pacific-distro-default-smithi/7371826/teuthology.log

Kernel debug log shows (and more):

2023-08-17T15:43:20.946820+00:00 smithi040 kernel: [  795.107032] ceph-brx: port 1(brx.0) entered blocking state
2023-08-17T15:43:20.946822+00:00 smithi040 kernel: [  795.107049] ceph-brx: port 1(brx.0) entered forwarding state
2023-08-17T15:43:21.125861+00:00 smithi040 kernel: [  795.289011] FS-Cache: Loaded
2023-08-17T15:43:21.146808+00:00 smithi040 kernel: [  795.308622] Key type ceph registered
2023-08-17T15:43:21.146830+00:00 smithi040 kernel: [  795.308831] libceph: loaded (mon/osd proto 15/24)
2023-08-17T15:43:21.174770+00:00 smithi040 kernel: [  795.336289] ceph: loaded (mds proto 32)
2023-08-17T15:43:21.190790+00:00 smithi040 kernel: [  795.350533] libceph: mon0 (1)172.21.15.40:6789 session established
2023-08-17T15:43:21.190810+00:00 smithi040 kernel: [  795.351801] libceph: client4671 fsid 02a912b5-8b92-490a-8739-cd10ec7dc459
2023-08-17T15:44:35.990773+00:00 smithi040 kernel: [  870.152057] libceph: mon0 (1)172.21.15.40:6789 socket closed (con state OPEN)
2023-08-17T15:44:35.990793+00:00 smithi040 kernel: [  870.152577] libceph: mon0 (1)172.21.15.40:6789 session lost, hunting for new mon
2023-08-17T15:44:35.994727+00:00 smithi040 kernel: [  870.156935] libceph: mon1 (1)172.21.15.191:6789 session established
2023-08-17T15:44:50.917206+00:00 smithi040 kernel: [  885.080193] ceph: [02a912b5-8b92-490a-8739-cd10ec7dc459 4671] handle_session: mds1 is already opened
2023-08-17T15:46:21.634870+00:00 smithi040 kernel: [  975.798286] ------------[ cut here ]------------
2023-08-17T15:46:21.634890+00:00 smithi040 kernel: [  975.798331] WARNING: CPU: 3 PID: 40795 at fs/ceph/caps.c:711 ceph_add_cap+0x631/0x650 [ceph]
2023-08-17T15:46:21.634892+00:00 smithi040 kernel: [  975.798361] Modules linked in: ceph libceph fscache netfs veth xfs xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c br_netfilter bridge stp llc rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm configfs ib_core overlay ipmi_ssif intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel joydev crypto_simd cryptd mei_me mei ioatdma wmi acpi_ipmi ipmi_si ipmi_devintf acpi_power_meter ipmi_msghandler kvm_intel acpi_pad kvm irqbypass sch_fq_codel scsi_transport_iscsi lp parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid igb ixgbe i2c_algo_bit nvme dca crc32_pclmul ahci ptp i2c_i801 nvme_core lpc_ich i2c_smbus libahci pps_core mdio
2023-08-17T15:46:21.634893+00:00 smithi040 kernel: [  975.798577] CPU: 3 PID: 40795 Comm: kworker/3:26 Not tainted 6.5.0-rc4-ga7fb1265323d #1
2023-08-17T15:46:21.634894+00:00 smithi040 kernel: [  975.798582] Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015
2023-08-17T15:46:21.634895+00:00 smithi040 kernel: [  975.798586] Workqueue: ceph-msgr ceph_con_workfn [libceph]
2023-08-17T15:46:21.634912+00:00 smithi040 kernel: [  975.798618] RIP: 0010:ceph_add_cap+0x631/0x650 [ceph]
2023-08-17T15:46:21.634913+00:00 smithi040 kernel: [  975.798638] Code: c1 e8 f3 ee de cb 0f 0b 44 89 7c 24 04 e9 be fb ff ff 44 8b 7c 24 04 e9 be fd ff ff 0f 0b e9 06 fc ff ff 0f 0b e9 6d fd ff ff <0f> 0b e9 75 fd ff ff 48 89 f1 31 c0 e9 d6 fa ff ff 0f 0b 66 66 2e
2023-08-17T15:46:21.634914+00:00 smithi040 kernel: [  975.798642] RSP: 0018:ffffa298c508b9e8 EFLAGS: 00010287
2023-08-17T15:46:21.634915+00:00 smithi040 kernel: [  975.798648] RAX: 0000000000000037 RBX: ffff9513c9fc3ab0 RCX: ffff95133cb83698
2023-08-17T15:46:21.634916+00:00 smithi040 kernel: [  975.798652] RDX: 0000000000000006 RSI: ffff9512e01d3908 RDI: ffff9512e01d38f0
2023-08-17T15:46:21.634917+00:00 smithi040 kernel: [  975.798655] RBP: ffff9512e01d3000 R08: ffff9512e01d3938 R09: ffff9512e01d3938
2023-08-17T15:46:21.634925+00:00 smithi040 kernel: [  975.798659] R10: 0000000000000006 R11: 0000000000000000 R12: 0000000000000d55
2023-08-17T15:46:21.634926+00:00 smithi040 kernel: [  975.798662] R13: ffff95133cb83690 R14: 0000000000000000 R15: 0000000000000005
2023-08-17T15:46:21.634927+00:00 smithi040 kernel: [  975.798666] FS:  0000000000000000(0000) GS:ffff951a1fcc0000(0000) knlGS:0000000000000000
2023-08-17T15:46:21.634927+00:00 smithi040 kernel: [  975.798670] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-08-17T15:46:21.634928+00:00 smithi040 kernel: [  975.798674] CR2: 0000564cc843f000 CR3: 0000000126266004 CR4: 00000000001706e0
2023-08-17T15:46:21.634929+00:00 smithi040 kernel: [  975.798678] Call Trace:
2023-08-17T15:46:21.634936+00:00 smithi040 kernel: [  975.798682]  <TASK>
2023-08-17T15:46:21.634937+00:00 smithi040 kernel: [  975.798686]  ? __warn+0x84/0x170
2023-08-17T15:46:21.634938+00:00 smithi040 kernel: [  975.798693]  ? ceph_add_cap+0x631/0x650 [ceph]
2023-08-17T15:46:21.634938+00:00 smithi040 kernel: [  975.798716]  ? report_bug+0x191/0x1a0
2023-08-17T15:46:21.634939+00:00 smithi040 kernel: [  975.798726]  ? handle_bug+0x42/0x70
2023-08-17T15:46:21.634940+00:00 smithi040 kernel: [  975.798732]  ? exc_invalid_op+0x18/0x70
2023-08-17T15:46:21.634947+00:00 smithi040 kernel: [  975.798738]  ? asm_exc_invalid_op+0x1a/0x20
2023-08-17T15:46:21.634948+00:00 smithi040 kernel: [  975.798755]  ? ceph_add_cap+0x631/0x650 [ceph]
2023-08-17T15:46:21.634949+00:00 smithi040 kernel: [  975.798776]  ? ceph_add_cap+0x362/0x650 [ceph]
2023-08-17T15:46:21.634950+00:00 smithi040 kernel: [  975.798806]  ceph_fill_inode+0x110a/0x1690 [ceph]
2023-08-17T15:46:21.634950+00:00 smithi040 kernel: [  975.798845]  ceph_fill_trace+0x186/0xf50 [ceph]
2023-08-17T15:46:21.634951+00:00 smithi040 kernel: [  975.798865]  ? __mutex_lock+0x97/0xd00
2023-08-17T15:46:21.634952+00:00 smithi040 kernel: [  975.798870]  ? __mutex_lock+0xaac/0xd00
2023-08-17T15:46:21.634959+00:00 smithi040 kernel: [  975.798877]  ? mds_dispatch+0xedc/0x20a0 [ceph]
2023-08-17T15:46:21.634960+00:00 smithi040 kernel: [  975.798898]  ? lock_downgrade+0x100/0x250
2023-08-17T15:46:21.634960+00:00 smithi040 kernel: [  975.798905]  ? mds_dispatch+0xefc/0x20a0 [ceph]
2023-08-17T15:46:21.634961+00:00 smithi040 kernel: [  975.798933]  mds_dispatch+0xf20/0x20a0 [ceph]
2023-08-17T15:46:21.634962+00:00 smithi040 kernel: [  975.798961]  ? __mutex_unlock_slowpath+0x45/0x280
2023-08-17T15:46:21.634962+00:00 smithi040 kernel: [  975.798972]  ceph_con_process_message+0x78/0x160 [libceph]
2023-08-17T15:46:21.634970+00:00 smithi040 kernel: [  975.798994]  ? calc_signature+0x108/0x130 [libceph]
2023-08-17T15:46:21.634971+00:00 smithi040 kernel: [  975.799022]  ceph_con_v1_try_read+0x11a0/0x1e00 [libceph]
2023-08-17T15:46:21.634971+00:00 smithi040 kernel: [  975.799050]  ? ceph_con_workfn+0x2e/0x730 [libceph]
2023-08-17T15:46:21.634972+00:00 smithi040 kernel: [  975.799072]  ? lock_acquire+0xcc/0x2a0
2023-08-17T15:46:21.634973+00:00 smithi040 kernel: [  975.799078]  ? process_one_work+0x1b9/0x510
2023-08-17T15:46:21.634973+00:00 smithi040 kernel: [  975.799090]  ceph_con_workfn+0x383/0x730 [libceph]
2023-08-17T15:46:21.634974+00:00 smithi040 kernel: [  975.799115]  process_one_work+0x23d/0x510
2023-08-17T15:46:21.634982+00:00 smithi040 kernel: [  975.799129]  worker_thread+0x56/0x3b0
2023-08-17T15:46:21.634982+00:00 smithi040 kernel: [  975.799138]  ? process_one_work+0x510/0x510
2023-08-17T15:46:21.634983+00:00 smithi040 kernel: [  975.799144]  kthread+0xf1/0x120
2023-08-17T15:46:21.634984+00:00 smithi040 kernel: [  975.799150]  ? kthread_complete_and_exit+0x20/0x20
2023-08-17T15:46:21.634984+00:00 smithi040 kernel: [  975.799158]  ret_from_fork+0x31/0x50
2023-08-17T15:46:21.634985+00:00 smithi040 kernel: [  975.799163]  ? kthread_complete_and_exit+0x20/0x20
2023-08-17T15:46:21.634992+00:00 smithi040 kernel: [  975.799169]  ret_from_fork_asm+0x11/0x20
2023-08-17T15:46:21.634993+00:00 smithi040 kernel: [  975.799220]  </TASK>
2023-08-17T15:46:21.634994+00:00 smithi040 kernel: [  975.799224] irq event stamp: 547981
2023-08-17T15:46:21.634995+00:00 smithi040 kernel: [  975.799228] hardirqs last  enabled at (547987): [<ffffffff8d13295e>] __up_console_sem+0x5e/0x70
2023-08-17T15:46:21.634996+00:00 smithi040 kernel: [  975.799235] hardirqs last disabled at (547992): [<ffffffff8d132943>] __up_console_sem+0x43/0x70
2023-08-17T15:46:21.634996+00:00 smithi040 kernel: [  975.799240] softirqs last  enabled at (547974): [<ffffffff8d0a4047>] __irq_exit_rcu+0x77/0xa0
2023-08-17T15:46:21.635003+00:00 smithi040 kernel: [  975.799246] softirqs last disabled at (547965): [<ffffffff8d0a4047>] __irq_exit_rcu+0x77/0xa0
2023-08-17T15:46:21.635004+00:00 smithi040 kernel: [  975.799251] ---[ end trace 0000000000000000 ]---

From: /teuthology/yuriw-2023-08-17_14:36:27-fs-wip-yuri7-testing-2023-08-16-1309-pacific-distro-default-smithi/7371826/remote/smithi040/syslog/kern.log.gz

This was the testing branch of the kernel.

Actions #1

Updated by Patrick Donnelly 8 months ago

Similar one with iozone: /teuthology/yuriw-2023-08-17_14:36:27-fs-wip-yuri7-testing-2023-08-16-1309-pacific-distro-default-smithi/7371806/remote/smithi103/syslog/kern.log.gz

Actions #2

Updated by Patrick Donnelly 8 months ago

Another one that's similar: /teuthology/yuriw-2023-08-17_14:36:27-fs-wip-yuri7-testing-2023-08-16-1309-pacific-distro-default-smithi/7371800/remote/smithi093/syslog/kern.log.gz

This may be caused by https://github.com/ceph/ceph/pull/52852

Actions #3

Updated by Xiubo Li 8 months ago

  • Status changed from New to In Progress
  • Assignee set to Xiubo Li
Actions #4

Updated by Xiubo Li 8 months ago

This was a regression and introduced by https://github.com/ceph/ceph/commit/9fbde6da076f2d7c8bfde50c3f9e581e967c0df0.

In case:

Client:

1, queues releasing the caps and then deletes the corresponding inodeA
2, do a lookup for the inodeA just deleted in step 1, but the lookup request won't carry the queued releasing caps in step 1
3, the lookup in step 2 will add the inodeA back together with the cap, which with the old cap_id, just queued in step 1
4, do a create with inodeA/subfile and the create request will carry the caps releasing to MDS and then MDS will remove the caps and add a new caps again with a fresh new cap_id and then send it back together with the create reply.
5, the create reply will add the cap back but it's already exist and then comparing the old cap_id and fresh new cap_id and then crash.

Actions #5

Updated by Xiubo Li 8 months ago

Just cherry-picked the stale commit https://github.com/lxbsz/ceph/commit/9fbde6da076f2d7c8bfde50c3f9e581e967c0df0, which has been reverted and replaced by https://github.com/lxbsz/ceph/commit/7aaf4ba81b978db63b9cb11a90f881196530e5d5.

It's not a bug in the upstream main branch.

Will close the tracker.

Actions #6

Updated by Xiubo Li 8 months ago

  • Status changed from In Progress to Rejected
Actions #7

Updated by Xiubo Li 8 months ago

Why was this buggy in the backport but not main?

In the `main` branch, my oldest PR https://github.com/ceph/ceph/pull/47752 was merged, but it introduced a regression when I checked the qa test failures then talked with Venky and we revert it in a following PR https://github.com/ceph/ceph/pull/51661 soon after that. And then I pushed these commits in another new PR https://github.com/ceph/ceph/pull/51500 followed.

This is why we didn't hit it in the following qa tests.

There was only one commit https://github.com/ceph/ceph/pull/47752/commits/9fbde6da076f2d7c8bfde50c3f9e581e967c0df0 was replaced by the last PR https://github.com/ceph/ceph/pull/51500/commits/7aaf4ba81b978db63b9cb11a90f881196530e5d5 and all the other commits didn't change.

But when backporting the commits I just searched the first commit from the `main` branch by using the commit title, which just gave the incorrect commit, and then got the other 4 commit sha1 following it directly.

Actions

Also available in: Atom PDF