Bug #62096

closed

mds: infinite rename recursion on itself

Added by Patrick Donnelly 10 months ago. Updated 7 months ago.

Status:
Duplicate
Priority:
High
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/rishabh-2023-07-14_10:26:42-fs-wip-rishabh-2023Jul13-testing-default-smithi/7337403

I don't have an explanation for why PQputline failed specifically, but apparently we hit some new (possible) deadlock:

2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: 973 slow requests, 5 included below; oldest blocked for > 183.521234 secs
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.221582 seconds old, received at 2023-07-14T11:13:58.402232+0000: client_request(mds.1:948 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.190394 seconds old, received at 2023-07-14T11:13:58.433419+0000: client_request(mds.1:980 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.147883 seconds old, received at 2023-07-14T11:13:58.475930+0000: client_request(mds.1:1012 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.044 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.114251 seconds old, received at 2023-07-14T11:13:58.509562+0000: client_request(mds.1:1044 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.044 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.081030 seconds old, received at 2023-07-14T11:13:58.542783+0000: client_request(mds.1:1076 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting   

There's no evidence of metadata corruption (tracker 54546).
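For illustration, a minimal sketch (not Ceph tooling) that tallies slow rename requests by (source, target) dentry from a mon/cluster log, assuming the log format quoted above; the log path is a placeholder passed on the command line:

#!/usr/bin/env python3
# Sketch: count how many slow client_request rename entries refer to the
# same source/target dentry, to confirm they are retries of one rename.
import re
import sys
from collections import Counter

# Matches e.g. "client_request(mds.1:948 rename #0x10000000002/... #0x60b/100000005f2 ..."
RENAME_RE = re.compile(r"client_request\(mds\.\d+:\d+ rename (#\S+) (#\S+)")

def count_slow_renames(path):
    pairs = Counter()
    with open(path) as f:
        for line in f:
            if "slow request" not in line:
                continue
            m = RENAME_RE.search(line)
            if m:
                pairs[(m.group(1), m.group(2))] += 1
    return pairs

if __name__ == "__main__":
    for (src, dst), n in count_slow_renames(sys.argv[1]).most_common(10):
        print("%6d  rename %s -> %s" % (n, src, dst))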


Related issues: 3 (2 open, 1 closed)

Related to CephFS - Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) - Resolved - Xiubo Li

Related to CephFS - Bug #61818: mds: deadlock between unlink and linkmerge - Pending Backport - Xiubo Li

Is duplicate of CephFS - Bug #62702: MDS slow requests for the internal 'rename' requests - Pending Backport - Xiubo Li

Actions #1

Updated by Patrick Donnelly 9 months ago

  • Description updated (diff)
Actions #2

Updated by Xiubo Li 9 months ago

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Actions #3

Updated by Patrick Donnelly 9 months ago

  • Related to Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
Actions #4

Updated by Patrick Donnelly 9 months ago

  • Related to Bug #61818: mds: deadlock between unlink and linkmerge added
Actions #5

Updated by Patrick Donnelly 9 months ago

Xiubo Li wrote:

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Hi Xiubo, AFAICT there was no actual deadlock. I think it's a slowdown caused by thousands of rename ops for a single stray migration. The cost of acquiring the locks is quite high, which means it takes a long time for those ops to unwind (they return ENOENT because the first one succeeds).
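As a rough illustration of that cost (a back-of-envelope sketch, not MDS code; all numbers are made-up placeholders):

# If N queued client_requests all contend for the same xlock and each lock
# acquisition/release round trip costs roughly t_lock seconds, the last
# request waits on the order of N * t_lock, even though only the first
# rename does real work and the rest come back with ENOENT.
def worst_case_unwind_seconds(n_queued_ops, t_lock_seconds):
    return n_queued_ops * t_lock_seconds

# A few thousand queued renames at ~50 ms per lock round trip (placeholder
# figures) already exceeds the ~183 s "oldest blocked" time reported by the
# monitor above.
print(worst_case_unwind_seconds(4000, 0.05))  # -> 200.0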

Actions #6

Updated by Xiubo Li 9 months ago

Patrick Donnelly wrote:

Xiubo Li wrote:

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Hi Xiubo, AFAICT there was no actual deadlock. I think it's a slowdown caused by thousands of rename ops for a single stray migration. The cost of acquiring the locks is quite high, which means it takes a long time for those ops to unwind (they return ENOENT because the first one succeeds).

Okay. As I recall, when the unlink and linkmerge requests deadlocked, I could see a lot of rename requests issued recursively.

Actions #7

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #62702: MDS slow requests for the internal 'rename' requests added
Actions #8

Updated by Patrick Donnelly 7 months ago

  • Status changed from In Progress to Duplicate
Actions #9

Updated by Patrick Donnelly 7 months ago

  • Related to deleted (Bug #62702: MDS slow requests for the internal 'rename' requests)
Actions #10

Updated by Patrick Donnelly 7 months ago

  • Is duplicate of Bug #62702: MDS slow requests for the internal 'rename' requests added