Bug #62096

closed

mds: infinite rename recursion on itself

Added by Patrick Donnelly 10 months ago. Updated 7 months ago.

Status:
Duplicate
Priority:
High
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/rishabh-2023-07-14_10:26:42-fs-wip-rishabh-2023Jul13-testing-default-smithi/7337403

I don't have an explanation for why PQputline failed specifically, but apparently we hit some new (possible) deadlock:

2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: 973 slow requests, 5 included below; oldest blocked for > 183.521234 secs
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.221582 seconds old, received at 2023-07-14T11:13:58.402232+0000: client_request(mds.1:948 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.190394 seconds old, received at 2023-07-14T11:13:58.433419+0000: client_request(mds.1:980 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.043 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.147883 seconds old, received at 2023-07-14T11:13:58.475930+0000: client_request(mds.1:1012 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.044 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.114251 seconds old, received at 2023-07-14T11:13:58.509562+0000: client_request(mds.1:1044 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting                                
2023-07-14T11:17:03.044 INFO:journalctl@ceph.mon.c.smithi130.stdout:Jul 14 11:17:02 smithi130 ceph-mon[129405]: slow request 183.081030 seconds old, received at 2023-07-14T11:13:58.542783+0000: client_request(mds.1:1076 rename #0x10000000002/0000000100000000000000E1 #0x60b/100000005f2 caller_uid=0, caller_gid=0{}) currently failed to xlock, waiting   

There's no evidence of metadata corruption (tracker 54546).
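For illustration, a minimal sketch (not Ceph tooling) that tallies slow rename requests by (source, target) dentry from a mon/cluster log, assuming the log format quoted above; the log path is a placeholder passed on the command line:

#!/usr/bin/env python3
# Sketch: count how many slow client_request rename entries refer to the
# same source/target dentry, to confirm they are retries of one rename.
import re
import sys
from collections import Counter

# Matches e.g. "client_request(mds.1:948 rename #0x10000000002/... #0x60b/100000005f2 ..."
RENAME_RE = re.compile(r"client_request\(mds\.\d+:\d+ rename (#\S+) (#\S+)")

def count_slow_renames(path):
    pairs = Counter()
    with open(path) as f:
        for line in f:
            if "slow request" not in line:
                continue
            m = RENAME_RE.search(line)
            if m:
                pairs[(m.group(1), m.group(2))] += 1
    return pairs

if __name__ == "__main__":
    for (src, dst), n in count_slow_renames(sys.argv[1]).most_common(10):
        print("%6d  rename %s -> %s" % (n, src, dst))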


Related issues: 3 (2 open, 1 closed)

Related to CephFS - Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) - Resolved - Xiubo Li

Related to CephFS - Bug #61818: mds: deadlock between unlink and linkmerge - Pending Backport - Xiubo Li

Is duplicate of CephFS - Bug #62702: MDS slow requests for the internal 'rename' requests - Pending Backport - Xiubo Li

Actions #1

Updated by Patrick Donnelly 9 months ago

  • Description updated (diff)
Actions #2

Updated by Xiubo Li 9 months ago

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Actions #3

Updated by Patrick Donnelly 9 months ago

  • Related to Bug #58340: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
Actions #4

Updated by Patrick Donnelly 9 months ago

  • Related to Bug #61818: mds: deadlock between unlink and linkmerge added
Actions #5

Updated by Patrick Donnelly 9 months ago

Xiubo Li wrote:

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Hi Xiubo, AFAICT there was no actual deadlock. I think it's a slowdown caused by thousands of rename ops for a single stray migration. The cost of acquiring the locks is quite high, which means it takes a long time for those ops to unwind (they return ENOENT because the first one succeeds).
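As a rough illustration of that cost (a back-of-envelope sketch, not MDS code; all numbers are made-up placeholders):

# If N queued client_requests all contend for the same xlock and each lock
# acquisition/release round trip costs roughly t_lock seconds, the last
# request waits on the order of N * t_lock, even though only the first
# rename does real work and the rest come back with ENOENT.
def worst_case_unwind_seconds(n_queued_ops, t_lock_seconds):
    return n_queued_ops * t_lock_seconds

# A few thousand queued renames at ~50 ms per lock round trip (placeholder
# figures) already exceeds the ~183 s "oldest blocked" time reported by the
# monitor above.
print(worst_case_unwind_seconds(4000, 0.05))  # -> 200.0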

Actions #6

Updated by Xiubo Li 9 months ago

Patrick Donnelly wrote:

Xiubo Li wrote:

Patrick,

This should be the same issue with:

https://tracker.ceph.com/issues/58340
https://tracker.ceph.com/issues/61818

Hi Xiubo, AFAICT there was no actual deadlock. I think it's a slowdown caused by thousands of rename ops for a single stray migration. The cost of acquiring the locks is quite high, which means it takes a long time for those ops to unwind (they return ENOENT because the first one succeeds).

Okay. As I recall, when the unlink and linkmerge requests deadlocked, I could see a lot of rename requests issued recursively.

Actions #7

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #62702: MDS slow requests for the internal 'rename' requests added
Actions #8

Updated by Patrick Donnelly 7 months ago

  • Status changed from In Progress to Duplicate
Actions #9

Updated by Patrick Donnelly 7 months ago

  • Related to deleted (Bug #62702: MDS slow requests for the internal 'rename' requests)
Actions #10

Updated by Patrick Donnelly 7 months ago

  • Is duplicate of Bug #62702: MDS slow requests for the internal 'rename' requests added