Project

General

Profile

Actions

Bug #58340

closed

mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename))

Added by Venky Shankar over 1 year ago. Updated 8 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Q/A
Tags:
backport_processed
Backport:
reef,pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2022-12-21_14:01:01-fs-wip-vshankar-testing-20221215.112736-testing-default-smithi/7123518

The MDS is kind of stuck in some lock transition and not making progress servicing client requests.

2022-12-21T15:08:32.659+0000 1950e700  7 mds.1.server reply_client_request 0 ((0) Success) client_request(client.4536:15443 unlink #0x40000000206/c24 2022-12-21T15:08:32.520905+0000 caller_uid=1000, caller_gid=1265{6,36,1000,1265,}) v6
2022-12-21T15:08:32.659+0000 1950e700 10 mds.1.server apply_allocated_inos 0x0 / [] / 0x0
2022-12-21T15:08:32.659+0000 1950e700  7 mds.1.locker local_wrlock_finish  on (iversion lock w=1 last_client=4536) on [inode 0x101 [...2,head] ~mds1/ auth v21 ap=1 snaprealm=0xf46a930 DIRTYPARENT f(v0 10=0+10) n(v5 rc2022-12-21T15:08:32.520905+0000 b4194304 14=3+11)/n(v0 rc2022-12-21T14:44:28.849089+0000 11=0+11) (inest lock w=1) (iversion lock w=1 last_client=4536) | dirtyscattered=0 lock=2 dirfrag=1 dirtyparent=1 dirty=1 authpin=1 0xf46a0e0]
2022-12-21T15:08:32.659+0000 1950e700  7 mds.1.locker wrlock_finish on (inest lock w=1) on [inode 0x101 [...2,head] ~mds1/ auth v21 ap=1 snaprealm=0xf46a930 DIRTYPARENT f(v0 10=0+10) n(v5 rc2022-12-21T15:08:32.520905+0000 b4194304 14=3+11)/n(v0 rc2022-12-21T14:44:28.849089+0000 11=0+11) (inest lock w=1) (iversion lock) | dirtyscattered=0 lock=1 dirfrag=1 dirtyparent=1 dirty=1 authpin=1 0xf46a0e0]
2022-12-21T15:08:32.659+0000 1950e700 10 mds.1.locker scatter_eval (inest lock) on [inode 0x101 [...2,head] ~mds1/ auth v21 ap=1 snaprealm=0xf46a930 DIRTYPARENT f(v0 10=0+10) n(v5 rc2022-12-21T15:08:32.520905+0000 b4194304 14=3+11)/n(v0 rc2022-12-21T14:44:28.849089+0000 11=0+11) (inest lock) (iversion lock) | dirtyscattered=0 lock=0 dirfrag=1 dirtyparent=1 dirty=1 authpin=1 0xf46a0e0]
2022-12-21T15:08:32.660+0000 1950e700  7 mds.1.locker local_wrlock_finish  on (iversion lock w=1 last_client=4536) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (ifile lock w=1) (iversion lock w=1 last_client=4536) | lock=3 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.660+0000 1950e700  7 mds.1.locker wrlock_finish on (ifile lock w=1) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (ifile lock w=1) (iversion lock) | lock=2 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.660+0000 1950e700  7 mds.1.locker file_eval wanted= loner_wanted= other_wanted=  filelock=(ifile lock) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (ifile lock) (iversion lock) | lock=1 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.660+0000 1950e700  7 mds.1.locker file_eval stable, bump to sync (ifile lock) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (ifile lock) (iversion lock) | lock=1 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.660+0000 1950e700  7 mds.1.locker simple_sync on (ifile lock) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (ifile lock) (iversion lock) | lock=1 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.661+0000 1950e700  7 mds.1.locker wrlock_finish on (inest lock w=1) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock w=1) (iversion lock) | lock=1 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.661+0000 1950e700 10 mds.1.locker scatter_eval (inest lock) on [inode 0x60b [...2,head] ~mds1/stray1/ auth v56 ap=1 DIRTYPARENT f(v0 m2022-12-21T15:08:32.520905+0000 2=2+0) n(v0 rc2022-12-21T15:08:32.520905+0000 b4194304 3=2+1) (inest lock) (iversion lock) | lock=0 dirfrag=1 stickydirs=1 stray=1 dirtyparent=1 dirty=1 authpin=1 0xf475be0]
2022-12-21T15:08:32.661+0000 1950e700 10 mds.1.locker xlock_finish on (dn xlockdone l x=1) [dentry #0x1/client.0/tmp/tmp.UQxwie74AM/p5/d0/d2/d14/d37/c24 [2,head] auth NULL (dn xlockdone l x=1) (dversion lock w=1 last_client=4536) v=350 ap=2 ino=(nil) state=1610612736 | request=1 lock=2 inodepin=0 replicated=0 dirty=1 authpin=1 clientlease=1 0x1acdffd0]
2022-12-21T15:08:32.661+0000 1950e700 10 mds.1.locker eval_gather (dn xlockdone l) on [dentry #0x1/client.0/tmp/tmp.UQxwie74AM/p5/d0/d2/d14/d37/c24 [2,head] auth NULL (dn xlockdone l) (dversion lock w=1 last_client=4536) v=350 ap=2 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 replicated=0 dirty=1 authpin=1 clientlease=1 0x1acdffd0]
2022-12-21T15:08:32.661+0000 1950e700  7 mds.1.locker eval_gather finished gather on (dn xlockdone l) on [dentry #0x1/client.0/tmp/tmp.UQxwie74AM/p5/d0/d2/d14/d37/c24 [2,head] auth NULL (dn xlockdone l) (dversion lock w=1 last_client=4536) v=350 ap=2 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 replicated=0 dirty=1 authpin=1 clientlease=1 0x1acdffd0]
2022-12-21T15:08:32.662+0000 1950e700 10 mds.1.cache.den(0x40000000206 c24) auth_unpin by 0x1ace00b0 on [dentry #0x1/client.0/tmp/tmp.UQxwie74AM/p5/d0/d2/d14/d37/c24 [2,head] auth NULL (dn sync l) (dversion lock w=1 last_client=4536) v=350 ap=1 ino=(nil) state=1610612736 | request=1 lock=1 inodepin=0 replicated=0 dirty=1 authpin=1 clientlease=1 0x1acdffd0] now 1
2022-12-21T15:08:32.662+0000 1950e700 15 mds.1.cache.dir(0x40000000206) adjust_nested_auth_pins -1 on [dir 0x40000000206 /client.0/tmp/tmp.UQxwie74AM/p5/d0/d2/d14/d37/ [2,head] auth{0=1,3=1} v=351 cv=0/0 dir_auth=1 ap=1+1 state=1610874881|complete f(v0 m2022-12-21T15:08:32.520905+0000 11=8+3)/f(v0 m2022-12-21T15:08:10.789215+0000 12=9+3) n(v12 rc2022-12-21T15:08:32.520905+0000 b31204703 35=26+9)/n(v12 rc2022-12-21T15:08:23.108040+0000 b30920189 35=27+8) hs=11+4,ss=0+0 dirty=15 | ptrwaiter=0 child=1 frozen=0 subtree=1 importing=0 replicated=1 dirty=1 authpin=1 0x1cd27040] by 0x1ace00b0 count now 1/1

Related issues 7 (1 open6 closed)

Related to CephFS - Bug #59134: mds: deadlock during unlink with multimds (postgres)Duplicate

Actions
Related to CephFS - Bug #62096: mds: infinite rename recursion on itselfDuplicatePatrick Donnelly

Actions
Related to CephFS - Bug #61818: mds: deadlock between unlink and linkmergePending BackportXiubo Li

Actions
Related to CephFS - Bug #56695: [RHEL stock] pjd test failures(a bug that need to wait the unlink to finish)ResolvedXiubo Li

Actions
Copied to CephFS - Backport #61346: pacific: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename))ResolvedXiubo LiActions
Copied to CephFS - Backport #61347: reef: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename))ResolvedXiubo LiActions
Copied to CephFS - Backport #61348: quincy: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename))ResolvedXiubo LiActions
Actions #1

Updated by Xiubo Li over 1 year ago

  • Pull request ID set to 49553
Actions #2

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
Actions #3

Updated by Xiubo Li about 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Assignee set to Xiubo Li
Actions #4

Updated by Venky Shankar about 1 year ago

  • Related to Bug #59134: mds: deadlock during unlink with multimds (postgres) added
Actions #5

Updated by Xiubo Li 12 months ago

2023-02-10T05:33:35.813+0000 7f3673425223  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 723.3245213 secs
2023-02-10T05:33:35.813+0000 7f3673425223  0 log_channel(cluster) log [WRN] : slow request 723.32452 seconds old, received at 2023-04-19T18:24:02.644336+0000: client_request(mds.0:343992 rename #0x20000008f81/posd #0x629/3000034765e caller_uid=0, caller_gid=0{}) currently acquired locks
2023-02-10T05:33:35.813+0000 7f367345223  0 log_channel(cluster) log [WRN] : slow request 723.34452 seconds old, received at 2023-04-19T18:24:02.645006+0000: client_request(client.252712820:58191 unlink #0x20000008f81/posd 2023-04-19T18:24:02.644807+0000 caller_uid=0, caller_gid=0{}) currently requesting remote authpins

This is because when one file has hard links and when the first hard link was unlinked, it will remove the corresponding dentry and then try to move the corresponding CInode to stray dir. But since the nlink != 0 so it will try to reintergrated the stray dentry with another hardlink in the same MDS or remote MDS, which it will send a rename request to that MDS.

So after that if a client tries to unlink another hard link it will cause deadlock.

Actions #6

Updated by Xiubo Li 12 months ago

  • Subject changed from mds: fsstress.sh hangs with multimds to mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate(rename))
Actions #7

Updated by Xiubo Li 12 months ago

  • Subject changed from mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate(rename)) to mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename))
Actions #8

Updated by Xiubo Li 12 months ago

Please evict the corresponding client to unblock this deadlock. Mostly this should work, if not then please restart the MDS daemons.

Actions #13

Updated by Patrick Donnelly 11 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Target version changed from v18.0.0 to v19.0.0
  • Backport changed from pacific,quincy to reef,pacific,quincy
Actions #14

Updated by Backport Bot 11 months ago

  • Copied to Backport #61346: pacific: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
Actions #15

Updated by Backport Bot 11 months ago

  • Copied to Backport #61347: reef: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
Actions #16

Updated by Backport Bot 11 months ago

  • Copied to Backport #61348: quincy: mds: fsstress.sh hangs with multimds (deadlock between unlink and reintegrate straydn(rename)) added
Actions #17

Updated by Backport Bot 11 months ago

  • Tags set to backport_processed
Actions #20

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #62096: mds: infinite rename recursion on itself added
Actions #21

Updated by Xiubo Li 8 months ago

  • Status changed from Pending Backport to Resolved
Actions #22

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #61818: mds: deadlock between unlink and linkmerge added
Actions #23

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #56695: [RHEL stock] pjd test failures(a bug that need to wait the unlink to finish) added
Actions

Also available in: Atom PDF