Bug #65802

Quiesce and rename aren't properly synchronized

Added by Leonid Usov 14 days ago. Updated 8 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:
0%
Source:
Tags:
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, quiesce
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Detected in this run: https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688480/

Quiesce error: Error quiescing set 'ec875e81': 110 (ETIMEDOUT)
Later, the dbench tool exited with a failed write. It's not yet clear whether the errors are related.
Looks like an interlock between a quiesce and a rename.

  ""description"": ""internal op quiesce_path:mds.2:141 fp=#0x1"",
  ""initiated_at"": ""2024-05-04T01:22:26.068615+0000"",

  ""description"": ""peer_request(client.24916:109762.0 rename_prep)"",
  ""initiated_at"": ""2024-05-04T01:22:26.034865+0000"",

The quiesce op holds iquiesce, ipolicy, iauth, and ifile, and is currently failing to rdlock ilink, which is held by the rename:
ilink xlock x=1 by request(client.24916:109762 nref=9 peer_to mds.0 sr=0x56032df13180)

The rename has acquired all of its locks:

      {
        "time": "2024-05-04T01:22:26.039762+0000",
        "event": "acquired locks"
      }


and the rename began before the quiesce. It looks like the rename operation is missing a quiesce rdlock, or takes it too late.

Another case of a similar issue has a different look: this time, the rename op holds the shared quiesce lock, and the quiesce is pending, but, apparently, the rename is stalled: https://pulpito.ceph.com/pdonnell-2024-05-03_22:48:16-fs-wip-pdonnell-testing-20240503.163550-debug-distro-default-smithi/7688474/

{
  "description": "internal op quiesce_inode:mds.2:35659 fp=#0x10000008328",
  "initiated_at": "2024-05-04T01:11:06.343933+0000",
  "age": 45.904898622,
  "duration": 45.95364117,
  "continuous": true,
  "type_data": {
    "result": -2147483648,
    "flag_point": "failed to acquire quiesce lock",
    "reqid": {
      "entity": {
        "type": "mds",
        "num": 2
      },
      "tid": 35659
    },


The rename op now holds the quiesce lock:
""object_string"": ""[inode 0x10000008328 [...13,head] /volumes/qa/sv_1/59561494-6f93-4381-b330-4df34e01dbf5/client.0/tmp/payload.1/multiple_rsync_payload.196474/python3.9/site-packages/kubernetes/client/models/__pycache__/ auth{0=1,1=1} fragtree_t(*^3 000*^3 001*^3 011*^3 111*^3) v6918 ap=5 f(v1 m2024-05-04T01:10:58.241407+0000 885=885+0) n(v1 rc2024-05-04T01:10:58.241407+0000 b141736 887=886+1) old_inodes=1 (idft lock) (isnap sync r=1) (inest mix->lock(2) g=1 w=1 dirty) (ifile mix->lock(2) g=1 w=1) (iquiesce lock w=1) caps={15162=pAsLsXs/pAsLsXsFsxcral@14} | dirtyscattered=1 request=1 lock=4 importingcaps=0 importing=0 dirfrag=36 caps=1 stickydirs=0 dirwaiter=0 replicated=1 dirty=1 waiter=1 authpin=1 discoverbase=0 0x5588e8d4e100]"",
        ""lock"": {
          ""gather_set"": [],
          ""state"": ""lock"",
          ""type"": ""iquiesce"",
          ""is_leased"": false,
          ""num_rdlocks"": 0,
          ""num_wrlocks"": 1,
          ""num_xlocks"": 0,
          ""xlock_by"": {}
        }," 

Related issues (1 open, 0 closed)

Related to CephFS - Bug #66108: Quiesce timeout due to a remote xlock request blocked on the quiescelock (New) - Leonid Usov

Actions #1

Updated by Leonid Usov 13 days ago

  • Assignee set to Leonid Usov
Actions #2

Updated by Leonid Usov 12 days ago

With the help of @Kotresh Hiremath Ravishankar, we now have a picture of the deadlock:

1. the dest auth mds xlocks the linklock on both the src and the dest
2. one of the witnesses is the src auth. Since the linklock is replicated, it appears as xlocked on the witness
3. the src auth receives a quiesce request, which takes an authpin on the src and tries to rdlock the linklock
4. the witness is waiting for the src to freeze before it can respond to the rename driver (the dest auth)
5. quiesce is unable to rdlock the linklock while it's held by the rename driver, which is waiting for the witnesses to respond

Before we think about how to recover from this deadlock, I have an approach for avoiding it altogether. The quiesce operation takes the linklock for the sole reason of kicking the clients holding W capabilities. However, that's not really needed in this case, as there are no clients holding any cap on this lock: it's been xlocked for a while already.
So, the suggested approach is to avoid taking locks that don't have any corresponding W capability issued to any of the clients. This is safe as long as we already hold the quiesce lock by the time we check, which guarantees that no W capabilities will be issued concurrently.
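
A minimal sketch of that check, assuming it runs in the quiesce lock-gathering code while the inode's quiesce lock is already held (the helper name and the wcap_mask parameter are illustrative, not the actual Ceph implementation):

    // Hypothetical helper: only add this lock to the quiesce lock vector when
    // some client actually holds a W capability that would have to be revoked.
    // Only safe once the inode's quiesce lock is held, so no new W caps can be
    // issued while we inspect the currently issued set.
    void maybe_add_quiesce_rdlock(CInode *in, SimpleLock *lock, int wcap_mask,
                                  MutationImpl::LockOpVec &lov)
    {
      int issued = in->get_caps_issued();  // caps currently issued to clients
      if ((issued & wcap_mask) == 0) {
        // no client can write through this lock; nothing to kick, skip the rdlock
        return;
      }
      lov.add_rdlock(lock);
    }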

Actions #3

Updated by Leonid Usov 12 days ago

Update: having implemented the above, I realized that it's just an optimization. The real issue we had was due to the `skip_quiesce` flag that we pass to `acquire_locks` in the peer lock requests. This is what prevented the src auth from taking the shared quiesce lock on the inode when it performed the `ilink` xlock on behalf of the rename driver.

The local quiesce lock must also be taken automatically in dispatch_peer_request:

    // don't add quiescelock, let the peer acquire that lock themselves
    if (!mds->locker->acquire_locks(mdr, lov, nullptr, {}, false, true))
      return;

To avoid ambiguity, I'll switch to a different interface for acquire_locks: whether the quiesce lock is taken automatically will be controlled by inspecting the passed `lov` vector.
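
Roughly, the idea looks like the sketch below (illustrative only; `in`, the iteration over `lov`, and the surrounding acquire_locks plumbing are stand-ins for the real code):

    // Inside acquire_locks(): instead of a separate skip_quiesce flag, decide
    // whether to auto-add the shared quiesce lock by checking whether the
    // caller already put it into the lock vector.
    bool caller_manages_quiesce = false;
    for (const auto& op : lov) {
      if (op.lock->get_type() == CEPH_LOCK_IQUIESCE) {
        caller_manages_quiesce = true;  // the caller explicitly handles quiescing
        break;
      }
    }
    if (!caller_manages_quiesce) {
      // take the local quiesce lock automatically, including for peer requests
      lov.add_rdlock(&in->quiescelock);
    }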

Actions #4

Updated by Leonid Usov 12 days ago

  • Status changed from New to In Progress
  • Pull request ID set to 57332
Actions #5

Updated by Leonid Usov 8 days ago

The issue continues to give.

It turns out that the src file must have had the FROZEN_AUTHPIN state since the beginning of the rename op, and we can see that this was indeed the case:

2024-05-04T01:22:26.033+0000 7f8b1f5b4700 10 mds.2.server handle_peer_auth_pin request(client.24916:109762 nref=2 peer_to mds.0 sr=0x56032df13180)
2024-05-04T01:22:26.033+0000 7f8b1f5b4700 15 mds.2.server  nonblocking=0 bypassfreezing=0
2024-05-04T01:22:26.033+0000 7f8b1f5b4700 20 mds.2.cache.dir(0x10000000229.001*) lookup (__QB5.MB, 'head')
2024-05-04T01:22:26.033+0000 7f8b1f5b4700 20 mds.2.cache.dir(0x10000000229.001*)   hit -> (__QB5.MB,head)
2024-05-04T01:22:26.033+0000 7f8b1f5b4700 10 mds.2.server  freezing auth pin on [inode 0x30000000cd7 [10,head] /volumes/qa/sv_1/62599e65-cce8-43a0-a1ec-a0eb963a59d9/client.0/tmp/clients/client0/~dmtmp/PARADOX/__QB5.MB auth{0=1} v9755 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) cr={24916=0-4194304@f} caps={24916=pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@4},l=24916 | request=0 caps=1 dirtyparent=1 replicated=1 dirty=1 0x56032e3b7700]
...
2024-05-04T01:22:26.038+0000 7f8b1f5b4700 10 mds.2.cache.ino(0x30000000cd7) auth_pin by 0x56032e3d1800 on [inode 0x30000000cd7 [10,head] /volumes/qa/sv_1/62599e65-cce8-43a0-a1ec-a0eb963a59d9/client.0/tmp/clients/client0/~dmtmp/PARADOX/__QB5.MB auth{0=1} v9755 pv9758 ap=4 DIRTYPARENT FROZEN_AUTHPIN s=0 n(v0 1=1+0)->n(v0 rc2024-05-04T01:22:26.018993+0000 b4096 1=1+0)/n(v0 1=1+0) (iauth excl) (ilink sync->lock) (ifile excl w=1) (ixattr excl) (iversion lock x=1 by request(client.24916:109762 nref=7 peer_to mds.0 sr=0x56032dfdce00)) cr={24916=0-4194304@f} caps={24916=pAsxXsxFsxcrwb/pAsxXsxFsxcrwb@5},l=24916 | request=1 lock=3 caps=1 frozen=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 0x56032e3b7700] now 4

What we see above is an unconditional authpin increment from the _do_cap_update method. This is fine, since the authpin will be dropped once the caps get issued, but this update is necessary to reproduce our problem.

With the ap count now at 4, we get to the server's handle_peer_rename_prep method with one authpin too many to freeze the inode:

      // freeze?
      // we need this to
      //  - avoid conflicting lock state changes
      //  - avoid concurrent updates to the inode
      //     (this could also be accomplished with the versionlock)
      int allowance = 3; // 1 for the mdr auth_pin, 1 for the link lock, 1 for the snap lock
      dout(10) << " freezing srci " << *srcdnl->get_inode() << " with allowance " << allowance << dendl;
      bool frozen_inode = srcdnl->get_inode()->freeze_inode(allowance);
2024-05-04T01:22:26.038+0000 7f8b1f5b4700 10 mds.2.server handle_peer_rename_prep request(client.24916:109762 nref=7 peer_to mds.0 sr=0x56032df13180) #0x10000000229/__QB5.MB to #0x10000000229/ANSWER.MB
...
2024-05-04T01:22:26.039+0000 7f8b1f5b4700 10 mds.2.server  freezing srci [inode 0x30000000cd7 [10,head] /volumes/qa/sv_1/62599e65-cce8-43a0-a1ec-a0eb963a59d9/client.0/tmp/clients/client0/~dmtmp/PARADOX/__QB5.MB auth{0=1} v9755 pv9758 ap=4 DIRTYPARENT FROZEN_AUTHPIN s=0 n(v0 1=1+0)->n(v0 rc2024-05-04T01:22:26.018993+0000 b4096 1=1+0) (iauth excl) (ilink xlock x=1 by request(client.24916:109762 nref=10 peer_to mds.0 sr=0x56032df13180)) (ifile excl w=1) (ixattr excl) (iversion lock x=1 by request(client.24916:109762 nref=10 peer_to mds.0 sr=0x56032df13180)) cr={24916=0-4194304@f} caps={24916=pAsxXsxFsxcrwb/pAsxXsxFsxcrwb@5},l=24916 | ptrwaiter=1 request=1 lock=4 caps=1 frozen=0 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56032e3b7700] with allowance 3
...
2024-05-04T01:22:26.093+0000 7f8b1f5b4700 10 mds.2.cache dispatch_quiesce_inode: scheduling op to quiesce [inode 0x30000000cd7 [10,head] /volumes/qa/sv_1/62599e65-cce8-43a0-a1ec-a0eb963a59d9/client.0/tmp/clients/client0/~dmtmp/PARADOX/__QB5.MB auth{0=1} v9755 pv9758 ap=4 DIRTYPARENT FREEZING=3 s=0 n(v0 1=1+0)->n(v0 rc2024-05-04T01:22:26.018993+0000 b4096 1=1+0) (iauth excl) (ilink xlock x=1 by request(client.24916:109762 nref=9 peer_to mds.0 sr=0x56032df13180)) (ifile excl w=1) (ixattr excl) (iversion lock x=1 by request(client.24916:109762 nref=9 peer_to mds.0 sr=0x56032df13180)) cr={24916=0-4194304@f} caps={24916=pAsxXsxFsxcrwb/pAsxXsxFsxcrwb@5},l=24916 | ptrwaiter=1 request=1 lock=4 caps=1 freezing=1 frozen=0 dirtyparent=1 replicated=1 dirty=1 waiter=1 authpin=1 0x56032e3b7700]

So, by the time the quiesce op gets to this node, the inode still has 4 authpins, but its state is now... freezing! ap=4 DIRTYPARENT FREEZING=3

The quiesce op should not have been able to take the authpin, as is the case in all my reproduction attempts, but because the node has lost its FROZEN_AUTHPIN state, the quiesce progresses and then later deadlocks with the rename op.

So, the issue is in how the FROZEN_AUTHPIN state is managed: where did it go? It turns out that rename_prep deliberately clears it:

      int allowance = 3; // 1 for the mdr auth_pin, 1 for the link lock, 1 for the snap lock
      dout(10) << " freezing srci " << *srcdnl->get_inode() << " with allowance " << allowance << dendl;
      bool frozen_inode = srcdnl->get_inode()->freeze_inode(allowance);

      // unfreeze auth pin after freezing the inode to avoid queueing waiters
      if (srcdnl->get_inode()->is_frozen_auth_pin())
        mdr->unfreeze_auth_pin();

      if (!frozen_inode) {
        srcdnl->get_inode()->add_waiter(CInode::WAIT_FROZEN, new C_MDS_RetryRequest(mdcache, mdr));
        return;
      }

I see several issues:

1. The quiesce code path of bypass_freezing should honor the FROZEN_AUTHPIN state as if it were "frozen", i.e. not allow the authpin
2. The rename_prep shouldn't clear the FROZEN_AUTHPIN bit until the inode is frozen again (see the sketch after this list)
3. The peer locking should take the local quiesce lock (see above). If that were the case, the rename wouldn't have progressed further than requesting the remote lock of the src ilink, or, if it did make it, the quiesce wouldn't have been able to reach the src node, preventing it from freezing and thus deadlocking on the ilink lock
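
For item 2, one possible shape of the fix, sketched loosely against the handle_peer_rename_prep excerpt quoted above (not a tested patch), is to keep the FROZEN_AUTHPIN state until freeze_inode() has actually succeeded:

      int allowance = 3; // 1 for the mdr auth_pin, 1 for the link lock, 1 for the snap lock
      dout(10) << " freezing srci " << *srcdnl->get_inode() << " with allowance " << allowance << dendl;
      bool frozen_inode = srcdnl->get_inode()->freeze_inode(allowance);

      if (!frozen_inode) {
        // keep FROZEN_AUTHPIN while we wait, so a concurrent quiesce op still
        // sees the inode as not authpin-able, even with bypass_freezing
        srcdnl->get_inode()->add_waiter(CInode::WAIT_FROZEN, new C_MDS_RetryRequest(mdcache, mdr));
        return;
      }

      // only drop the auth pin freeze once the inode is actually frozen
      if (srcdnl->get_inode()->is_frozen_auth_pin())
        mdr->unfreeze_auth_pin();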

Actions #6

Updated by Leonid Usov 1 day ago

  • Blocked by Bug #66108: Quiesce timeout due to a remote xlock request blocked on the quiescelock added
Actions #7

Updated by Leonid Usov 1 day ago

  • Blocked by deleted (Bug #66108: Quiesce timeout due to a remote xlock request blocked on the quiescelock)
Actions #8

Updated by Leonid Usov 1 day ago

  • Related to Bug #66108: Quiesce timeout due to a remote xlock request blocked on the quiescelock added