Project

General

Profile

Actions

Bug #45935

closed

mds: cap revoking requests didn't success when the client doing reconnection

Added by Xiubo Li almost 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The kclient node log:

 53 2020-06-03T01:04:06.610312+00:00 smithi070 kernel: [  825.023480] ceph: mds0 reconnect start
 54 2020-06-03T01:04:06.658346+00:00 smithi070 kernel: [  825.069176] ceph: mds0 reconnect success
 55 2020-06-03T01:04:08.634315+00:00 smithi070 kernel: [  827.045544] ceph: mds0 recovery completed 

mds.c daemon send the revoke request:

181639 2020-06-03T01:04:08.637+0000 7fbd51f2a700 20 mds.0.locker  client.4394 pending pAsLsXsFsc allowed pAsxLsXsxFrw wanted pAsxXsxFxcwb  
181640 2020-06-03T01:04:08.637+0000 7fbd51f2a700  7 mds.0.locker    sending MClientCaps to client.4394 seq 1 new pending pAsLsXs was pAsLsX       sFsc
181641 2020-06-03T01:04:08.637+0000 7fbd51f2a700 20 mds.0.cache.ino(0x10000000392) encode_cap_message pfile 0 pauth 0 plink 0 pxattr 0 ctim       e 2020-06-03T01:00:43.722299+0000
181642 2020-06-03T01:04:08.637+0000 7fbd51f2a700 10 mds.0.cache.ino(0x10000000392)     including xattrs v 1
181643 2020-06-03T01:04:08.637+0000 7fbd51f2a700 10 mds.0.23 send_message_client_counted client.4394 seq 5 client_caps(revoke ino 0x1000000       0392 65 seq 1 caps=pAsLsXs dirty=- wanted=pAsxXsxFxcwb follows 0 size 104857600/213909504 ts 1/18446744073709551615 mtime 2020-06-03       T01:00:43.722299+0000 xattrs(v=1 l=4)) v11
181644 2020-06-03T01:04:08.637+0000 7fbd51f2a700  1 -- [v2:172.21.15.107:6834/313709009,v1:172.21.15.107:6835/313709009] --> v1:172.21.15.7       0:0/3065332795 -- client_caps(revoke ino 0x10000000392 65 seq 1 caps=pAsLsXs dirty=- wanted=pAsxXsxFxcwb follows 0 size 104857600/21       3909504 ts 1/18446744073709551615 mtime 2020-06-03T01:00:43.722299+0000 xattrs(v=1 l=4)) v11 -- 0x56150dcf3c00 con 0x56150c865800
181645 2020-06-03T01:04:08.637+0000 7fbd51f2a700 10 mds.0.locker eval done
181646 2020-06-03T01:04:08.637+0000 7fbd51f2a700  1 -- [v2:172.21.15.107:6834/313709009,v1:172.21.15.107:6835/313709009] --> [v2:172.21.15.       107:6802/15489,v1:172.21.15.107:6803/15489] -- osd_op(unknown.0.23:31 3.3 3:ddb4e4d7:::200.00000002:head [write 1516350~1094 in=1094       b] snapc 0=[] ondisk+write+known_if_redirected+full_force e26) v8 -- 0x56150d6172c0 con 0x56150c864400

mds.c received the cap update request from kclient, but didn't trigger cap->confirm_receipt() to remove the cap from the revoking list:

189429 2020-06-03T01:04:09.249+0000 7fbd51f2a700  1 -- [v2:172.21.15.107:6834/313709009,v1:172.21.15.107:6835/313709009] <== client.4394 v1       :172.21.15.70:0/3065332795 78 ==== client_caps(update ino 0x10000000392 65 seq 1 caps=pAsLsXs dirty=- wanted=pAsxXsxFxwb follows 0 s       ize 104857600/0 mtime 2020-06-03T01:04:03.072493+0000 xattrs(v=17179869185 l=0)) v10 ==== 236+0+0 (unknown 2146284701 0 0) 0x56150d5       17300 con 0x56150c865800
189430 2020-06-03T01:04:09.249+0000 7fbd51f2a700  7 mds.0.locker handle_client_caps  on 0x10000000392 tid 0 follows 0 op update flags 0x0
189431 2020-06-03T01:04:09.249+0000 7fbd51f2a700 20 mds.0.23 get_session have 0x56150c8e3a80 client.4394 v1:172.21.15.70:0/3065332795 state        open
189432 2020-06-03T01:04:09.249+0000 7fbd51f2a700 10 mds.0.locker  head inode [inode 0x10000000392 [2,head] /client.0/tmp/tmp/data/datafile7       1 auth v2721 dirtyparent s=104857600 n(v0 rc2020-06-03T01:04:03.072493+0000 b104857600 1=1+0)/n(v0 rc2020-06-03T01:00:43.722299+0000        b104857600 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4394=0-213909504@1} caps={4394=pAsxLsXsxFsxcrwb/pAsxX       sxFxcwb@3},l=4394 | ptrwaiter=0 lock=0 caps=1 dirtyrstat=1 dirtyparent=1 dirty=1 authpin=0 0x56150dc1d000]
189433 2020-06-03T01:04:09.249+0000 7fbd51f2a700 10 mds.0.locker  follows 0 retains pAsLsXs dirty - on [inode 0x10000000392 [2,head] /clien       t.0/tmp/tmp/data/datafile71 auth v2721 dirtyparent s=104857600 n(v0 rc2020-06-03T01:04:03.072493+0000 b104857600 1=1+0)/n(v0 rc2020-       06-03T01:00:43.722299+0000 b104857600 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4394=0-213909504@1} caps={4       394=pAsxLsXsxFsxcrwb/pAsxXsxFxcwb@3},l=4394 | ptrwaiter=0 lock=0 caps=1 dirtyrstat=1 dirtyparent=1 dirty=1 authpin=0 0x56150dc1d000]
189434 2020-06-03T01:04:09.249+0000 7fbd51f2a700 10 mds.0.locker  wanted pAsxXsxFxcwb -> pAsxXsxFxwb
189435 2020-06-03T01:04:09.249+0000 7fbd51f2a700 10 mds.0.locker _do_cap_update dirty - issued pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb on [inod       e 0x10000000392 [2,head] /client.0/tmp/tmp/data/datafile71 auth v2721 dirtyparent s=104857600 n(v0 rc2020-06-03T01:04:03.072493+0000        b104857600 1=1+0)/n(v0 rc2020-06-03T01:00:43.722299+0000 b104857600 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock)        cr={4394=0-213909504@1} caps={4394=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@3},l=4394 | ptrwaiter=0 lock=0 caps=1 dirtyrstat=1 dirtyparent=1 dir       ty=1 authpin=0 0x56150dc1d000]

Then at last the revoking caps timeout:


278783 2020-06-03T01:05:11.103+0000 7fbd50727700 20 mds.0.locker caps_tick age = 62.463717 client.4394.0x10000000392
278784 2020-06-03T01:05:11.103+0000 7fbd50727700  0 log_channel(cluster) log [WRN] : client.4394 isn't responding to mclientcaps(revoke), i       no 0x10000000392 pending pAsxLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 62.463717 seconds ago
278785 2020-06-03T01:05:11.103+0000 7fbd50727700 20 mds.0.locker caps_tick client.4394 isn't responding to mclientcaps(revoke), ino 0x10000       000392 pending pAsxLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 62.463717 seconds ago


Related issues 2 (0 open2 closed)

Copied to CephFS - Backport #46190: octopus: mds: cap revoking requests didn't success when the client doing reconnection ResolvedWei-Chung ChengActions
Copied to CephFS - Backport #46191: nautilus: mds: cap revoking requests didn't success when the client doing reconnection ResolvedWei-Chung ChengActions
Actions #1

Updated by Xiubo Li almost 4 years ago

  • Status changed from New to In Progress
Actions #2

Updated by Patrick Donnelly almost 4 years ago

  • Target version set to v16.0.0
  • Source set to Development

Xiubo, which teuthology test is this from?

Actions #3

Updated by Xiubo Li almost 4 years ago

Patrick Donnelly wrote:

Xiubo, which teuthology test is this from?

Sorry, forgot to copy it, this the second issue from your mail:

http://pulpito.ceph.com/pdonnell-2020-06-03_00:37:03-kcephfs-wip-pdonnell-testing-20200602.153347-distro-basic-smithi/5114808/

Actions #4

Updated by Xiubo Li almost 4 years ago

From the mds.c logs, we can see that:

client_caps(revoke ino 0x10000000392 65 seq 1 > pending pAsLsXs was pAsLsXsFsc
client_caps(grant ino 0x10000000392 65 seq 2 > pending pAsxLsXsxFrw was pAsLsXs
client_caps(grant ino 0x10000000392 65 seq 3 > pending pAsxLsXsxFsxcrwb was pAsxLsXsxFrw
client_caps(update ino 0x10000000392 65 seq 1 > caps=pAsLsXs dirty=- wanted=pAsxXsxFxwb
client_caps(update ino 0x10000000392 65 seq 3 ==> caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=pAsxXsxFxwb

Before the revoke is completed, there has 2 new grant issued, this will cause the cap->confirm_receipt() won't remove the cap from the revoking_caps list.

Actions #5

Updated by Xiubo Li almost 4 years ago

  • Pull request ID set to 35518
Actions #6

Updated by Xiubo Li almost 4 years ago

  • Status changed from In Progress to Fix Under Review
Actions #7

Updated by Xiubo Li almost 4 years ago

Xiubo Li wrote:

From the mds.c logs, we can see that:

client_caps(revoke ino 0x10000000392 65 seq 1 > pending pAsLsXs was pAsLsXsFsc
client_caps(grant ino 0x10000000392 65 seq 2 > pending pAsxLsXsxFrw was pAsLsXs
client_caps(grant ino 0x10000000392 65 seq 3 > pending pAsxLsXsxFsxcrwb was pAsxLsXsxFrw
client_caps(update ino 0x10000000392 65 seq 1 > caps=pAsLsXs dirty=- wanted=pAsxXsxFxwb
client_caps(update ino 0x10000000392 65 seq 3 ==> caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=pAsxXsxFxwb

Before the revoke is completed, there has 2 new grant issued, this will cause the cap->confirm_receipt() won't remove the cap from the revoking_caps list.

The second grant will remove the revoking caps from cap->_revokes and update the cap->_pending, but left the cap item in the Locker::revoking_caps_by_client and Locker::revoking_caps lists. And even the revoke reply arrived, it won't get any chance to be removed from the two lists.

Actions #8

Updated by Xiubo Li almost 4 years ago

Xiubo Li wrote:

From the mds.c logs, we can see that:

client_caps(revoke ino 0x10000000392 65 seq 1 > pending pAsLsXs was pAsLsXsFsc
client_caps(grant ino 0x10000000392 65 seq 2 > pending pAsxLsXsxFrw was pAsLsXs
client_caps(grant ino 0x10000000392 65 seq 3 > pending pAsxLsXsxFsxcrwb was pAsxLsXsxFrw

The revoke request was issued just after the MDS replayed the pending cap flush requests from kclient, which were arrived when the MDS daemon was in reconnect state. And just after the cap flush requests replayed, the revoke request was issued.

And just after that the MDS daemon will run `mdcache->reissue_all_caps();`, and it will issue the followed cap grant requests.

Because the grant requests were issued before the update replay for the revoke request, and the Locker's _pending/_issued... member were already all updated by them.

client_caps(update ino 0x10000000392 65 seq 1 > caps=pAsLsXs dirty=- wanted=pAsxXsxFxwb
client_caps(update ino 0x10000000392 65 seq 3 ==> caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=pAsxXsxFxwb

Before the revoke is completed, there has 2 new grant issued, this will cause the cap->confirm_receipt() won't remove the cap from the revoking_caps list.

Actions #9

Updated by Xiubo Li almost 4 years ago

LOCK_MIX is is a transition state for muti MDSs could do read/write
at the same time, but the Fcb caps are not allowed.

When the Fc cap is issued, the lock state should be excl.

Actions #10

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to octopus,nautilus
Actions #11

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #46190: octopus: mds: cap revoking requests didn't success when the client doing reconnection added
Actions #12

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #46191: nautilus: mds: cap revoking requests didn't success when the client doing reconnection added
Actions #13

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF