Project

General

Profile

Actions

Bug #46904

closed

kclient: cluster [WRN] client.4478 isn't responding to mclientcaps(revoke)

Added by Xiubo Li over 3 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

From https://pulpito.ceph.com/xiubli-2020-08-11_04:26:09-multimds-wip-lxb-testing-defer-2020-0811-1023-distro-basic-smithi/5340304/teuthology.log

"2020-08-12T00:25:02.919422+0000 mds.f (mds.1) 1 : cluster [WRN] client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago" in cluster log

From https://pulpito.ceph.com/xiubli-2020-08-11_04:26:09-multimds-wip-lxb-testing-defer-2020-0811-1023-distro-basic-smithi/5340304/smithi148/log/ceph-mds.f.log.gz


2020-08-12T00:25:02.915+0000 7f3017f1c700 10 mds.1.locker scatter_tick
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick 1 revoking caps
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick age = 60.556674 client.4478.0x1000000076a
2020-08-12T00:25:02.915+0000 7f3017f1c700  0 log_channel(cluster) log [WRN] : client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.bal handle_export_pins export_pin_queue size=0
2020-08-12T00:25:02.915+0000 7f3017f1c700 15 mds.1.bal tick tick last_sample now 899.914346s
2020-08-12T00:25:02.915+0000 7f3017f1c700 15 mds.1.cache show_subtrees
...
2020-08-12T00:25:03.907+0000 7f3015717700 20 mds.1.cache upkeep thread waiting interval 1s
2020-08-12T00:25:04.603+0000 7f301971f700  1 -- [v2:172.21.15.148:6839/1573709161,v1:172.21.15.148:6841/1573709161] <== client.4478 v1:172.21.15.131:0/1823065202 120 ==== client_caps(update ino 0x1000000076a 224 seq 8 tid 1726 caps=pAsLsXs dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2020-08-12T00:22:35.869663+0000) v10 ==== 236+0+0 (unknown 3013966134 0 0) 0x5609975b5500 con 0x5609964fb000
2020-08-12T00:25:04.603+0000 7f301971f700  7 mds.1.locker handle_client_caps  on 0x1000000076a tid 1726 follows 1 op update flags 0x0
2020-08-12T00:25:04.603+0000 7f301971f700 20 mds.1.8 get_session have 0x560997378880 client.4478 v1:172.21.15.131:0/1823065202 state open
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker  head inode [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.cache pick_inode_snap follows 1 on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker  follows 1 retains pAsLsXs dirty Fw on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700  7 mds.1.locker  flush client.4478 dirty Fw seq 8 on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker  wanted pAsxXsxFxwb -> -
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker _do_cap_update dirty Fw issued pAsLsXs wanted - on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/-@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 20 mds.1.locker inode is file

Just after it timedout and give a WRN, 2 seconds later the mds received the revoke reply.

Actions #1

Updated by Jeff Layton over 3 years ago

Hmm, ok. So the problem eventually resolved itself? Is the issue just that it just took a bit too long?

Actions #2

Updated by Xiubo Li over 3 years ago

Jeff Layton wrote:

Hmm, ok. So the problem eventually resolved itself? Is the issue just that it just took a bit too long?

Yeah, I have checked 3 test failure cases, they all just took a little longer than expected, it usually several seconds. Since there has no any kmesg log, I couldn't know what happened in kclient. I was trying to reproduce it locally, but failed.

Actions #3

Updated by Jeff Layton over 3 years ago

Hmm...I wonder if we have something in the kclient that is delaying revoke replies. I'll have to think about how we might track that down.

Actions #4

Updated by Jeff Layton over 3 years ago

I suspect what's happening is that we end up putting the last reference on some caps, queue the thing up for a release and then don't actually send it yet because we're waiting for caps_wanted_delay_max to time out. Note:

#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT     60  /* cap release delay */

...so we'll try to hold on to caps for that long in some cases. The fact that it clears up just after 60s makes me think that this is just a case of interlocking timeouts between the above constant and the warning on the MDS (which also appears to be on a 60s timeout, and probably starts just before this one).

If this is easily reproducible, then it might be good to see whether mounting with something like "-o caps_wanted_delay_max=30" helps this. That's not a real fix though, as we don't want to hold revoked caps any longer than necessary, and it seems likely that that's what's occurring here.

Actions #5

Updated by Xiubo Li over 3 years ago

This is the same issue with https://tracker.ceph.com/issues/47565.

Actions #6

Updated by Xiubo Li over 3 years ago

Jeff Layton wrote:

I suspect what's happening is that we end up putting the last reference on some caps, queue the thing up for a release and then don't actually send it yet because we're waiting for caps_wanted_delay_max to time out. Note:

#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT 60 /* cap release delay */

...so we'll try to hold on to caps for that long in some cases. The fact that it clears up just after 60s makes me think that this is just a case of interlocking timeouts between the above constant and the warning on the MDS (which also appears to be on a 60s timeout, and probably starts just before this one).

If this is easily reproducible, then it might be good to see whether mounting with something like "-o caps_wanted_delay_max=30" helps this. That's not a real fix though, as we don't want to hold revoked caps any longer than necessary, and it seems likely that that's what's occurring here.

Sorry Jeff, I might miss this.

Checked the kclient code, yeah, it possiblly delayed 60 seconds to ack to MDS.

How about short the delay value in revocation case ? The default is 60s, and let's use 30s in revocation case ?

Actions #7

Updated by Xiubo Li almost 2 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Xiubo Li over 1 year ago

The MDS was waiting for Fw caps:

caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8}

And I checked the kclient code again more carefully, there only one change could cause the kclient to defer the cap updating, that is it was waiting for the async create to be finished.

So we need to make sure the MDS won't stuck the async create for any reason, such as defer flusing the mdlog.

Actions #9

Updated by Xiubo Li over 1 year ago

  • Project changed from Linux kernel client to CephFS
  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 48506
  • Component(FS) kceph added
Actions #10

Updated by Xiubo Li over 1 year ago

  • Project changed from CephFS to Linux kernel client
Actions #11

Updated by Xiubo Li over 1 year ago

  • Status changed from Fix Under Review to Resolved

Applied to mainline, closing this tracker.

Actions

Also available in: Atom PDF