Bug #46904
kclient: cluster [WRN] client.4478 isn't responding to mclientcaps(revoke)
Status: Closed
Description
"2020-08-12T00:25:02.919422+0000 mds.f (mds.1) 1 : cluster [WRN] client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago" in cluster log
2020-08-12T00:25:02.915+0000 7f3017f1c700 10 mds.1.locker scatter_tick
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick 1 revoking caps
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick age = 60.556674 client.4478.0x1000000076a
2020-08-12T00:25:02.915+0000 7f3017f1c700 0 log_channel(cluster) log [WRN] : client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.locker caps_tick client.4478 isn't responding to mclientcaps(revoke), ino 0x1000000076a pending pAsLsXsFr issued pAsLsXsFrw, sent 60.556674 seconds ago
2020-08-12T00:25:02.915+0000 7f3017f1c700 20 mds.1.bal handle_export_pins export_pin_queue size=0
2020-08-12T00:25:02.915+0000 7f3017f1c700 15 mds.1.bal tick tick last_sample now 899.914346s
2020-08-12T00:25:02.915+0000 7f3017f1c700 15 mds.1.cache show_subtrees
...
2020-08-12T00:25:03.907+0000 7f3015717700 20 mds.1.cache upkeep thread waiting interval 1s
2020-08-12T00:25:04.603+0000 7f301971f700 1 -- [v2:172.21.15.148:6839/1573709161,v1:172.21.15.148:6841/1573709161] <== client.4478 v1:172.21.15.131:0/1823065202 120 ==== client_caps(update ino 0x1000000076a 224 seq 8 tid 1726 caps=pAsLsXs dirty=Fw wanted=- follows 1 mseq 1 size 104857600/0 mtime 2020-08-12T00:22:35.869663+0000) v10 ==== 236+0+0 (unknown 3013966134 0 0) 0x5609975b5500 con 0x5609964fb000
2020-08-12T00:25:04.603+0000 7f301971f700 7 mds.1.locker handle_client_caps on 0x1000000076a tid 1726 follows 1 op update flags 0x0
2020-08-12T00:25:04.603+0000 7f301971f700 20 mds.1.8 get_session have 0x560997378880 client.4478 v1:172.21.15.131:0/1823065202 state open
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker head inode [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.cache pick_inode_snap follows 1 on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker follows 1 retains pAsLsXs dirty Fw on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 7 mds.1.locker flush client.4478 dirty Fw seq 8 on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/pAsxXsxFxwb@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker wanted pAsxXsxFxwb -> -
2020-08-12T00:25:04.603+0000 7f301971f700 10 mds.1.locker _do_cap_update dirty Fw issued pAsLsXs wanted - on [inode 0x1000000076a [2,head] /client.1/tmp/tmp/data/datafile54 auth{0=1} v2805 ap=2 dirtyparent s=104857600 n(v0 rc2020-08-12T00:22:35.869663+0000 b104857600 1=1+0) (ifile mix->sync) (iversion lock) cr={4478=0-213909504@1} caps={4478=pAsLsXs/-@8} mcw={0=AsXs} | ptrwaiter=0 lock=0 importingcaps=0 caps=1 dirtyparent=1 replicated=1 dirty=1 waiter=0 authpin=1 0x56099745b080]
2020-08-12T00:25:04.603+0000 7f301971f700 20 mds.1.locker inode is file
Just after it timed out and logged the WRN, about 2 seconds later the MDS received the revoke reply.
Updated by Jeff Layton over 3 years ago
Hmm, ok. So the problem eventually resolved itself? Is the issue just that it just took a bit too long?
Updated by Xiubo Li over 3 years ago
Jeff Layton wrote:
Hmm, ok. So the problem eventually resolved itself? Is the issue just that it just took a bit too long?
Yeah, I have checked 3 failed test cases; they all just took a little longer than expected, usually by several seconds. Since there is no kmsg log, I couldn't tell what happened in the kclient. I tried to reproduce it locally, but failed.
Updated by Jeff Layton over 3 years ago
Hmm...I wonder if we have something in the kclient that is delaying revoke replies. I'll have to think about how we might track that down.
Updated by Jeff Layton over 3 years ago
I suspect what's happening is that we end up putting the last reference on some caps, queue the thing up for a release and then don't actually send it yet because we're waiting for caps_wanted_delay_max to time out. Note:
#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT 60 /* cap release delay */
...so we'll try to hold on to caps for that long in some cases. The fact that it clears up just after 60s makes me think that this is just a case of interlocking timeouts between the above constant and the warning on the MDS (which also appears to be on a 60s timeout, and probably starts just before this one).
If this is easily reproducible, then it might be good to see whether mounting with something like "-o caps_wanted_delay_max=30" helps this. That's not a real fix though, as we don't want to hold revoked caps any longer than necessary, and it seems likely that that's what's occurring here.
Updated by Xiubo Li over 3 years ago
This is the same issue with https://tracker.ceph.com/issues/47565.
Updated by Xiubo Li over 3 years ago
Jeff Layton wrote:
I suspect what's happening is that we end up putting the last reference on some caps, queue the thing up for a release and then don't actually send it yet because we're waiting for caps_wanted_delay_max to time out. Note:
#define CEPH_CAPS_WANTED_DELAY_MAX_DEFAULT 60 /* cap release delay */
...so we'll try to hold on to caps for that long in some cases. The fact that it clears up just after 60s makes me think that this is just a case of interlocking timeouts between the above constant and the warning on the MDS (which also appears to be on a 60s timeout, and probably starts just before this one).
If this is easily reproducible, then it might be good to see whether mounting with something like "-o caps_wanted_delay_max=30" helps this. That's not a real fix though, as we don't want to hold revoked caps any longer than necessary, and it seems likely that that's what's occurring here.
Sorry Jeff, I might have missed this.
Checked the kclient code; yeah, it could possibly delay the ack to the MDS by 60 seconds.
How about shortening the delay in the revocation case? The default is 60s; let's use 30s when caps are being revoked?
Updated by Xiubo Li over 1 year ago
The MDS was waiting for Fw caps:
caps={4478=pAsLsXsFr/pAsLsXsFrw/pAsxXsxFxwb@8}
And I checked the kclient code again more carefully; there is only one path that could cause the kclient to defer the cap update: it was waiting for the async create to finish.
So we need to make sure the MDS won't stall the async create for any reason, such as deferring flushing the mdlog.
Updated by Xiubo Li over 1 year ago
- Project changed from Linux kernel client to CephFS
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 48506
- Component(FS) kceph added
Updated by Xiubo Li over 1 year ago
- Project changed from CephFS to Linux kernel client
Fixed in the kernel; the patchwork link: https://patchwork.kernel.org/project/ceph-devel/list/?series=686074
Updated by Xiubo Li over 1 year ago
- Status changed from Fix Under Review to Resolved
Applied to mainline, closing this tracker.