Project

General

Profile

Actions

Bug #39349

closed

mds: cap revokes leak

Added by Xuehan Xu about 5 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
nautilus,mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently, one of our clusters, after updating to 12.2.11, occasionally reports that "XXX clients failing to respond to capability release".
And in the its mds's log, there are log entries as following

client.14021393 isn't responding to mclientcaps(revoke), ino 0x10000b84763 pending pAsLsXsFsc issued pAsLsXsFsc, sent 124.624433 seconds ago

The odd thing is that the pending cap and issued cap are the same.

After a series of debug, we believe that this is due to the Capability::confirm_receipt method failing to judge whether there was a revoke correctly.

Actions #1

Updated by Xuehan Xu about 5 years ago

These exists possibilities as follows:

1. some req make mds do simple_xlock on a filelock that's in LOCK_xlockdone
2. the filelock goes into LOCK_LOCK_XLOCK, and send out a revoke for Fs;
3. the filelock goes into LOCK_XLOCK and set_trace_dist set the filelock to LOCK_XLOCKDONE and make _pending and _issued to "(cap->wanted() | likes) & allowed"
4. when the client ack the revoke, the mds would think that there wasn't a revoke since _issued and _pending are the same.

here's our log(the client involed is client.11576589):

2019-04-17 12:49:31.293174 7f3ca03ff700  7 mds.0.locker xlock_start on (ifile xlockdone x=1) on [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418940 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=1) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@2,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=3 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.293188 7f3ca03ff700  7 mds.0.locker simple_xlock on (ifile xlockdone x=1) on [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418940 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=1) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@2,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=3 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.293201 7f3ca03ff700  7 mds.0.locker issue_caps loner client.11576589 allowed=pAsLsXsFcb, xlocker allowed=pAsLsXsFcb, others allowed=pLsFcb on [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418940 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile lock->xlock x=1) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@2,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=3 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.293218 7f3ca03ff700 20 mds.0.locker  client.11576589 pending pAsLsXsFscb allowed pAsLsXsFcb wanted pAsxXsxFxwb
2019-04-17 12:49:31.293223 7f3ca03ff700  7 mds.0.locker    sending MClientCaps to client.11576589 seq 3 new pending pAsLsXsFcb was pAsLsXsFscb
2019-04-17 12:49:31.293232 7f3ca03ff700 20 mds.0.cache.ino(0x100005f26bd) encode_cap_message pfile 1 pauth 0 plink 0 pxattr 0 ctime 2019-04-17 12:49:31.214204
2019-04-17 12:49:31.293237 7f3ca03ff700 10 mds.0.772 send_message_client_counted client.11576589 seq 1409 client_caps(revoke ino 0x100005f26bd 21443326 seq 3 caps=pAsLsXsFcb dirty=- wanted=pAsxXsxFxwb follows 0 size 0/0 ts 2/0 mtime 2019-04-17 12:49:31.214204) v11
2019-04-17 12:49:31.293252 7f3ca03ff700 20 mds.0.locker  client.18672975 pending pLs allowed pLsFcb wanted -
2019-04-17 12:49:31.293259 7f3ca03ff700 10 mds.0.locker  got xlock on (ifile xlock x=2 by 0x7f3b7d3f2800) [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418940 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlock x=2 by 0x7f3b7d3f2800) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFcb/pAsLsXsFscb/pAsxXsxFxwb@3,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=4 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
.
.
.
.
.
2019-04-17 12:49:31.293691 7f3ca03ff700 10 mds.0.server journal_and_reply tracei 0x7f3c59caed00 tracedn 0
2019-04-17 12:49:31.293694 7f3ca03ff700 10 mds.0.locker set_xlocks_done on (ifile xlock x=2 by 0x7f3b7d3f2800) [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418942 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlock x=2 by 0x7f3b7d3f2800) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFcb/pAsLsXsFscb/pAsxXsxFxwb@3,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=4 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.293707 7f3ca03ff700 10 mds.0.server early_reply 0 ((0) Success) client_request(client.11576589:729501 setattr mtime=2019-04-17 12:49:31.274205 #0x100005f26bd 2019-04-17 12:49:31.274205 caller_uid=0, caller_gid=0{}) v2
2019-04-17 12:49:31.293717 7f3ca03ff700 20 mds.0.server set_trace_dist snapid head
2019-04-17 12:49:31.293719 7f3ca03ff700 10 mds.0.server set_trace_dist snaprealm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x7f3c96c13f80) len=48
2019-04-17 12:49:31.293722 7f3ca03ff700 20 mds.0.cache.ino(0x100005f26bd)  pfile 1 pauth 1 plink 1 pxattr 1 plocal 1 ctime 2019-04-17 12:49:31.274205 valid=1
2019-04-17 12:49:31.293726 7f3ca03ff700 10 mds.0.cache.ino(0x100005f26bd) encode_inodestat issuing pAsLsXsFscb seq 4
2019-04-17 12:49:31.293729 7f3ca03ff700 10 mds.0.cache.ino(0x100005f26bd) encode_inodestat caps pAsLsXsFscb seq 4 mseq 0 xattrv 0 len 0
2019-04-17 12:49:31.293734 7f3ca03ff700 20 mds.0.server set_trace_dist added in   [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418942 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=2) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@4,18672975=pLs/pAsLsXs/-@7},l=1157
.
.
.
.
.
.
2019-04-17 12:49:31.293978 7f3ca03ff700  7 mds.0.locker handle_client_caps async on 0x100005f26bd tid 0 follows 0 op update
2019-04-17 12:49:31.293981 7f3ca03ff700 20 mds.0.772 get_session have 0x7f3c98468b80 client.11576589 10.208.131.16:0/1422987817 state open
2019-04-17 12:49:31.293986 7f3ca03ff700 10 mds.0.locker  head inode [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418942 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=2) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@4,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=4 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.294002 7f3ca03ff700 10 mds.0.locker  follows 0 retains pAsLsXsFcb dirty - on [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418942 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=2) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@4,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=4 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
2019-04-17 12:49:31.294015 7f3ca03ff700 10 mds.0.locker _do_cap_update dirty - issued pAsLsXsFscb wanted pAsxXsxFxwb on [inode 0x100005f26bd [2,head] /kube/volumes/kubernetes-dynamic-pvc-4f0cd128-ac28-11e8-9a44-d249e7ccfcdd/data-5a1c3029a63d39c538da7697730eb506/nddir/gossip.data.10.209.136.57.push.data.sa.push.update.local auth v1418938 pv1418942 ap=5+0 truncating(14 to 0) s=14 n(v0 b14 1=1+0) (iauth sync->excl) (ifile xlockdone x=2) (ixattr sync->excl) (iversion lock w=2 last_client=11576589) caps={11576589=pAsLsXsFscb/pAsxXsxFxwb@4,18672975=pLs/pAsLsXs/-@7},l=11576589 | ptrwaiter=0 request=2 lock=4 caps=1 dirty=1 waiter=0 authpin=1 0x7f3c59caed00]
Actions #2

Updated by Patrick Donnelly about 5 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Xuehan Xu
  • Target version set to v15.0.0
  • Source set to Community (dev)
  • Backport set to nautilus,mimic,luminous
  • Pull request ID set to 27641
  • Component(FS) MDS added
Actions #4

Updated by Xuehan Xu about 5 years ago

Zheng Yan wrote:

already fixed by https://github.com/ceph/ceph/pull/26713

yes, this is already fixed, please close this issue

Actions #5

Updated by Zheng Yan about 5 years ago

  • Status changed from Fix Under Review to Closed
Actions

Also available in: Atom PDF