Project

General

Profile

Actions

Bug #65536

open

mds: after the unresponsive client was evicted the blocked slow requests were not successfully cleaned up

Added by Xiubo Li 13 days ago. Updated 13 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Firstly a client.188978:3 lookup #0x10000000000/csi client request came and then was added to the waiter list:

2024-04-16T10:37:22.345971660Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] <== client.188978 10.128.2.216:0/1255716376 5 ==== client_request(client.188978:3 lookup #0x10000000000/csi 2024-04-16T10:37:22.344097+0000 caller_uid=0, caller_gid=0{}) ==== 171+0+0 (secure 0 0 0) 0x562fa71e3500 con 0x562fa80dd400                                                                                                                                
2024-04-16T10:37:22.345997635Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  4 mds.0.server handle_client_request client_request(client.188978:3 lookup #0x10000000000/csi 2024-04-16T10:37:22.344097+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:37:22.346010537Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.80 get_session have 0x562fa774aa00 client.188978 10.128.2.216:0/1255716376 state open
2024-04-16T10:37:22.346022260Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 15 mds.0.server  oldest_client_tid=2
2024-04-16T10:37:22.346037677Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.cache request_start request(client.188978:3 nref=2 cr=0x562fa71e3500)
2024-04-16T10:37:22.346050365Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.server dispatch_client_request client_request(client.188978:3 lookup #0x10000000000/csi 2024-04-16T10:37:22.344097+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:37:22.346061484Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.cache traverse: opening base ino 0x10000000000 snap head
2024-04-16T10:37:22.346072524Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 12 mds.0.cache traverse: path seg depth 0 'csi' snapid head
2024-04-16T10:37:22.346083810Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.cache.dir(0x10000000000) lookup (csi, 'head')
2024-04-16T10:37:22.346095866Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.cache.dir(0x10000000000)   hit -> (csi,head)
2024-04-16T10:37:22.346107337Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.cache found target [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=1 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346125897Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.cache path_traverse finish on snapid head
...
2024-04-16T10:37:22.346523908Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.locker acquire_locks request(client.188978:3 nref=3 cr=0x562fa71e3500)
2024-04-16T10:37:22.346534976Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  lov = [LockOp(l=(ilink sync),f=0x1),LockOp(l=(iauth sync->lock),f=0x1),LockOp(l=(ixattr sync),f=0x1),LockOp(l=(ifile sync),f=0x1)]
2024-04-16T10:37:22.346545445Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  auth_pin_nonblocking=0
2024-04-16T10:37:22.346556238Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  must rdlock (ilink sync) [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346567022Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  must rdlock (iauth sync->lock) [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346577606Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  must rdlock (ixattr sync) [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346591075Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 20 mds.0.locker  must rdlock (ifile sync) [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346602372Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.locker  must authpin [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346612924Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.locker  already auth_pinned [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346623594Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.locker rdlock_start  on (ifile sync) on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=1 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346634420Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.locker  got rdlock on (ifile sync r=1) [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) (ifile sync r=1) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346645276Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.locker rdlock_start  on (iauth sync->lock) on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) (ifile sync r=1) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346656142Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640  7 mds.0.locker rdlock_start waiting on (iauth sync->lock) on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) (ifile sync r=1) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:37:22.346666800Z debug 2024-04-16T10:37:22.344+0000 7f5b98c56640 10 mds.0.locker nudge_log (iauth sync->lock) on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) (ifile sync r=1) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=1 lock=2 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]

This lookup request was blocked by another old client.188843:31 setxattr #0x10000000001 request, which would need to revoke As caps from client.187061:

2024-04-16T10:36:39.299251830Z debug 2024-04-16T10:36:39.297+0000 7f5b98c56640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] <== client.188843 10.128.2.216:0/3406695915 89 ==== client_request(client.188843:31 setxattr #0x10000000001 ceph.dir.layout.pool 2024-04-16T10:36:39.297362+0000 caller_uid=0, caller_gid=0{}) ==== 188+0+39 (secure 0 0 0) 0x562fa4bc1800 con 0x562fa527e000
2024-04-16T10:36:39.299274015Z debug 2024-04-16T10:36:39.297+0000 7f5b98c56640  4 mds.0.server handle_client_request client_request(client.188843:31 setxattr #0x10000000001 ceph.dir.layout.pool 2024-04-16T10:36:39.297362+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.297+0000 7f5b98c56640 20 mds.0.80 get_session have 0x562fa6735700 client.188843 10.128.2.216:0/3406695915 state open
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 15 mds.0.server  oldest_client_tid=31
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  7 mds.0.cache request_start request(client.188843:31 nref=2 cr=0x562fa4bc1800)
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  7 mds.0.server dispatch_client_request client_request(client.188843:31 setxattr #0x10000000001 ceph.dir.layout.pool 2024-04-16T10:36:39.297362+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 10 mds.0.server handle_set_vxattr ceph.dir.layout.pool val 39 bytes on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (inest lock) caps={187061=pAsLsXs/-@9,188843=pAsLsXsFs/-@6} | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=0 authpin=0 distepin 0x562fa45a8680]
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 20 Session check_access path /volumes/csi
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 10 MDSAuthCap is_capable inode(path /volumes/csi owner 0:0 mode 040755) by caller 0:0 mask 64 new 372:0 cap: MDSAuthCaps[allow *]
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 10 mds.0.locker acquire_locks request(client.188843:31 nref=2 cr=0x562fa4bc1800)
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 20 mds.0.locker  lov = [LockOp(l=(ipolicy sync),f=0x4),LockOp(l=(isnap sync),f=0x1)]
2024-04-16T10:36:39.299336022Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 20 mds.0.locker  auth_pin_nonblocking=0
...
2024-04-16T10:36:39.299545033Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 10 mds.0.80 send_message_client_counted client.187061 seq 1857 client_caps(revoke ino 0x10000000001 442198 seq 10 caps=pLsXs dirty=- wanted=- follows 0 size 0/0 ts 1/18446744073709551615 mtime 2024-04-16T10:35:53.792712+0000 ctime 2024-04-16T10:35:53.792712+0000 change_attr 161)
2024-04-16T10:36:39.299565463Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] --> 10.128.2.232:0/2332740133 -- client_caps(revoke ino 0x10000000001 442198 seq 10 caps=pLsXs dirty=- wanted=- follows 0 size 0/0 ts 1/18446744073709551615 mtime 2024-04-16T10:35:53.792712+0000 ctime 2024-04-16T10:35:53.792712+0000 change_attr 161) -- 0x562fa78c6700 con 0x562fa4c8bc00
2024-04-16T10:36:39.299585605Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  7 mds.0.locker get_allowed_caps allowed=pLsXsFscrl, xlocker allowed=pLsXsFscrl on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=2 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (isnap sync r=1) (inest lock) (ipolicy xlock x=1 by request(client.188843:31 nref=5 cr=0x562fa4bc1800)) (iversion lock w=1 last_client=188843) (iquiesce lock w=1 last_client=188843) caps={187061=pLsXs/pAsLsXs/-@10,188843=pAsLsXsFs/-@6} | dirtyscattered=0 request=1 lock=4 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=0 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:36:39.299600539Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 20 mds.0.locker  client.188843 pending pAsLsXsFs allowed pLsXsFs wanted -
2024-04-16T10:36:39.299625338Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  7 mds.0.locker    sending MClientCaps to client.188843 seq 7 new pending pLsXsFs was pAsLsXsFs
2024-04-16T10:36:39.299625338Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 20 mds.0.cache.ino(0x10000000001) encode_cap_message pfile 0 pauth 0 plink 0 pxattr 0 mtime 2024-04-16T10:35:53.792712+0000 ctime 2024-04-16T10:35:53.792712+0000 change_attr 161
2024-04-16T10:36:39.299625338Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640 10 mds.0.80 send_message_client_counted client.188843 seq 12 client_caps(revoke ino 0x10000000001 442248 seq 7 caps=pLsXsFs dirty=- wanted=- follows 0 size 0/0 ts 1/18446744073709551615 mtime 2024-04-16T10:35:53.792712+0000 ctime 2024-04-16T10:35:53.792712+0000 change_attr 161)
2024-04-16T10:36:39.299625338Z debug 2024-04-16T10:36:39.298+0000 7f5b98c56640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] --> 10.128.2.216:0/3406695915 -- client_caps(revoke ino 0x10000000001 442248 seq 7 caps=pLsXsFs dirty=- wanted=- follows 0 size 0/0 ts 1/18446744073709551615 mtime 2024-04-16T10:35:53.792712+0000 ctime 2024-04-16T10:35:53.792712+0000 change_attr 161) -- 0x562fa5367500 con 0x562fa527e000    

But for some reason the client.187061 was lost connection suddenly and the revoke request didn't get any ack from it and then the client was evicted by MDS:

2024-04-16T10:38:44.173391735Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 10 mds.0.server find_idle_sessions. last cleared laggy state 4009.11s ago
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 10 mds.0.server skipping session client.187061 10.128.2.232:0/2332740133, timeout (300) specified and renewed caps recently (142.826s ago)
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 20 mds.0.server laggiest active session is client.24281 100.64.0.5:0/208199436 and renewed caps recently (5.47107s ago)
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 10 mds.0.locker scatter_tick
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 20 mds.0.locker caps_tick 1 revoking caps
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 20 mds.0.locker caps_tick age = 124.873832 client.187061.0x10000000001
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640  0 log_channel(cluster) log [WRN] : client.187061 isn't responding to mclientcaps(revoke), ino 0x10000000001 pending pLsXs issued pAsLsXs, sent 124.873832 seconds ago
2024-04-16T10:38:44.173443165Z debug 2024-04-16T10:38:44.172+0000 7f5b97453640 20 mds.0.locker caps_tick client.187061 isn't responding to mclientcaps(revoke), ino 0x10000000001 pending pLsXs issued pAsLsXs, sent 124.873832 seconds ago 
...
2024-04-16T10:41:24.178425473Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  0 log_channel(cluster) log [WRN] : 3 slow requests, 1 included below; oldest blocked for > 284.879140 secs
2024-04-16T10:41:24.178482691Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  0 log_channel(cluster) log [WRN] : slow request 241.832408 seconds old, received at 2024-04-16T10:37:22.345896+0000: client_request(client.188978:3 lookup #0x10000000000/csi 2024-04-16T10:37:22.344097+0000 caller_uid=0, caller_gid=0{}) currently cleaned up request
2024-04-16T10:41:24.178502998Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 10 mds.0.log trim 128 / 128 segments, 90872 / -1 events, 0 (0) expiring, 0 (0) expired
2024-04-16T10:41:24.178525012Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 10 mds.0.log _trim_expired_segments waiting for 4877189/9349108098 to expire
2024-04-16T10:41:24.178536293Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 10 mds.0.server find_idle_sessions. last cleared laggy state 4169.11s ago
2024-04-16T10:41:24.178547491Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 20 mds.0.server evicting session client.187061 10.128.2.232:0/2332740133 since autoclose has arrived
2024-04-16T10:41:24.178558828Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 20 mds.0.server laggiest active session is client.189125 10.128.2.216:0/2105887331 and renewed caps recently (18.9683s ago)
2024-04-16T10:41:24.178570107Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  0 log_channel(cluster) log [WRN] : evicting unresponsive client openshift-storage-ocs-storagecluster-cephnfs (187061), after 302.831 seconds
2024-04-16T10:41:24.178581356Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640 10 mds.0.server autoclosing stale session client.187061 10.128.2.232:0/2332740133 last renewed caps 302.831s ago
2024-04-16T10:41:24.178600021Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  1 mds.0.80 Evicting (and blocklisting) client session 187061 (10.128.2.232:0/2332740133)                                                                                      2024-04-16T10:41:24.178617834Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 187061 (10.128.2.232:0/2332740133)
2024-04-16T10:41:24.178629081Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  4 mds.0.80 Preparing blocklist command... (wait=0)
2024-04-16T10:41:24.178639896Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  4 mds.0.80 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"10.128.2.232:0/2332740133"}
2024-04-16T10:41:24.179250084Z debug 2024-04-16T10:41:24.178+0000 7f5b97453640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] --> v2:172.30.154.106:3300/0 -- mon_command({"prefix":"osd blocklist", "blocklistop":"add","addr":"10.128.2.232:0/2332740133"} v 0) -- 0x562fa8368000 con 0x562fa4562800

Then the revoke stuck caps was removed by MDS after the corresponding client was blocklisted:

2024-04-16T10:41:24.232351421Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 20 mds.0.server  killing capability pAsLsXs on [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=1 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:41:24.232351421Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 10 mds.0.cache.ino(0x10000000001) remove_client_cap last cap, leaving realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 2024-04-15T17:06:00.515604+0000 change_attr 0 0x562fa45ac000)                            

Just before this the client.188978 closed the session and the blocked request client.188978:3 was killed:

2024-04-16T10:39:58.181327563Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640 10 mds.0.server kill_session 0x562fa774aa00
2024-04-16T10:39:58.181346482Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640 10 mds.0.server journal_close_session : client.188978 10.128.2.216:0/1255716376 pending_prealloc_inos [] free_prealloc_inos [] delegated_inos []
2024-04-16T10:39:58.181358961Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640 20 mds.0.sessionmap mark_projected s=0x562fa774aa00 name=client.188978 pv=737759 -> 737760
2024-04-16T10:39:58.181374045Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640 10 mds.0.cache request_kill request(client.188978:3 nref=4 cr=0x562fa71e3500) 
2024-04-16T10:39:58.181389776Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640 15 mds.0.cache request_cleanup request(client.188978:3 nref=4 cr=0x562fa71e3500)                                                                                               2024-04-16T10:39:58.181420086Z debug 2024-04-16T10:39:58.180+0000 7f5b98c56640  7 mds.0.locker rdlock_finish on (dn sync) on [dentry #0x1/volumes/csi [2,head] auth (dversion lock) v=6994 ap=1 ino=0x10000000001 state=1610612736 | request=1 lock=0 inodepin=1 dirty=1 authpin=1 clientlease=0 0x562fa4520f00]

Then the client.188978:3 was retried but it did nothing just returned:

2024-04-16T10:41:24.232464744Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-04-16T10:41:24.232464744Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 20 mds.0.cache dispatch_request: dead request(client.188843:31 nref=1 cr=0x562fa4bc1800)
2024-04-16T10:41:24.232464744Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 10 MDSContext::complete: 18C_MDS_RetryRequest
2024-04-16T10:41:24.232464744Z debug 2024-04-16T10:41:24.232+0000 7f5b92c4a640 20 mds.0.cache dispatch_request: dead request(client.188978:3 nref=2 cr=0x562fa71e3500)  

From the above logs we can see that the nref == 2, which is incorrect.

This is because just before the request was killed another client sent a same lookup request and bactched it to the blocked one client.188978:3 :

2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  1 -- [v2:10.131.0.97:6800/3756935782,v1:10.131.0.97:6801/3756935782] <== client.189134 10.128.2.216:0/2794864414 5 ==== client_request(client.189134:3 lookup #0x10000000000/csi 2024-04-16T10:40:05.137957+0000 caller_uid=0, caller_gid=0{}) ==== 171+0+0 (secure 0 0 0) 0x562fa4bbd800 con 0x562fa4c3f800
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  4 mds.0.server handle_client_request client_request(client.189134:3 lookup #0x10000000000/csi 2024-04-16T10:40:05.137957+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 20 mds.0.80 get_session have 0x562fa6e79900 client.189134 10.128.2.216:0/2794864414 state open
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 15 mds.0.server  oldest_client_tid=2
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  7 mds.0.cache request_start request(client.189134:3 nref=2 cr=0x562fa4bbd800)
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  7 mds.0.server dispatch_client_request client_request(client.189134:3 lookup #0x10000000000/csi 2024-04-16T10:40:05.137957+0000 caller_uid=0, caller_gid=0{})
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  7 mds.0.cache traverse: opening base ino 0x10000000000 snap head
2024-04-16T10:40:05.139667308Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 12 mds.0.cache traverse: path seg depth 0 'csi' snapid head
2024-04-16T10:40:05.139687717Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 20 mds.0.cache.dir(0x10000000000) lookup (csi, 'head')
2024-04-16T10:40:05.139687717Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 20 mds.0.cache.dir(0x10000000000)   hit -> (csi,head)
2024-04-16T10:40:05.139687717Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640  7 mds.0.cache found target [inode 0x10000000001 [...2,head] /volumes/csi/ auth v6994 ap=1 DISTEPHEMERALPIN f(v0 m2024-04-16T10:35:53.792712+0000 2=0+2) n(v292 rc2024-04-16T10:35:53.792712+0000 b319 7=2+5) (iauth sync->lock) (inest lock) caps={187061=pLsXs/pAsLsXs/-@10} | dirtyscattered=0 request=0 lock=0 dirfrag=1 caps=1 dirtyrstat=0 dirty=1 waiter=1 authpin=1 distepin 0x562fa45a8680]
2024-04-16T10:40:05.139687717Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 10 mds.0.cache path_traverse finish on snapid head
2024-04-16T10:40:05.139687717Z debug 2024-04-16T10:40:05.139+0000 7f5b98c56640 20 mds.0.server handle_client_getattr: LOOKUP op, wait for previous same getattr ops to respond. request(client.189134:3 nref=3 cr=0x562fa4bbd800)

When batching it will add the batch head request's reference, but when mds detects the client.188978:3 is dead it doesn't choose a new batch head for it and doesn't release the reference too.

Actions #1

Updated by Xiubo Li 13 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56941
Actions

Also available in: Atom PDF