Bug #47563
qa: kernel client closes session improperly causing eviction due to timeout
Status:
Closed
% Done: 0%
Source: Q/A
Tags:
Backport: octopus, nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, kceph
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2020-09-20T13:43:05.420 INFO:tasks.workunit:Stopping ['suites/pjd.sh'] on client.0...
2020-09-20T13:43:06.310 DEBUG:teuthology.parallel:result is None
2020-09-20T13:43:06.435 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2020-09-20T13:43:06.435 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2020-09-20T13:43:06.447 INFO:tasks.kclient:Unmounting kernel clients...
2020-09-20T13:43:06.447 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2020-09-20T13:43:06.448 INFO:teuthology.orchestra.run:Running command with timeout 900
2020-09-20T13:43:06.448 INFO:teuthology.orchestra.run.smithi093:> sudo umount /home/ubuntu/cephtest/mnt.0
2020-09-20T13:43:51.775 INFO:teuthology.orchestra.run.smithi084:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early fs dump --format=json
From: /ceph/teuthology-archive/pdonnell-2020-09-20_07:13:47-multimds-wip-pdonnell-testing-20200920.040823-distro-basic-smithi/5451505/teuthology.log
(There is also thrashing going on, but that happens after the kernel unmount and is not related.)
Then on the MDS:
2020-09-20T13:43:06.582+0000 7f69fb0ed700 7 mds.0.locker issue_caps loner client.4713 allowed=pAsLsXsFsxcrwb, xlocker allowed=pAsLsXsFsxcrwb, others allowed=pAsLsXs on [inode 0x10000000511 [2,head] ~mds0/stray1/10000000511 auth v2088 pv2302 ap=1 dirtyparent s=1 nl=0 n(v0 rc2020-09-20T13:42:39.280144+0000 b1 1=1+0)->n(v0 rc2020-09-20T13:42:39.280144+0000 b1 1=1+0) (ifile excl w=1) (iversion lock) cr={4713=0-4194304@1} caps={4713=pFc/-@6},l=4713(-1) | ptrwaiter=1 request=0 lock=1 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b1c67e5080]
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.locker client.4713 pending pFc allowed pAsLsXsFsxcrwb wanted -
2020-09-20T13:43:06.582+0000 7f69fb0ed700 7 mds.0.locker sending MClientCaps to client.4713 seq 7 new pending pAsLsXsFsxcrwb was pFc
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.cache.ino(0x10000000511) encode_cap_message pfile 1 pauth 0 plink 0 pxattr 0 ctime 2020-09-20T13:42:39.280144+0000
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.18 send_message_client_counted client.4713 seq 1926 client_caps(grant ino 0x10000000511 1 seq 7 caps=pAsLsXsFsxcrwb dirty=- wanted=Ls follows 0 size 1/0 ts 2/1 mtime 2020-09-20T13:42:39.237145+0000 tws 1) v11
2020-09-20T13:43:06.582+0000 7f69fb0ed700 1 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] --> v1:172.21.15.93:0/4080061763 -- client_caps(grant ino 0x10000000511 1 seq 7 caps=pAsLsXsFsxcrwb dirty=- wanted=Ls follows 0 size 1/0 ts 2/1 mtime 2020-09-20T13:42:39.237145+0000 tws 1) v11 -- 0x55b1c6718900 con 0x55b1c5084000
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.locker eval done
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] dispatch_throttle_release 236 to dispatch throttler 264/104857600
2020-09-20T13:43:06.582+0000 7f69fb0ed700 1 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] <== client.4713 v1:172.21.15.93:0/4080061763 2777 ==== client_session(request_close seq 1925) ==== 28+0+0 (unknown 1265461456 0 0) 0x55b1c6dd4e00 con 0x55b1c5084000
2020-09-20T13:43:06.582+0000 7f69fd8f2700 10 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] >> v1:172.21.15.93:0/4080061763 conn(0x55b1c5084000 legacy=0x55b1c5c1a000 unknown :6827 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2020-09-20T13:43:06.582+0000 7f69fd8f2700 10 --1- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] >> v1:172.21.15.93:0/4080061763 conn(0x55b1c5084000 0x55b1c5c1a000 :6827 s=OPENED pgs=13 cs=1 l=0).write_event
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.18 get_session have 0x55b1c4305600 client.4713 v1:172.21.15.93:0/4080061763 state open
2020-09-20T13:43:06.582+0000 7f69fb0ed700 3 mds.0.server handle_client_session client_session(request_close seq 1925) from client.4713
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.server old push seq 1925 < 1926, dropping
The MDS drops the close request because its push seq (1925) is stale: a cap grant bumped the session seq to 1926 while the close was in flight. (This could arguably be a bug in the MDS too!) The connection was never shut down, so the kclient keeps waiting for the MDS to respond. The thrasher then caused a failover, which appears to have allowed the kernel to finish unmounting:
2020-09-20T13:44:06.839 INFO:tasks.mds_thrash.fs.[cephfs]:waiting till mds map indicates mds.c is laggy/crashed, in failed state, or mds.c is removed from mdsmap
2020-09-20T13:44:07.235 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-09-20T13:44:07.235 INFO:teuthology.orchestra.run.smithi093:> (cd /home/ubuntu/cephtest && exec rmdir -- /home/ubuntu/cephtest/mnt.0)
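The race visible in the MDS log above can be sketched as a small model. This is a hypothetical, simplified illustration (all names invented; not the actual Ceph source): the MDS pushes a cap grant, incrementing the session's push seq from 1925 to 1926, while the client's request_close carrying seq 1925 is already on the wire, so the close arrives stale, the MDS drops it, and the client is left waiting for a reply that never comes.

```python
# Hypothetical, simplified model of the MDS push-seq check that drops a
# stale session close. Names and structure are invented for illustration.

class Session:
    def __init__(self):
        self.cap_push_seq = 1925   # bumped each time the MDS pushes caps
        self.state = "open"

def mds_send_cap_grant(session):
    """MDS pushes a cap grant, bumping the session push seq (1925 -> 1926)."""
    session.cap_push_seq += 1

def mds_handle_request_close(session, client_seq):
    """Mirror of the log line 'old push seq 1925 < 1926, dropping'."""
    if client_seq < session.cap_push_seq:
        return "dropped"           # no reply is sent; client waits forever
    session.state = "closing"
    return "closing"

s = Session()

# Client decides to unmount and sends request_close with the seq it last saw.
in_flight_close_seq = s.cap_push_seq          # 1925

# A cap grant crosses the close on the wire, bumping the MDS-side seq.
mds_send_cap_grant(s)                         # now 1926

result = mds_handle_request_close(s, in_flight_close_seq)
print(result, s.state)                        # close dropped, session still open
```

A client-side fix along these lines would be to re-send request_close with the updated seq after processing the crossing cap message, rather than waiting indefinitely on the first attempt.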