Bug #47563 (closed)

qa: kernel client closes session improperly causing eviction due to timeout

Added by Patrick Donnelly over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, kceph
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-09-20T13:43:05.420 INFO:tasks.workunit:Stopping ['suites/pjd.sh'] on client.0...
2020-09-20T13:43:06.310 DEBUG:teuthology.parallel:result is None
2020-09-20T13:43:06.435 INFO:tasks.workunit:Deleted dir /home/ubuntu/cephtest/mnt.0/client.0
2020-09-20T13:43:06.435 DEBUG:teuthology.run_tasks:Unwinding manager kclient
2020-09-20T13:43:06.447 INFO:tasks.kclient:Unmounting kernel clients...
2020-09-20T13:43:06.447 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2020-09-20T13:43:06.448 INFO:teuthology.orchestra.run:Running command with timeout 900
2020-09-20T13:43:06.448 INFO:teuthology.orchestra.run.smithi093:> sudo umount /home/ubuntu/cephtest/mnt.0
2020-09-20T13:43:51.775 INFO:teuthology.orchestra.run.smithi084:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph --log-early fs dump --format=json

From: /ceph/teuthology-archive/pdonnell-2020-09-20_07:13:47-multimds-wip-pdonnell-testing-20200920.040823-distro-basic-smithi/5451505/teuthology.log

(There is also thrashing going on, but that happens after the kernel unmount and is not related.)

Then on the MDS:

2020-09-20T13:43:06.582+0000 7f69fb0ed700  7 mds.0.locker issue_caps loner client.4713 allowed=pAsLsXsFsxcrwb, xlocker allowed=pAsLsXsFsxcrwb, others allowed=pAsLsXs on [inode 0x10000000511 [2,head] ~mds0/stray1/10000000511 auth v2088 pv2302 ap=1 dirtyparent s=1 nl=0 n(v0 rc2020-09-20T13:42:39.280144+0000 b1 1=1+0)->n(v0 rc2020-09-20T13:42:39.280144+0000 b1 1=1+0) (ifile excl w=1) (iversion lock) cr={4713=0-4194304@1} caps={4713=pFc/-@6},l=4713(-1) | ptrwaiter=1 request=0 lock=1 caps=1 truncating=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x55b1c67e5080]
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.locker  client.4713 pending pFc allowed pAsLsXsFsxcrwb wanted -
2020-09-20T13:43:06.582+0000 7f69fb0ed700  7 mds.0.locker    sending MClientCaps to client.4713 seq 7 new pending pAsLsXsFsxcrwb was pFc
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.cache.ino(0x10000000511) encode_cap_message pfile 1 pauth 0 plink 0 pxattr 0 ctime 2020-09-20T13:42:39.280144+0000
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.18 send_message_client_counted client.4713 seq 1926 client_caps(grant ino 0x10000000511 1 seq 7 caps=pAsLsXsFsxcrwb dirty=- wanted=Ls follows 0 size 1/0 ts 2/1 mtime 2020-09-20T13:42:39.237145+0000 tws 1) v11
2020-09-20T13:43:06.582+0000 7f69fb0ed700  1 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] --> v1:172.21.15.93:0/4080061763 -- client_caps(grant ino 0x10000000511 1 seq 7 caps=pAsLsXsFsxcrwb dirty=- wanted=Ls follows 0 size 1/0 ts 2/1 mtime 2020-09-20T13:42:39.237145+0000 tws 1) v11 -- 0x55b1c6718900 con 0x55b1c5084000
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.locker eval done
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] dispatch_throttle_release 236 to dispatch throttler 264/104857600
2020-09-20T13:43:06.582+0000 7f69fb0ed700  1 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] <== client.4713 v1:172.21.15.93:0/4080061763 2777 ==== client_session(request_close seq 1925) ==== 28+0+0 (unknown 1265461456 0 0) 0x55b1c6dd4e00 con 0x55b1c5084000
2020-09-20T13:43:06.582+0000 7f69fd8f2700 10 -- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] >> v1:172.21.15.93:0/4080061763 conn(0x55b1c5084000 legacy=0x55b1c5c1a000 unknown :6827 s=STATE_CONNECTION_ESTABLISHED l=0).handle_write
2020-09-20T13:43:06.582+0000 7f69fd8f2700 10 --1- [v2:172.21.15.84:6825/1405287455,v1:172.21.15.84:6827/1405287455] >> v1:172.21.15.93:0/4080061763 conn(0x55b1c5084000 0x55b1c5c1a000 :6827 s=OPENED pgs=13 cs=1 l=0).write_event
2020-09-20T13:43:06.582+0000 7f69fb0ed700 20 mds.0.18 get_session have 0x55b1c4305600 client.4713 v1:172.21.15.93:0/4080061763 state open
2020-09-20T13:43:06.582+0000 7f69fb0ed700  3 mds.0.server handle_client_session client_session(request_close seq 1925) from client.4713
2020-09-20T13:43:06.582+0000 7f69fb0ed700 10 mds.0.server old push seq 1925 < 1926, dropping

The MDS drops the close request: the racing cap grant bumped the session push seq to 1926, while the client's request_close still carried seq 1925 (a minimal sketch of this check follows the log below). This could arguably be a bug in the MDS too! The connection apparently was never shut down, so the kclient keeps waiting for the MDS to respond to the close. The thrasher then caused a failover, which appears to have allowed the kernel client to finish unmounting:

2020-09-20T13:44:06.839 INFO:tasks.mds_thrash.fs.[cephfs]:waiting till mds map indicates mds.c is laggy/crashed, in failed state, or mds.c is removed from mdsmap
2020-09-20T13:44:07.235 INFO:teuthology.orchestra.run:Running command with timeout 300
2020-09-20T13:44:07.235 INFO:teuthology.orchestra.run.smithi093:> (cd /home/ubuntu/cephtest && exec rmdir -- /home/ubuntu/cephtest/mnt.0)
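
The drop shown in the MDS log above comes down to a seq comparison: each counted message the MDS sends to a session bumps its push seq, and a close request carrying an older seq is silently discarded. The following is a minimal, self-contained C++ sketch of that check (illustrative names only, not the actual Ceph MDS source), showing how the racing cap grant makes the close request stale:

// Minimal sketch of the push-seq check behind "old push seq 1925 < 1926, dropping".
// Names are illustrative; this is not the actual Ceph MDS code.
#include <cstdint>
#include <iostream>

struct Session {
  uint64_t push_seq = 0;                   // bumped for every counted message sent to the client
  void send_counted_message() { ++push_seq; }
};

// Returns true if the close request is accepted, false if it is dropped.
bool handle_request_close(const Session &s, uint64_t client_seq) {
  if (client_seq < s.push_seq) {
    std::cout << "old push seq " << client_seq << " < " << s.push_seq
              << ", dropping\n";
    return false;                          // the client is expected to retry with the newer seq
  }
  return true;                             // safe to proceed with closing the session
}

int main() {
  Session s;
  s.push_seq = 1925;                       // client has seen everything up to 1925
  const uint64_t close_seq = 1925;         // kclient sends request_close with seq 1925 ...
  s.send_counted_message();                // ... but a racing cap grant bumps push_seq to 1926
  handle_request_close(s, close_seq);      // prints the "dropping" line; the close never happens
}

In this failure the client does not retry the close with the newer seq, which is presumably why the unmount hangs until the thrasher-induced failover (or an eviction on session timeout, per the ticket title) breaks the stall.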

Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #48439: fsstress failure with mds thrashing: "mds.0.6 Evicting (and blocklisting) client session 4564 (v1:172.21.15.47:0/603539598)" (Resolved, assigned to Jeff Layton)
