Bug #59162
openkernel completes umount without waiting for mds close
100%
Description
While debugging (probably) unrelated bugs, I observed that the MDS is unintentionally waiting for a session to reconnect because closing it was incomplete.
2023-03-23T19:28:13.323 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0... 2023-03-23T19:28:13.323 INFO:teuthology.orchestra.run:Running command with timeout 300 2023-03-23T19:28:13.324 DEBUG:teuthology.orchestra.run.smithi119:> sudo umount /home/ubuntu/cephtest/mnt.0 2023-03-23T19:28:35.291 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf 2023-03-23T19:28:35.298 DEBUG:teuthology.orchestra.run.smithi148:> sudo elogrotate /etc/logrotate.d/ceph-test.conf 2023-03-23T19:29:05.471 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf 2023-03-23T19:29:05.478 DEBUG:teuthology.orchestra.run.smithi148:> sudo logrotate /etc/logrotate.d/ceph-test.conf 2023-03-23T19:29:14.827 INFO:tasks.cephfs.mount:Cleaning up mount ubuntu@smithi119.front.sepia.ceph.com 2023-03-23T19:29:14.827 INFO:teuthology.orchestra.run:Running command with timeout 300 2023-03-23T19:29:14.828 DEBUG:teuthology.orchestra.run.smithi119:> (cd /home/ubuntu/cephtest && exec rmdir -- /home/ubuntu/cephtest/mnt.0)
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/teuthology.log
That mds should spend 0 time in up:reconnect as its session list should be empty. It does end up waiting later when its restarted.
Root cause:
2023-03-23T19:28:13.357+0000 7f76c36a2700 1 -- [v2:172.21.15.119:6836/2486776166,v1:172.21.15.119:6837/2486776166] <== client.6707 v1:172.21.15.119:0/4076911234 41 ==== client_session(request_close seq 9) ==== 28+0+0 (unknown 2629451164 0 0) 0x558caa874c40 con 0x558caa6c9800 2023-03-23T19:28:13.357+0000 7f76c36a2700 20 mds.0.82 get_session have 0x558ca9883200 client.6707 v1:172.21.15.119:0/4076911234 state open 2023-03-23T19:28:13.357+0000 7f76c36a2700 3 mds.0.server handle_client_session client_session(request_close seq 9) from client.6707 2023-03-23T19:28:13.357+0000 7f76c36a2700 10 mds.0.server old push seq 9 < 10, dropping
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/log/ceph-mds.c.log.gz
the kernel never sends a new request_close with seq==10. The dmesg output is not helpful:
2023-03-23T19:28:10.043732+00:00 smithi119 kernel: [ 1201.283891] libceph: mon2 (1)172.21.15.148:6790 session established 2023-03-23T19:28:10.047724+00:00 smithi119 kernel: [ 1201.285437] libceph: client6707 fsid 84eed8f0-55d2-47bf-9a48-676033fb7934
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/syslog/kern.log.gz
My cursory look at the kernel code indicates this should not happen: the kernel will resend the close request if the seq is increased on a subsequent message. So something else is going on.