Bug #59162
kclient completes umount without waiting for MDS close
Description
While debugging (probably) unrelated bugs, I observed that the MDS was unexpectedly waiting for a session to reconnect because the session close never completed.
2023-03-23T19:28:13.323 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2023-03-23T19:28:13.323 INFO:teuthology.orchestra.run:Running command with timeout 300
2023-03-23T19:28:13.324 DEBUG:teuthology.orchestra.run.smithi119:> sudo umount /home/ubuntu/cephtest/mnt.0
2023-03-23T19:28:35.291 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:28:35.298 DEBUG:teuthology.orchestra.run.smithi148:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:05.471 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:05.478 DEBUG:teuthology.orchestra.run.smithi148:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:14.827 INFO:tasks.cephfs.mount:Cleaning up mount ubuntu@smithi119.front.sepia.ceph.com
2023-03-23T19:29:14.827 INFO:teuthology.orchestra.run:Running command with timeout 300
2023-03-23T19:29:14.828 DEBUG:teuthology.orchestra.run.smithi119:> (cd /home/ubuntu/cephtest && exec rmdir -- /home/ubuntu/cephtest/mnt.0)
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/teuthology.log
That MDS should spend zero time in up:reconnect, since its session list should be empty. It does end up waiting later when it's restarted.
Root cause:
2023-03-23T19:28:13.357+0000 7f76c36a2700  1 -- [v2:172.21.15.119:6836/2486776166,v1:172.21.15.119:6837/2486776166] <== client.6707 v1:172.21.15.119:0/4076911234 41 ==== client_session(request_close seq 9) ==== 28+0+0 (unknown 2629451164 0 0) 0x558caa874c40 con 0x558caa6c9800
2023-03-23T19:28:13.357+0000 7f76c36a2700 20 mds.0.82 get_session have 0x558ca9883200 client.6707 v1:172.21.15.119:0/4076911234 state open
2023-03-23T19:28:13.357+0000 7f76c36a2700  3 mds.0.server handle_client_session client_session(request_close seq 9) from client.6707
2023-03-23T19:28:13.357+0000 7f76c36a2700 10 mds.0.server old push seq 9 < 10, dropping
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/log/ceph-mds.c.log.gz
The kernel never sends a new request_close with seq==10. The dmesg output is not helpful:
2023-03-23T19:28:10.043732+00:00 smithi119 kernel: [ 1201.283891] libceph: mon2 (1)172.21.15.148:6790 session established
2023-03-23T19:28:10.047724+00:00 smithi119 kernel: [ 1201.285437] libceph: client6707 fsid 84eed8f0-55d2-47bf-9a48-676033fb7934
/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/syslog/kern.log.gz
My cursory look at the kernel code indicates this should not happen: the kernel will resend the close request if the seq is increased on a subsequent message. So something else is going on.
Updated by Patrick Donnelly about 1 year ago
I believe this is caused by "ceph: drop the messages from MDS when unmounting" in the testing branch.
Updated by Xiubo Li about 1 year ago
In the kclient we need to increase the session's seq just before dropping the `cap grant` message.
Updated by Xiubo Li about 1 year ago
Patrick,
I am wondering whether we should also fix the code for the case when the kclient receives a corrupted msg, which would likewise skip increasing the seq and resending the close request. IMO even if the msg is corrupted, the MDS side should have increased the seq already, and on the client side we should increase it anyway.
This is unrelated to the "ceph: drop the messages from MDS when unmounting" commit; this is an issue that has existed for a long time.
Updated by Patrick Donnelly about 1 year ago
Xiubo Li wrote:
Patrick,
I am wondering whether we should also fix the code for the case when the kclient receives a corrupted msg, which would likewise skip increasing the seq and resending the close request.
Like a legit corrupt message that cannot be decoded? If a message cannot be decoded, we routinely drop it in Ceph. I would expect the kernel should have some sort of "stop all i/o and render the mount unusable" if that happens. I don't think that's necessarily relevant for this ticket though.
IMO even if the msg is corrupted, the MDS side should have increased the seq already, and on the client side we should increase it anyway.
The MDS did increase the seq and send a new cap message. The problem is the kernel didn't resend the close request after seeing the new seq bump. Even more, the kernel completed unmount without waiting for the CEPH_SESSION_CLOSE message from the mds (no --force flag was used!).
This is unrelated to the "ceph: drop the messages from MDS when unmounting" commit; this is an issue that has existed for a long time.
Okay, I'm surprised we haven't seen this before then...
Updated by Patrick Donnelly about 1 year ago
- Related to Bug #59163: mds: stuck in up:rejoin when it cannot "open" missing directory inode added
Updated by Xiubo Li about 1 year ago
Patrick Donnelly wrote:
Xiubo Li wrote:
Patrick,
I am wondering whether we should also fix the code for the case when the kclient receives a corrupted msg, which would likewise skip increasing the seq and resending the close request.
Like a legit corrupt message that cannot be decoded? If a message cannot be decoded, we routinely drop it in Ceph. I would expect the kernel should have some sort of "stop all i/o and render the mount unusable" if that happens. I don't think that's necessarily relevant for this ticket though.
Yeah, in some cases the kclient will drop corrupted messages, while in other cases we will mark the whole mount read-only. It depends.
IMO even if the msg is corrupted, the MDS side should have increased the seq already, and on the client side we should increase it anyway.
The MDS did increase the seq and send a new cap message. The problem is the kernel didn't resend the close request after seeing the new seq bump. Even more, the kernel completed unmount without waiting for the CEPH_SESSION_CLOSE message from the mds (no --force flag was used!).
No, the kclient will wait for the CEPH_SESSION_CLOSE messages from the MDSs for 60 seconds by default. After that it will tear down the sessions anyway.
The "ceph: drop the messages from MDS when unmounting" commit tries to fix a reference leak bug, which could cause a use-after-free crash in fs/crypto/ for the fscrypt feature.
And when unmounting, after all the in-flight IOs have finished and all requests have gotten their replies from the MDSs/OSDs, the MDS may continue sending new cap messages to the kclient, which no longer make sense. This commit tries to drop them, but it forgot to resend the close session request before dropping them. I have already fixed this in [1].
This is unrelated to the "ceph: drop the messages from MDS when unmounting" commit; this is an issue that has existed for a long time.
Okay, I'm surprised we haven't seen this before then...
Because it's a corner case: only when unmounting will the kclient receive a corrupted msg like this. Months ago Venky and I hit an issue where the MDS sent a msg to the kclient and the kclient couldn't decode it correctly due to a version mismatch, so the kclient dropped that msg directly.
In this case the MDS side has increased the seq anyway.
[1] https://patchwork.kernel.org/project/ceph-devel/list/?series=734003
Updated by Xiubo Li about 1 year ago
- Status changed from New to Fix Under Review