Project

General

Profile

Actions

Bug #59162

open

kernel completes umount without waiting for mds close

Added by Patrick Donnelly about 1 year ago. Updated about 1 year ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

100%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

While debugging (probably) unrelated bugs, I observed that the MDS is unintentionally waiting for a session to reconnect because closing it was incomplete.

2023-03-23T19:28:13.323 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2023-03-23T19:28:13.323 INFO:teuthology.orchestra.run:Running command with timeout 300
2023-03-23T19:28:13.324 DEBUG:teuthology.orchestra.run.smithi119:> sudo umount /home/ubuntu/cephtest/mnt.0
2023-03-23T19:28:35.291 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:28:35.298 DEBUG:teuthology.orchestra.run.smithi148:> sudo elogrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:05.471 DEBUG:teuthology.orchestra.run.smithi119:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:05.478 DEBUG:teuthology.orchestra.run.smithi148:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2023-03-23T19:29:14.827 INFO:tasks.cephfs.mount:Cleaning up mount ubuntu@smithi119.front.sepia.ceph.com
2023-03-23T19:29:14.827 INFO:teuthology.orchestra.run:Running command with timeout 300
2023-03-23T19:29:14.828 DEBUG:teuthology.orchestra.run.smithi119:> (cd /home/ubuntu/cephtest && exec rmdir -- /home/ubuntu/cephtest/mnt.0)

/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/teuthology.log

That mds should spend 0 time in up:reconnect as its session list should be empty. It does end up waiting later when its restarted.

Root cause:

2023-03-23T19:28:13.357+0000 7f76c36a2700  1 -- [v2:172.21.15.119:6836/2486776166,v1:172.21.15.119:6837/2486776166] <== client.6707 v1:172.21.15.119:0/4076911234 41 ==== client_session(request_close seq 9) ==== 28+0+0 (unknown 2629451164 0 0) 0x558caa874c40 con 0x558caa6c9800                                                                                                          
2023-03-23T19:28:13.357+0000 7f76c36a2700 20 mds.0.82 get_session have 0x558ca9883200 client.6707 v1:172.21.15.119:0/4076911234 state open
2023-03-23T19:28:13.357+0000 7f76c36a2700  3 mds.0.server handle_client_session client_session(request_close seq 9) from client.6707
2023-03-23T19:28:13.357+0000 7f76c36a2700 10 mds.0.server old push seq 9 < 10, dropping

/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/log/ceph-mds.c.log.gz

the kernel never sends a new request_close with seq==10. The dmesg output is not helpful:

2023-03-23T19:28:10.043732+00:00 smithi119 kernel: [ 1201.283891] libceph: mon2 (1)172.21.15.148:6790 session established
2023-03-23T19:28:10.047724+00:00 smithi119 kernel: [ 1201.285437] libceph: client6707 fsid 84eed8f0-55d2-47bf-9a48-676033fb7934

/teuthology/pdonnell-2023-03-23_18:44:22-fs-wip-pdonnell-testing-20230323.162417-distro-default-smithi/7217902/remote/smithi119/syslog/kern.log.gz

My cursory look at the kernel code indicates this should not happen: the kernel will resend the close request if the seq is increased on a subsequent message. So something else is going on.


Subtasks 1 (0 open1 closed)

Bug #58947: client: didn't send the reconnect to mdsDuplicateXiubo Li

Actions

Related issues 1 (1 open0 closed)

Related to CephFS - Bug #59163: mds: stuck in up:rejoin when it cannot "open" missing directory inodeNewVenky Shankar

Actions
Actions

Also available in: Atom PDF