Feature #59601
Status: open
Provide way to abort kernel mount after lazy umount
Description
In some situations, e.g. when changing monitor IPs during an emergency network reconfiguration, CephFS kernel mounts (kclient) can get stuck trying to talk to IPs that do not speak Ceph.
In `dmesg` it can look like this:
[312094.140219] libceph: osd1 (2)10.0.0.1:6800 read processing error
[312094.650106] libceph: mon1 (2)10.0.0.1:3300 socket closed (con state V2_BANNER_PREFIX)
[312099.771370] libceph: auth protocol 'cephx' authorization to mds failed: -13
For this or other reasons, the user may need to use `umount --lazy` / `umount -l` to get their mount point back and start a new mount.
However, the old mount is still going on in the background, and apparently cannot be ejected.
For example
/sys/kernel/debug/ceph/478b062f-a6e4-4ddf-96e0-7cdad91816e4.client38260/
still exists, and may show ops such as:
# cat /sys/kernel/debug/ceph/478b062f-a6e4-4ddf-96e0-7cdad91816e4.client38260/osdc
REQUESTS 1 homeless 0
910 osd1 1.671086fe 1.3e [1,0,2]/1 [1,0,2]/1 e254 100000002c3.00000000 0x400024 1 write
LINGER REQUESTS
BACKOFFS
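To spot such leftover client instances without inspecting each directory by hand, something like the following sketch could be used. It assumes that `osdc` lists one entry per line under the section headers `REQUESTS`, `LINGER REQUESTS` and `BACKOFFS`, and that debugfs is mounted at `/sys/kernel/debug` and readable (normally root only). The function names are made up for illustration; this only reports what is pending, it does not abort anything.

```python
from pathlib import Path

# Assumption: debugfs is mounted here; reading it normally requires root.
DEBUGFS_CEPH = Path("/sys/kernel/debug/ceph")


def parse_osdc(text):
    """Return the in-flight request lines from the REQUESTS section of an osdc dump."""
    requests = []
    in_requests = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("REQUESTS"):
            # Header line, e.g. "REQUESTS 1 homeless 0"
            in_requests = True
            continue
        if line.startswith(("LINGER REQUESTS", "BACKOFFS")):
            in_requests = False
            continue
        if in_requests:
            requests.append(line)
    return requests


def list_client_instances(root=DEBUGFS_CEPH):
    """Map each <fsid>.client<id> directory name to its pending OSD requests."""
    result = {}
    if not root.is_dir():
        return result
    for entry in sorted(root.iterdir()):
        osdc = entry / "osdc"
        if entry.is_dir() and osdc.is_file():
            result[entry.name] = parse_osdc(osdc.read_text())
    return result


if __name__ == "__main__":
    for name, pending in list_client_instances().items():
        print(f"{name}: {len(pending)} pending OSD request(s)")
```

A client instance that is still listed here after its mount point has been lazily unmounted is the kind of detached mount this ticket is about.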
There should be a way for admins to force-eject such detached mounts.
Based on this thread https://www.spinics.net/lists/ceph-devel/msg00555.html it seemed that developers agreed:
Sage Weil:
'umount -l' should do a lazy unmount (detach from namespace), but the actual unmount code may currently hang.
and further it is discussed that there should be a configurable timeout, or an admin hook to kick out the mount.
In this ticket I'm asking for the latter.
(If this already exists, please consider it a documentation bug; it should be added on e.g. https://docs.ceph.com/en/quincy/cephfs/mount-using-kernel-driver/#unmounting-cephfs)
Updated by Venky Shankar 12 months ago
Niklas Hambuechen wrote:
In some situations, e.g. when changing monitor IPs during an emergency network reconfiguration, CephFS kernel mounts (kclient) can get stuck trying to talk to IPs that do not speak Ceph.
In `dmesg` it can look like this:
[...]
For this or other reasons, the user may need to use `umount --lazy` / `umount -l` to get their mount point back and start a new mount.
However, the old mount is still going on in the background, and apparently cannot be ejected.
For example
[...]
still exists, and may show ops such as:
[...]
There should be a way for admins to force-eject such detached mounts.
Based on this thread https://www.spinics.net/lists/ceph-devel/msg00555.html it seemed that developers agreed:
Sage Weil:
'umount -l' should do a lazy unmount (detach from namespace), but the actual unmount code may currently hang.
and further it is discussed that there should be a configurable timeout, or an admin hook to kick out the mount.
In this ticket I'm asking for the latter.
(If this already exists, please consider it a documentation bug; it should be added on e.g. https://docs.ceph.com/en/quincy/cephfs/mount-using-kernel-driver/#unmounting-cephfs)
Have you tried force unmounting the mount (`umount -f`)?
Updated by Niklas Hambuechen 12 months ago
Venky Shankar wrote:
Have you tried force unmounting the mount (`umount -f`)?
After umount --lazy, the mount point is already empty.
In that case, what could I run umount -f on?
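To illustrate the problem: after a lazy unmount the mount is detached from the namespace, so it should no longer show up in `/proc/self/mountinfo`, and there is no path left to pass to `umount -f`. The following is a minimal sketch for listing the CephFS kernel mounts that are still visible; the helper name `ceph_mounts` is hypothetical, and it assumes the kernel client reports the filesystem type `ceph`.

```python
def ceph_mounts(mountinfo_text):
    """Return (mount_point, source) pairs for filesystems of type 'ceph'.

    Expects text in the /proc/<pid>/mountinfo format, where a " - "
    separator precedes the filesystem type and the mount source.
    """
    mounts = []
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        left_fields = left.split()
        right_fields = right.split()
        if len(left_fields) < 5 or len(right_fields) < 2:
            continue
        fstype, source = right_fields[0], right_fields[1]
        if fstype == "ceph":
            # Field 5 (index 4) of the left-hand part is the mount point.
            mounts.append((left_fields[4], source))
    return mounts


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as f:
        for mount_point, source in ceph_mounts(f.read()):
            print(f"{mount_point} <- {source}")
```

If the old mount is absent from this list while its directory under `/sys/kernel/debug/ceph/` still exists, it has been detached but not torn down.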
Updated by Venky Shankar 12 months ago
Niklas Hambuechen wrote:
Venky Shankar wrote:
Have you tried force unmounting the mount (`umount -f`)?
After umount --lazy, the mount point is already empty.
In that case, what could I run umount -f on?
What I meant was to use `umount -f` instead of lazy unmount. Since a bunch of addresses changed and the IOs are not going anywhere,