Bug #59533
Ceph hangs itself when restarting processes with hung CephFS mount
Description
We have a "hyperconverged" 3-node cluster where the storage machines (which run the OSDs, MONs, and MGRs) also have a CephFS mounted that is backed by those same daemons, so that they can do some data-processing work.
In such a setup, I observed that it is possible to get a permanent hang:
- CephFS gets stuck in some way, e.g. due to temporary network issue, or resource exhaustion such as OOM.
- Some or all Ceph processes get restarted, e.g. an OSD and the MONs get restarted due to OOM.
- The OSD tries to get some info from the MON in order to start.
- The MON does something like the equivalent of a global `df`, which hangs because the mounted CephFS is stuck.
- We are now in deadlock.
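To illustrate the failure mode (this is not Ceph's actual code): any startup probe that calls `statfs(2)`/`statvfs(3)` on a path under a hung FUSE mount blocks in uninterruptible sleep. A minimal sketch of how such a probe could be made non-blocking, in Python for brevity, with the `statvfs_with_timeout` helper name being my own:

```python
import concurrent.futures
import os

def statvfs_with_timeout(path, timeout=5.0):
    """Call os.statvfs(path) in a worker thread and give up after
    `timeout` seconds, so a hung FUSE mount cannot block the caller
    forever. Note: the worker thread itself may stay stuck in the
    kernel; this only keeps the *calling* thread responsive."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.statvfs, path)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return None  # treat the mount as unavailable
    finally:
        pool.shutdown(wait=False)  # do not join the possibly-stuck worker
```

A daemon using something like this at startup could skip or defer an unavailable mount instead of deadlocking.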
Using Ceph 16.2.7.
I captured such a state below:
Example `dmesg` output when a `ceph-fuse` gets stuck:
```
[113787.908199] INFO: task /nix/store/wrw3:1076986 blocked for more than 122 seconds.
[113787.909935]       Tainted: G           O      5.10.81 #1-NixOS
[113787.911231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[113787.912792] task:/nix/store/wrw3 state:D stack:    0 pid:1076986 ppid:     1 flags:0x00004004
[113787.912796] Call Trace:
[113787.912806]  __schedule+0x217/0x830
[113787.912812]  schedule+0x46/0xb0
[113787.912818]  request_wait_answer+0x137/0x210 [fuse]
[113787.912822]  ? wait_woken+0x80/0x80
[113787.912825]  fuse_simple_request+0x1a1/0x310 [fuse]
[113787.912828]  fuse_lookup_name+0xf2/0x210 [fuse]
[113787.912831]  fuse_lookup+0x66/0x190 [fuse]
[113787.912836]  __lookup_hash+0x6c/0xa0
[113787.912838]  filename_create+0x91/0x160
[113787.912840]  do_mkdirat+0x57/0x150
[113787.912844]  do_syscall_64+0x33/0x40
[113787.912848]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
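Stuck tasks like the one above can also be enumerated without waiting for the kernel's hung-task detector: anything in uninterruptible sleep (state D) is a candidate. A small diagnostic one-liner (plain procps `ps`, nothing Ceph-specific):

```shell
# Print the header plus every task in uninterruptible sleep (state D),
# the state that hung CephFS/FUSE operations end up in. Adding a
# wchan:32 output column also shows the kernel function each task is
# blocked in, e.g. request_wait_answer for FUSE.
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^D/'
```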
`strace` of a `ceph-osd` stuck:
```
# strace -fyp 1176939
strace: Process 1176939 attached with 8 threads
[pid 1176946] futex(0x7fa0e5b46f00, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176945] futex(0x7fa0e5b46d84, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176944] epoll_wait(9<anon_inode:[eventpoll]>, <unfinished ...>
[pid 1176943] epoll_wait(6<anon_inode:[eventpoll]>, <unfinished ...>
[pid 1176942] epoll_wait(3<anon_inode:[eventpoll]>, <unfinished ...>
[pid 1176941] futex(0x7fa0e5a74698, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176940] futex(0x7fa0e5a63870, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176939] restart_syscall(<... resuming interrupted read ...>
```
The OSD's FDs:
```
# ls -la /proc/1176939/fd
total 0
dr-x------ 2 ceph ceph  0 Apr 25 00:34 .
dr-xr-xr-x 9 ceph ceph  0 Apr 25 00:34 ..
lr-x------ 1 ceph ceph 64 Apr 25 00:45 0 -> /dev/null
lrwx------ 1 ceph ceph 64 Apr 25 00:45 1 -> 'socket:[4963967]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 10 -> 'pipe:[4963978]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 11 -> 'pipe:[4963978]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 2 -> 'socket:[4963967]'
lrwx------ 1 ceph ceph 64 Apr 25 00:34 3 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:34 4 -> 'pipe:[4963976]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 5 -> 'pipe:[4963976]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 6 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 7 -> 'pipe:[4963977]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 8 -> 'pipe:[4963977]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 9 -> 'anon_inode:[eventpoll]'
```
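To confirm who is on the other end of those pipes, the inode number in `pipe:[...]` can be matched across all of `/proc`. A sketch (the `INODE` value is one of the pipe inodes from the listing above; substitute as needed):

```shell
# Find every process holding a file descriptor on a given pipe/socket
# inode -- i.e. the peer on the other end of the OSD's pipe.
# 4963978 is one of the pipe inodes from the fd listing above.
INODE=4963978
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null)
    case "$target" in
        *"[$INODE]"*) echo "$fd -> $target" ;;
    esac
done
```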
Here perhaps the OSD itself does not do anything that can block (such as a `df`-equivalent call), but instead talks via pipe or socket to a process that does (e.g. the MON).
From my understanding, while Ceph recommends keeping MONs, OSDs, and clients on separate machines, Ceph also wants to support hyperconverged setups.
So the issue is:
Is there anything in the startup path of any of the daemons that does the equivalent of an unconstrained `df` (or any other global filesystem operation that can block while a CephFS mount is down), which could cause this?
If yes, can it be removed?