Bug #59533

open

Ceph hangs itself when restarting processes with hung CephFS mount

Added by Niklas Hambuechen about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a "hyperconverged" 3-node cluster where the storage machines (which run the OSDs, MONs, and MGRs) also have a CephFS from that same cluster mounted, so that they can do some data processing work.

In such a setup, I observed that it is possible to get a permanent hang:

  • CephFS gets stuck in some way, e.g. due to a temporary network issue, or resource exhaustion such as OOM.
  • Some or all Ceph processes get restarted; say an OSD and the MONs get restarted due to OOM.
  • The OSD tries to get some info from the MON in order to start.
  • The MON does something like the equivalent of a global `df`, which hangs because the mounted CephFS is stuck (see the sketch after this list).
  • We are now in a deadlock.
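To make the "global `df`" step concrete, here is a minimal C sketch (not from our deployment; the mount path `/mnt/cephfs` is an illustrative assumption) of how a single `statvfs()` against a stuck ceph-fuse mountpoint blocks indefinitely:

```
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs st;

    /* With the ceph-fuse daemon behind /mnt/cephfs hung, this call does not
     * return; the caller sits in uninterruptible sleep, just like a df that
     * walks all mounts would. */
    if (statvfs("/mnt/cephfs", &st) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("free blocks: %lu\n", (unsigned long)st.f_bfree);
    return 0;
}
```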

Using Ceph 16.2.7.

I captured such a state below:

Example `dmesg` output when a `ceph-fuse` gets stuck:

```
[113787.908199] INFO: task /nix/store/wrw3:1076986 blocked for more than 122 seconds.
[113787.909935]       Tainted: G           O      5.10.81 #1-NixOS
[113787.911231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[113787.912792] task:/nix/store/wrw3 state:D stack:    0 pid:1076986 ppid:     1 flags:0x00004004
[113787.912796] Call Trace:
[113787.912806]  __schedule+0x217/0x830
[113787.912812]  schedule+0x46/0xb0
[113787.912818]  request_wait_answer+0x137/0x210 [fuse]
[113787.912822]  ? wait_woken+0x80/0x80
[113787.912825]  fuse_simple_request+0x1a1/0x310 [fuse]
[113787.912828]  fuse_lookup_name+0xf2/0x210 [fuse]
[113787.912831]  fuse_lookup+0x66/0x190 [fuse]
[113787.912836]  __lookup_hash+0x6c/0xa0
[113787.912838]  filename_create+0x91/0x160
[113787.912840]  do_mkdirat+0x57/0x150
[113787.912844]  do_syscall_64+0x33/0x40
[113787.912848]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
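For reference, the trace above ends in `do_mkdirat` → `fuse_lookup` → `request_wait_answer`, i.e. a plain `mkdir()` under the stuck mount. A hypothetical reproducer (the path is again an assumption) hangs the same way once the ceph-fuse daemon stops answering:

```
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    /* With ceph-fuse wedged, this mkdir() blocks inside the kernel's fuse
     * module (request_wait_answer) and the task shows up as state D in a
     * hung-task report like the one above. */
    if (mkdir("/mnt/cephfs/probe-dir", 0755) != 0) {
        perror("mkdir");
        return 1;
    }
    return 0;
}
```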

`strace` of a `ceph-osd` stuck:

```
# strace -fyp 1176939
strace: Process 1176939 attached with 8 threads
[pid 1176946] futex(0x7fa0e5b46f00, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176945] futex(0x7fa0e5b46d84, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176944] epoll_wait(9<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176943] epoll_wait(6<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176942] epoll_wait(3<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176941] futex(0x7fa0e5a74698, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176940] futex(0x7fa0e5a63870, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176939] restart_syscall(<... resuming interrupted read ...>
```

The OSD's FDs:

```
# d /proc/1176939/fd
total 0
dr-x------ 2 ceph ceph  0 Apr 25 00:34 .
dr-xr-xr-x 9 ceph ceph  0 Apr 25 00:34 ..
lr-x------ 1 ceph ceph 64 Apr 25 00:45 0 -> /dev/null
lrwx------ 1 ceph ceph 64 Apr 25 00:45 1 -> 'socket:[4963967]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 10 -> 'pipe:[4963978]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 11 -> 'pipe:[4963978]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 2 -> 'socket:[4963967]'
lrwx------ 1 ceph ceph 64 Apr 25 00:34 3 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:34 4 -> 'pipe:[4963976]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 5 -> 'pipe:[4963976]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 6 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 7 -> 'pipe:[4963977]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 8 -> 'pipe:[4963977]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 9 -> 'anon_inode:[eventpoll]'
```

Here the OSD itself perhaps does not perform the blocking operation (such as `df`), but talks via pipe or socket to a process that does (e.g. the MON).

From my understanding, while Ceph recommends keeping MONs, OSDs, and clients separate, Ceph wants to enable hyperconverged setups as well.

So the issue is:

Is there anything in the startup path of any of the daemons that does the equivalent of an unconstrained `df` (or any other global FS operation that can block when a CephFS mount is down) and could therefore cause this deadlock?

If yes, can it be removed?
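To illustrate what "constrained" could mean here, a rough sketch (the helper names, the mount path, and the 5-second timeout are my own assumptions, not Ceph code) of probing a mount from a detached helper thread with a timeout, so that a dead CephFS mount cannot wedge daemon startup:

```
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>
#include <time.h>

struct probe {
    const char *path;
    atomic_int done;
};

/* Worker: may block forever on a dead mount, so it runs detached. */
static void *probe_fn(void *arg)
{
    struct probe *p = arg;
    struct statvfs st;

    statvfs(p->path, &st);
    atomic_store(&p->done, 1);
    return NULL;
}

/* Returns 1 if the mount answered within timeout_s seconds, 0 otherwise. */
static int mount_is_responsive(const char *path, int timeout_s)
{
    struct probe *p = calloc(1, sizeof(*p));
    pthread_t tid;
    struct timespec poll = { 0, 100 * 1000 * 1000 };  /* 100 ms */

    if (!p)
        return 0;
    p->path = path;

    if (pthread_create(&tid, NULL, probe_fn, p) != 0) {
        free(p);
        return 0;
    }
    pthread_detach(tid);

    for (int i = 0; i < timeout_s * 10; i++) {
        if (atomic_load(&p->done)) {
            free(p);
            return 1;
        }
        nanosleep(&poll, NULL);
    }
    /* Deliberately leak p: the stuck worker thread may still write to it. */
    return 0;
}

int main(void)
{
    /* /mnt/cephfs is an illustrative path. */
    if (!mount_is_responsive("/mnt/cephfs", 5))
        fprintf(stderr, "CephFS mount unresponsive, skipping it\n");
    return 0;
}
```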
