Bug #59533 (open)

Ceph hangs itself when restarting processes with hung CephFS mount

Added by Niklas Hambuechen about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a "hyperconverged" 3-node cluster where the storage (which run OSDs, MONs, MGRs) machines also have a CephFS based on the previous mounted, so that they can do some data processing work.

In such a setup, I observed that it is possible to get a permanent hang:

  • CephFS gets stuck in some way, e.g. due to a temporary network issue, or resource exhaustion such as OOM.
  • Some or all Ceph processes get restarted, e.g. an OSD and the MONs get restarted due to OOM.
  • The OSD tries to get some info from the MON in order to start.
  • The MON does something like the equivalent of a global `df`, which hangs because the mounted CephFS is stuck (see the sketch right after this list).
  • We are now in a deadlock.
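
For illustration, here is a minimal sketch (not Ceph code; the mountpoint path is an assumption) of why a `df`-equivalent wedges: a `statvfs()` against a hung FUSE mount never returns, and the calling thread sits in uninterruptible sleep until the fuse daemon answers or the mount is forcibly unmounted.

```python
#!/usr/bin/env python3
# Sketch only: demonstrate that a df-equivalent (statvfs) against a hung
# FUSE/CephFS mountpoint blocks indefinitely. The mountpoint is an assumed
# example path; adjust for your setup.
import multiprocessing
import os

MOUNTPOINT = "/mnt/cephfs"  # assumed CephFS mountpoint


def probe(path: str) -> None:
    # statvfs() is what `df` issues per filesystem; on a stuck fuse mount the
    # kernel parks the caller in request_wait_answer() (state D), just like
    # the mkdir in the dmesg trace further down.
    os.statvfs(path)


if __name__ == "__main__":
    p = multiprocessing.Process(target=probe, args=(MOUNTPOINT,))
    p.start()
    p.join(timeout=10)
    if p.is_alive():
        print(f"statvfs({MOUNTPOINT}) did not return within 10s -> mount looks stuck")
        # Note: the child cannot be killed while it is in state D; it only
        # exits once the fuse daemon answers or the mount is force-unmounted.
```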

Using Ceph 16.2.7.

I captured such a state below:

Example `dmesg` output when a `ceph-fuse` gets stuck:

[113787.908199] INFO: task /nix/store/wrw3:1076986 blocked for more than 122 seconds.
[113787.909935]       Tainted: G           O      5.10.81 #1-NixOS
[113787.911231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[113787.912792] task:/nix/store/wrw3 state:D stack:    0 pid:1076986 ppid:     1 flags:0x00004004
[113787.912796] Call Trace:
[113787.912806]  __schedule+0x217/0x830
[113787.912812]  schedule+0x46/0xb0
[113787.912818]  request_wait_answer+0x137/0x210 [fuse]
[113787.912822]  ? wait_woken+0x80/0x80
[113787.912825]  fuse_simple_request+0x1a1/0x310 [fuse]
[113787.912828]  fuse_lookup_name+0xf2/0x210 [fuse]
[113787.912831]  fuse_lookup+0x66/0x190 [fuse]
[113787.912836]  __lookup_hash+0x6c/0xa0
[113787.912838]  filename_create+0x91/0x160
[113787.912840]  do_mkdirat+0x57/0x150
[113787.912844]  do_syscall_64+0x33/0x40
[113787.912848]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

`strace` of a `ceph-osd` stuck:

# strace -fyp 1176939                 
strace: Process 1176939 attached with 8 threads
[pid 1176946] futex(0x7fa0e5b46f00, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176945] futex(0x7fa0e5b46d84, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176944] epoll_wait(9<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176943] epoll_wait(6<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176942] epoll_wait(3<anon_inode:[eventpoll]>,  <unfinished ...>
[pid 1176941] futex(0x7fa0e5a74698, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176940] futex(0x7fa0e5a63870, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 1176939] restart_syscall(<... resuming interrupted read ...>
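
The strace only shows the threads waiting on futexes and epoll; to see whether any thread is actually parked inside fuse (like the mkdir in the dmesg trace above), one can dump each thread's state and kernel wait channel. A small diagnostic sketch (the PID is just the example from above; run as root):

```python
#!/usr/bin/env python3
# Diagnostic sketch: print state and kernel wait channel (wchan) for every
# thread of a process, e.g. the stuck ceph-osd.
import os
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "1176939"  # example PID from above

for tid in sorted(os.listdir(f"/proc/{pid}/task"), key=int):
    try:
        with open(f"/proc/{pid}/task/{tid}/status") as f:
            state = next(line for line in f if line.startswith("State:")).split()[1]
        with open(f"/proc/{pid}/task/{tid}/wchan") as f:
            wchan = f.read().strip() or "-"
        print(f"tid {tid}: state={state} wchan={wchan}")
    except OSError:
        continue  # thread exited while iterating
```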

The OSD's FDs:

# d /proc/1176939/fd
total 0
dr-x------ 2 ceph ceph  0 Apr 25 00:34 .
dr-xr-xr-x 9 ceph ceph  0 Apr 25 00:34 ..
lr-x------ 1 ceph ceph 64 Apr 25 00:45 0 -> /dev/null
lrwx------ 1 ceph ceph 64 Apr 25 00:45 1 -> 'socket:[4963967]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 10 -> 'pipe:[4963978]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 11 -> 'pipe:[4963978]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 2 -> 'socket:[4963967]'
lrwx------ 1 ceph ceph 64 Apr 25 00:34 3 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:34 4 -> 'pipe:[4963976]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 5 -> 'pipe:[4963976]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 6 -> 'anon_inode:[eventpoll]'
lr-x------ 1 ceph ceph 64 Apr 25 00:45 7 -> 'pipe:[4963977]'
l-wx------ 1 ceph ceph 64 Apr 25 00:45 8 -> 'pipe:[4963977]'
lrwx------ 1 ceph ceph 64 Apr 25 00:45 9 -> 'anon_inode:[eventpoll]'

Perhaps the OSD does not itself perform an operation that can block (such as `df`), but instead talks via pipe or socket to a process that does (e.g. the MON).
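
To check that hypothesis, one could look up which other processes hold the same pipe inodes as the OSD. A sketch (not part of Ceph; run as root; the inode is taken from the fd listing above):

```python
#!/usr/bin/env python3
# Sketch: given a pipe inode (e.g. 4963978 from the fd listing above), find
# every (pid, fd) on the system that points at the same pipe, to identify
# which local process the stuck OSD might be waiting on.
import os
import sys


def pipe_peers(inode: str):
    """Yield (pid, fd) pairs whose fd symlink reads 'pipe:[<inode>]'."""
    target = f"pipe:[{inode}]"
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process gone or permission denied
        for fd in fds:
            try:
                if os.readlink(f"{fd_dir}/{fd}") == target:
                    yield int(pid), int(fd)
            except OSError:
                continue


if __name__ == "__main__":
    # e.g. ./pipe_peers.py 4963978
    for pid, fd in pipe_peers(sys.argv[1]):
        with open(f"/proc/{pid}/comm") as f:
            print(pid, f.read().strip(), "fd", fd)
```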

From my understanding, while Ceph recommends keeping MONs, OSDs, and clients separate, it also aims to support hyperconverged setups.

So the issue is:

Is there anything in the startup path of any of the daemons that does the equivalent of an unconstrained `df` (or any other global filesystem operation that can block when a CephFS mount is down) and could therefore cause this?

If yes, can it be removed?
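
One way to check this from the outside (the invocation below is just an assumed example; adjust IDs and paths) would be to run a daemon's startup under strace, tracing only statfs-family syscalls, and look for calls that touch paths under the CephFS mountpoint:

```python
#!/usr/bin/env python3
# Sketch: trace only statfs-family syscalls during daemon startup, then
# inspect the trace for paths under the CephFS mount. The ceph-osd
# invocation below is an assumed example.
import subprocess

TRACE_FILE = "/tmp/ceph-osd-statfs.trace"
cmd = [
    "strace", "-f", "-e", "trace=statfs,fstatfs", "-o", TRACE_FILE,
    "ceph-osd", "-f", "--id", "0",  # assumed: run osd.0 in the foreground
]

try:
    # Let the daemon run through its startup path, then stop tracing.
    subprocess.run(cmd, timeout=120)
except subprocess.TimeoutExpired:
    pass

print(f"inspect {TRACE_FILE} for statfs() calls on paths under the CephFS mount")
```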

#1

Updated by Niklas Hambuechen about 1 year ago

For completeness, the last lines of the stuck OSD's log in "/var/log/ceph/ceph-osd.0.log":

2023-04-24T23:44:14.265+0000 7f57697fd640 -1 received  signal: Terminated from systemd  (PID: 1) UID: 0
2023-04-24T23:44:14.265+0000 7f57697fd640 -1 osd.0 388 *** Got signal Terminated ***
2023-04-24T23:44:14.265+0000 7f57697fd640 -1 osd.0 388 *** Immediate shutdown (osd_fast_shutdown=true) ***

However, the OSD process is still running (I could strace it).

#2

Updated by Niklas Hambuechen about 1 year ago

The MDS:

2023-04-24T23:44:14.239+0000 7f1e46f96640 -1 received  signal: Terminated from systemd  (PID: 1) UID: 0
2023-04-24T23:44:14.239+0000 7f1e46f96640 -1 mds.test-node-1 *** got signal Terminated ***
2023-04-24T23:44:14.239+0000 7f1e46f96640  1 mds.test-node-1 suicide! Wanted state up:standby

The MONs print messages at regular intervals, e.g. one of them:

2023-04-25T01:27:03.697+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.0.133 (gid: 4285 addr: v1:10.0.0.2:6801/2522866916 state: up:active) since 6152.26
2023-04-25T01:27:03.697+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.-1.0 (gid: 114105 addr: v1:10.0.0.1:6801/323394775 state: up:standby) since 6152.26
2023-04-25T01:27:03.697+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.-1.0 (gid: 114645 addr: v1:10.0.0.3:6801/377201059 state: up:standby) since 6152.26

2023-04-25T01:27:08.701+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.0.133 (gid: 4285 addr: v1:10.0.0.2:6801/2522866916 state: up:active) since 6157.26
2023-04-25T01:27:08.701+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.-1.0 (gid: 114105 addr: v1:10.0.0.1:6801/323394775 state: up:standby) since 6157.26
2023-04-25T01:27:08.701+0000 7f26eb9fb640  1 mon.test-node-1@0(leader).mds e347 no beacon from mds.-1.0 (gid: 114645 addr: v1:10.0.0.3:6801/377201059 state: up:standby) since 6157.26

`ceph status` hangs forever on all 3 machines.

I should also mention that I encountered this while investigating https://tracker.ceph.com/issues/53751 further.

The reason `ceph status` hangs here is that, due to that issue, my mon is not listening on port 3300.

So it's definitely possible that I'm conflating two separate issues here.
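
Since the mons run locally on these nodes, a quick way to verify the port-3300 observation (3300/6789 are the standard msgr2/msgr1 ports; the host being localhost is an assumption) is:

```python
#!/usr/bin/env python3
# Quick check: is the local mon listening on the msgr2 (3300) and
# msgr1 (6789) ports? Host is assumed to be localhost here.
import socket

for port in (3300, 6789):
    s = socket.socket()
    s.settimeout(2)
    try:
        s.connect(("127.0.0.1", port))
        print(f"mon port {port}: reachable")
    except OSError as e:
        print(f"mon port {port}: {e}")
    finally:
        s.close()
```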
