Bug #44612

open

mgr stuck in activating state

Added by Ernesto Puerta about 4 years ago. Updated about 4 years ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Silent crash. No stack trace this time, no core dump (at least none that I could find)...

Not a crash, but a hang. The ceph-mgr process is still there but not responding to dashboard requests:

# ps ax | grep mgr
 3401 ?        Ssl    2:21 /usr/bin/ceph-mgr -i x -c /ceph/build.ceph-rpm2/ceph.conf

Additionally, a "ceph mgr dump" gives a weird "active_addr": "(unrecognized address family 0)/0":

# ceph mgr dump
{
    "epoch": 29,
    "active_gid": 0,
    "active_name": "",
    "active_addrs": {
        "addrvec": []
    },
    "active_addr": "(unrecognized address family 0)/0",
    "active_change": "2020-03-15T20:57:25.498546+0000",
    "active_mgr_features": 0,
    "available": false,
    "standbys": [],
..

No ceph mgr services are reported:

# ceph mgr services
{}
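
For reference, a quick way to keep watching whether the mgr ever becomes available again (just a sketch, assuming jq is installed):

# Poll the mgr map once per second; the mgr is back to normal once
# "available" flips to true and "ceph mgr services" is non-empty again.
while true; do
    ceph mgr dump | jq -c '{epoch, available, active_name, active_addr}'
    ceph mgr services | jq -c .
    sleep 1
done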

When a failover is forced:

# ceph mgr fail x
Daemon not found 'x', already failed?

The environment is:
  • RPM based install: ceph-base-15.1.0-1962.gbf6df4b.el8.x86_64
  • CentOS 8 docker container running vstart with 1 Mon, 1 Mgr and 3 OSDs. No RGW/MDS running.

I tried to trigger it by repeatedly restarting ceph-dashboard a few times (I don't think this really matters) and by keeping a specific dashboard page open ("Cluster -> Configuration", which triggers mgr.get('config_options')). By default that page is refreshed every 5 seconds, but I reduced the period to 1 second and that makes the hang show up sooner (non-deterministic, ranging from a few seconds after the restart to some minutes). It all seems to point to some race condition/deadlock.
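
Restarting the dashboard repeatedly can be scripted along these lines (only a sketch; the iteration count and sleep are arbitrary, and the configuration page was kept open in the browser at the same time):

# Toggle the dashboard module in a loop while the "Cluster -> Configuration"
# page is open with a 1-second refresh interval.
for i in $(seq 1 20); do
    ceph mgr module disable dashboard
    ceph mgr module enable dashboard
    sleep 2
done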

No relevant entries in the mgr log:

2020-03-13T20:09:22.836+0000 7fd6ed6b4700  1 --2- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6802/1099,v1:172.20.0.2:6803/1099] conn(0x557ae8770c00 0x557ae87c3680 secure :-1 s=READY pgs=22 cs=0 l=1 rx=0x557ae8630f90 tx=0x557ae85fae10).stop
2020-03-13T20:09:22.836+0000 7fd6eceb3700  1 -- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6818/1742,v1:172.20.0.2:6819/1742] conn(0x557ae8770000 msgr2=0x557ae856f200 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.
2020-03-13T20:09:22.836+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/2029205900 reap_dead start
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 54
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).stop
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.

strace shows some activity: writing logs and sending/receiving network packets:

# sudo strace -e 'trace=!futex' -s120 -ff -p 29970

[pid 29976] <... epoll_wait resumed>[{EPOLLIN, {u32=17, u64=17}}], 5000, 30000) = 1
[pid 29976] read(17, "c", 256)          = 1
[pid 29976] read(17, 0x7f56fa8371b0, 256) = -1 EAGAIN (Resource temporarily unavailable)
[pid 29976] sendmsg(31, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="cz\t\333\241\24o\210\3\360\32\362\317f t\247\375\350b\326e\235\300\366T\27\330:\247\366\257\265+\210\211\223\235\311\3608{\206\35\217\5K\261'\200=\230\222\253,\232\0342\0\353F\v~\341", iov_len=64}, {iov_base="\365\10B\"\235\r1M\rG\241\262!M\177\225", iov_len=16}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 80
[pid 29976] epoll_wait(16, [{EPOLLIN, {u32=31, u64=31}}], 5000, 30000) = 1
[pid 29976] read(31, "\2\320\334\23I\200\274\247\21/\313P \273\274,&G\270~w\271\340!%\242\240$:\252B\"E\224t\3744y\217\213Q\306\271{(<R#6\214\367X\301\331&)\244\273'\360\5\227\324\232\37\262|{\24\203\336@\363\322\275\206b.\21\0", 4096) = 80
[pid 29976] read(31, 0x559139ef8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 29976] epoll_wait(16,  <unfinished ...>
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid  4576] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30039] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30039] getpid()                    = 3401
[pid 30039] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 29977] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30022] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30043] <... restart_syscall resumed>) = 0
[pid 30043] poll([{fd=40, events=POLLIN}], 1, 500 <unfinished ...>
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30014] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30014] write(15, "c", 1 <unfinished ...>
[pid 29971] write(5, "2020-03-16T09:56:00.178+0000 7f56de8e9700  0 log_channel(cluster) log [DBG] : pgmap v1529: 33 pgs: 33 active+clean; 0 B "..., 164 <unfinished ...>

FD 23 is

# ls -l /proc/3401/fd/23 
... /proc/3401/fd/23 -> 'socket:[49145173]'

# sudo lsof -i -nP | grep 49145173
ceph-mgr  3401 root   23u  IPv4 49145173      0t0  TCP 172.20.0.2:49828->172.20.0.2:10000 (ESTABLISHED)
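
Since this looks more like a deadlock than a crash, a full thread backtrace of the hung process would probably be more useful than strace. A sketch (assuming gdb and the ceph debuginfo packages are installed; py-bt additionally requires the python3 gdb extension):

# Dump the C++ stacks of all threads of the hung ceph-mgr (PID from above).
gdb -p 3401 -batch -ex 'thread apply all bt' > mgr-threads.txt
# If the python3 gdb extension is available, the Python-level stacks too:
# gdb -p 3401 -batch -ex 'thread apply all py-bt' > mgr-pystacks.txt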


Files

mgr.log (162 KB) - Alec Blayne, 04/05/2020 01:47 PM

Related issues: 1 (0 open, 1 closed)

Related to Orchestrator - Bug #44642: cephadm: mgr dump might be too huge (Resolved) - Sebastian Wagner

Actions #1

Updated by Lenz Grimmer about 4 years ago

  • Regression changed from No to Yes
  • Severity changed from 3 - minor to 2 - major
  • Affected Versions v15.0.0 added
Actions #2

Updated by Ernesto Puerta about 4 years ago

  • Subject changed from mgr: crashes strike back to mgr: ceph-mgr hangs, "active_addr: (unrecognized address family 0)/0"
  • Description updated (diff)
  • Priority changed from Normal to High
Actions #3

Updated by Sebastian Wagner about 4 years ago

  • Related to Bug #44642: cephadm: mgr dump might be too huge added
Actions #4

Updated by Sage Weil about 4 years ago

  • Subject changed from mgr: ceph-mgr hangs, "active_addr: (unrecognized address family 0)/0" to mgr stuck in activating state
  • Status changed from New to Need More Info

The key thing here is that available=false in the mgr dump above. This is a normal, transient state between the point when the old mgr fails and the mon tells the new mgr to take over, and the point when the new mgr has started up all of the python modules and is ready+available.

I'm not sure why the mgr would get stuck in this state, though... it would be helpful to see the full mgr log.

Actions #5

Updated by Ernesto Puerta about 4 years ago

I'll try to reproduce it. Nevertheless, take into account that this is a single-mgr deployment and no failover has been manually triggered (the module reloading happened long before this).

The ceph-mgr logs at the default log level were not very informative (just pgmap version updates), and at maximum verbosity they are too noisy to sift through (I will try to capture and upload them anyway).
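
A possible middle ground is to raise verbosity only for the mgr and messenger subsystems instead of everything (a sketch; the exact levels are a guess at what is useful here):

# Bump mgr/messenger debug, reproduce the hang, grab the mgr log, then revert.
ceph config set mgr debug_mgr 20
ceph config set mgr debug_ms 1
# ... reproduce and capture the ceph-mgr log ...
ceph config rm mgr debug_mgr
ceph config rm mgr debug_ms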

Actions #6

Updated by Alec Blayne about 4 years ago

I'm suffering from this issue after upgrading from 14.2.8 to 15.2.0 (Gentoo system).
All three mgrs hang exactly as described, every time.
I'm attaching the output from starting a ceph-mgr at debug level 10.
