Bug #44612

mgr stuck in activating state

Added by Ernesto Puerta about 4 years ago. Updated about 4 years ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Silent failure. No stack trace this time, no core dump (at least none that I could find)...

Not a crash, but a hang: the ceph-mgr process is still there but not responding to dashboard requests:

# ps ax | grep mgr
 3401 ?        Ssl    2:21 /usr/bin/ceph-mgr -i x -c /ceph/build.ceph-rpm2/ceph.conf

Additionally, ceph mgr dump reports a weird "active_addr": "(unrecognized address family 0)/0":

# ceph mgr dump
{
    "epoch": 29,
    "active_gid": 0,
    "active_name": "",
    "active_addrs": {
        "addrvec": []
    },
    "active_addr": "(unrecognized address family 0)/0",
    "active_change": "2020-03-15T20:57:25.498546+0000",
    "active_mgr_features": 0,
    "available": false,
    "standbys": [],
..

No ceph-mgr services are reported:

# ceph mgr services
{}

When a failover is forced:

# ceph mgr fail x
Daemon not found 'x', already failed?
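For scripting around this, the stuck state can be detected from the ceph mgr dump output shown above. A minimal sketch, assuming the JSON is handed in as a string (in practice you would capture `ceph mgr dump` via subprocess); the field names are taken from the dump in this report, the helper name mgr_is_stuck is mine:

```python
import json

def mgr_is_stuck(dump_json: str) -> bool:
    """True when the dump shows no active mgr (the state seen in this report)."""
    d = json.loads(dump_json)
    # A healthy cluster has a non-zero active_gid, a named active mgr,
    # and "available": true; here all three are missing/false.
    return (d.get("active_gid", 0) == 0
            or not d.get("active_name")
            or not d.get("available", False))

# Trimmed-down version of the dump shown above.
sample = """{
    "epoch": 29,
    "active_gid": 0,
    "active_name": "",
    "active_addr": "(unrecognized address family 0)/0",
    "available": false,
    "standbys": []
}"""
```

mgr_is_stuck(sample) is True for the dump in this report, and False for a dump with a live active mgr.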

The environment is:
  • RPM based install: ceph-base-15.1.0-1962.gbf6df4b.el8.x86_64
  • CentOS 8 docker container running vstart with 1 Mon, 1 Mgr and 3 OSDs. No RGW/MDS running.

I can trigger it by repeatedly restarting ceph-dashboard (though I don't think the restarts themselves matter) while keeping a specific dashboard page open ("Cluster -> Configuration", which triggers mgr.get('config_options')). That page is refreshed every 5 seconds by default; reducing the period to 1 second makes the hang show up sooner (non-deterministically, ranging from a few seconds after the restart to some minutes). It all seems to point to a race condition or deadlock.
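The reproduction loop amounts to hammering that dashboard endpoint at a short interval until the mgr stops answering. A hedged sketch of it below: the URL and endpoint path are hypothetical (they depend on the vstart setup), and the hammer helper is mine, not part of Ceph:

```python
import time
import urllib.request

def hammer(fetch, interval=1.0, max_iters=300):
    """Call fetch() every `interval` seconds; stop at the first failure.

    Returns the number of successful fetches before fetch() raised."""
    ok = 0
    for _ in range(max_iters):
        try:
            fetch()
        except Exception:
            break  # endpoint stopped responding -> mgr likely hung
        ok += 1
        time.sleep(interval)
    return ok

if __name__ == "__main__":
    # Hypothetical dashboard URL; adjust host/port/endpoint for your cluster.
    url = "https://localhost:8443/api/cluster_conf"
    n = hammer(lambda: urllib.request.urlopen(url, timeout=5))
    print(f"{n} successful requests before the mgr stopped responding")
```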

No relevant entries in the mgr log:

2020-03-13T20:09:22.836+0000 7fd6ed6b4700  1 --2- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6802/1099,v1:172.20.0.2:6803/1099] conn(0x557ae8770c00 0x557ae87c3680 secure :-1 s=READY pgs=22 cs=0 l=1 rx=0x557ae8630f90 tx=0x557ae85fae10).stop
2020-03-13T20:09:22.836+0000 7fd6eceb3700  1 -- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6818/1742,v1:172.20.0.2:6819/1742] conn(0x557ae8770000 msgr2=0x557ae856f200 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.
2020-03-13T20:09:22.836+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/2029205900 reap_dead start
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 54
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).stop
2020-03-13T20:09:25.354+0000 7fd6eeeb7700  1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault.

strace shows some activity: writing logs and sending and receiving network packets:

# sudo strace -e 'trace=!futex' -s120 -ff -p 29970

[pid 29976] <... epoll_wait resumed>[{EPOLLIN, {u32=17, u64=17}}], 5000, 30000) = 1
[pid 29976] read(17, "c", 256)          = 1
[pid 29976] read(17, 0x7f56fa8371b0, 256) = -1 EAGAIN (Resource temporarily unavailable)
[pid 29976] sendmsg(31, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="cz\t\333\241\24o\210\3\360\32\362\317f t\247\375\350b\326e\235\300\366T\27\330:\247\366\257\265+\210\211\223\235\311\3608{\206\35\217\5K\261'\200=\230\222\253,\232\0342\0\353F\v~\341", iov_len=64}, {iov_base="\365\10B\"\235\r1M\rG\241\262!M\177\225", iov_len=16}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 80
[pid 29976] epoll_wait(16, [{EPOLLIN, {u32=31, u64=31}}], 5000, 30000) = 1
[pid 29976] read(31, "\2\320\334\23I\200\274\247\21/\313P \273\274,&G\270~w\271\340!%\242\240$:\252B\"E\224t\3744y\217\213Q\306\271{(<R#6\214\367X\301\331&)\244\273'\360\5\227\324\232\37\262|{\24\203\336@\363\322\275\206b.\21\0", 4096) = 80
[pid 29976] read(31, 0x559139ef8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 29976] epoll_wait(16,  <unfinished ...>
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid  4576] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30039] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30039] getpid()                    = 3401
[pid 30039] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 29977] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30022] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30022] getpid()                    = 3401
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30043] <... restart_syscall resumed>) = 0
[pid 30043] poll([{fd=40, events=POLLIN}], 1, 500 <unfinished ...>
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30048] <... select resumed>)       = 0 (Timeout)
[pid 30079] <... select resumed>)       = 0 (Timeout)
[pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...>
[pid 30014] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 30014] write(15, "c", 1 <unfinished ...>
[pid 29971] write(5, "2020-03-16T09:56:00.178+0000 7f56de8e9700  0 log_channel(cluster) log [DBG] : pgmap v1529: 33 pgs: 33 active+clean; 0 B "..., 164 <unfinished ...>

FD 23 is:

# ls -l /proc/3401/fd/23 
... /proc/3401/fd/23 -> 'socket:[49145173]'

# sudo lsof -i -nP | grep 49145173
ceph-mgr  3401 root   23u  IPv4 49145173      0t0  TCP 172.20.0.2:49828->172.20.0.2:10000 (ESTABLISHED)
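The fd-to-socket mapping above can also be done without lsof by reading the /proc symlink directly. A small sketch, Linux-only; the helper name socket_inode is mine:

```python
import os
import re

def socket_inode(pid: int, fd: int):
    """Return the socket inode behind pid's fd, or None if it's not a socket."""
    # The /proc symlink target looks like 'socket:[49145173]' for sockets.
    target = os.readlink(f"/proc/{pid}/fd/{fd}")
    m = re.fullmatch(r"socket:\[(\d+)\]", target)
    return int(m.group(1)) if m else None
```

The returned inode can then be looked up in /proc/net/tcp (or fed to lsof as above for inode 49145173) to find the peer address.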


Files

mgr.log (162 KB) mgr.log Alec Blayne, 04/05/2020 01:47 PM

Related issues (1): 0 open, 1 closed

Related to Orchestrator - Bug #44642: cephadm: mgr dump might be too huge (Resolved, Sebastian Wagner)
