Project

General

Profile

Bug #44612

Updated by Ernesto Puerta about 4 years ago

-Silent Silent crash. No stack trace this time, no core dump core-dump (at least not that I could find of)...- of)... 

 Not a crash, but a hang. The ceph-mgr process is still there but not responding to dashboard requests: 

 <pre> 
 # ps ax | grep mgr 
  3401 ?          Ssl      2:21 /usr/bin/ceph-mgr -i x -c /ceph/build.ceph-rpm2/ceph.conf 
 </pre> 

 Additionally a ceph-mgr dump, gives a weird @"active_addr": "(unrecognized address family 0)/0"@: 

 <pre> 
 # ceph mgr dump 
 { 
     "epoch": 29, 
     "active_gid": 0, 
     "active_name": "", 
     "active_addrs": { 
         "addrvec": [] 
     }, 
     "active_addr": "(unrecognized address family 0)/0", 
     "active_change": "2020-03-15T20:57:25.498546+0000", 
     "active_mgr_features": 0, 
     "available": false, 
     "standbys": [], 
 .. 
 </pre> 


 No ceph mgr services are reported: 

 <pre> 
 # ceph mgr services 
 {} 
 </pre> 

 When a failover is forced: 

 <pre> 
 # ceph mgr fail x 
 Daemon not found 'x', already failed? 
 </pre> 

 The environment is: 
 * RPM based install: ceph-base-15.1.0-1962.gbf6df4b.el8.x86_64 
 * CentOS 8 docker container running vstart with 1 Mon, 1 Mgr and 3 OSDs. No RGW/MDS running. 

 I tried managed to trigger it by repeatedly restarting ceph-dashboard a few times (I don't think this one really matters), times, and with some specific dashboard page ("cluster -> configuration" page, which triggers mgr.get('config_options')). That's by default refreshed every 5 secs, but I reduced that period down to 1 sec and that makes the crash show up sooner (non deterministic, ranges from a few seconds after the restart to some minutes). It all seems to point to some race condition/deadlock? 

 No relevant logs at mgr log: 
 <pre> 
 2020-03-13T20:09:22.836+0000 7fd6ed6b4700    1 --2- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6802/1099,v1:172.20.0.2:6803/1099] conn(0x557ae8770c00 0x557ae87c3680 secure :-1 s=READY pgs=22 cs=0 l=1 rx=0x557ae8630f90 tx=0x557ae85fae10).stop 
 2020-03-13T20:09:22.836+0000 7fd6eceb3700    1 -- 172.20.0.2:0/2029205900 >> [v2:172.20.0.2:6818/1742,v1:172.20.0.2:6819/1742] conn(0x557ae8770000 msgr2=0x557ae856f200 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault. 
 2020-03-13T20:09:22.836+0000 7fd6eeeb7700    1 -- 172.20.0.2:0/2029205900 reap_dead start 
 2020-03-13T20:09:25.354+0000 7fd6eeeb7700    1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 54 
 2020-03-13T20:09:25.354+0000 7fd6eeeb7700    1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed 
 2020-03-13T20:09:25.354+0000 7fd6eeeb7700    1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted) 
 2020-03-13T20:09:25.354+0000 7fd6eeeb7700    1 --2- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 0x557ae8ba6480 secure :-1 s=READY pgs=25 cs=0 l=1 rx=0x557ae8b88960 tx=0x557ae8b1cf00).stop 
 2020-03-13T20:09:25.354+0000 7fd6eeeb7700    1 -- 172.20.0.2:0/3138367003 >> [v2:172.20.0.2:6810/1425,v1:172.20.0.2:6811/1425] conn(0x557ae89f3400 msgr2=0x557ae8ba6480 unknown :-1 s=STATE_CLOSED l=1).tick see no progress in more than 10000000 us during connecting, fault. 
 </pre> 


 @strace@ shows some activity: writing logs and sending and receiving network packtets 
 <pre> 
 # sudo strace -e 'trace=!futex' -s120 -ff -p 29970 

 [pid 29976] <... epoll_wait resumed>[{EPOLLIN, {u32=17, u64=17}}], 5000, 30000) = 1 
 [pid 29976] read(17, "c", 256)            = 1 
 [pid 29976] read(17, 0x7f56fa8371b0, 256) = -1 EAGAIN (Resource temporarily unavailable) 
 [pid 29976] sendmsg(31, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="cz\t\333\241\24o\210\3\360\32\362\317f t\247\375\350b\326e\235\300\366T\27\330:\247\366\257\265+\210\211\223\235\311\3608{\206\35\217\5K\261'\200=\230\222\253,\232\0342\0\353F\v~\341", iov_len=64}, {iov_base="\365\10B\"\235\r1M\rG\241\262!M\177\225", iov_len=16}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 80 
 [pid 29976] epoll_wait(16, [{EPOLLIN, {u32=31, u64=31}}], 5000, 30000) = 1 
 [pid 29976] read(31, "\2\320\334\23I\200\274\247\21/\313P \273\274,&G\270~w\271\340!%\242\240$:\252B\"E\224t\3744y\217\213Q\306\271{(<R#6\214\367X\301\331&)\244\273'\360\5\227\324\232\37\262|{\24\203\336@\363\322\275\206b.\21\0", 4096) = 80 
 [pid 29976] read(31, 0x559139ef8000, 4096) = -1 EAGAIN (Resource temporarily unavailable) 
 [pid 29976] epoll_wait(16,    <unfinished ...> 
 [pid 30079] <... select resumed>)         = 0 (Timeout) 
 [pid 30048] <... select resumed>)         = 0 (Timeout) 
 [pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid    4576] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out) 
 [pid 30039] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out) 
 [pid 30039] getpid()                      = 3401 
 [pid 30039] stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0 
 [pid 30079] <... select resumed>)         = 0 (Timeout) 
 [pid 30048] <... select resumed>)         = 0 (Timeout) 
 [pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 29977] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out) 
 [pid 30022] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out) 
 [pid 30022] getpid()                      = 3401 
 [pid 30022] getpid()                      = 3401 
 [pid 30022] getpid()                      = 3401 
 [pid 30022] getpid()                      = 3401 
 [pid 30079] <... select resumed>)         = 0 (Timeout) 
 [pid 30048] <... select resumed>)         = 0 (Timeout) 
 [pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30043] <... restart_syscall resumed>) = 0 
 [pid 30043] poll([{fd=40, events=POLLIN}], 1, 500 <unfinished ...> 
 [pid 30048] <... select resumed>)         = 0 (Timeout) 
 [pid 30079] <... select resumed>)         = 0 (Timeout) 
 [pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30048] <... select resumed>)         = 0 (Timeout) 
 [pid 30079] <... select resumed>)         = 0 (Timeout) 
 [pid 30048] select(38, [37], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30079] select(39, [38], [], [], {tv_sec=0, tv_usec=100000} <unfinished ...> 
 [pid 30014] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out) 
 [pid 30014] write(15, "c", 1 <unfinished ...> 
 [pid 29971] write(5, "2020-03-16T09:56:00.178+0000 7f56de8e9700    0 log_channel(cluster) log [DBG] : pgmap v1529: 33 pgs: 33 active+clean; 0 B "..., 164 <unfinished ...> 
 </pre> 

 FD 23 is  
 <pre> 
 # ls -l /proc/3401/fd/23  
 ... /proc/3401/fd/23 -> 'socket:[49145173]' 

 # sudo lsof -i -nP | grep 49145173 
 ceph-mgr    3401 root     23u    IPv4 49145173        0t0    TCP 172.20.0.2:49828->172.20.0.2:10000 (ESTABLISHED) 
 </pre> Any suggestions to further debug this? valgrind (but I don't see SEGFAULTs)? installing debug symbols?

Back