Bug #38228
Bug #38094: mgr: crash list (closed)
mgr: NO TRACE DUMP - "reap_dead start"
Description
vstart. CentOS7/container.
Last messages in ceph-mgr log:
...
2019-02-07 19:08:11.506 7fc272a4b700 4 mgr.server handle_report from 0x5603d19ebc00 osd,2
2019-02-07 19:08:11.506 7fc272a4b700 20 mgr.server handle_report updating existing DaemonState for osd,2
2019-02-07 19:08:11.506 7fc272a4b700 20 mgr update loading 0 new types, 0 old types, had 234 types, got 782 bytes of data
2019-02-07 19:08:11.506 7fc272a4b700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
2019-02-07 19:08:11.507 7fc272a4b700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] <== osd.2 v2:172.20.0.5:6818/1723 191 ==== pg_stats(9 pgs tid 0 v 0) v2 ==== 6333+0+0 (3818040124 0 0) 0x5603d1aba000 con 0x5603d19ebc00
2019-02-07 19:08:13.344 7fc285098700 1 -- v2:172.20.0.5:0/3190 <== mon.0 v2:172.20.0.5:10000/0 3729 ==== mgrdigest v1 ==== 1614+0+0 (887795601 0 0) 0x5603d24d9b00 con 0x5603ce3aec00
2019-02-07 19:08:18.391 7fc284096700 10 monclient: tick
2019-02-07 19:08:18.391 7fc284096700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-02-07 19:07:48.392750)
2019-02-07 19:08:18.391 7fc284096700 10 log_client log_queue is 1 last_log 2037 sent 2036 num 1 unsent 1 sending 1
2019-02-07 19:08:18.391 7fc284096700 10 log_client will send 2019-02-07 19:08:08.876844 mgr.x (mgr.34343) 2037 : cluster [DBG] pgmap v2032: 48 pgs: 48 active+clean; 6.4 KiB data, 207 MiB used, 27 GiB / 30 GiB avail
2019-02-07 19:08:18.391 7fc284096700 10 monclient: _send_mon_message to mon.a at v2:172.20.0.5:10000/0
2019-02-07 19:08:18.391 7fc284096700 1 -- v2:172.20.0.5:0/3190 --> [v2:172.20.0.5:10000/0,v1:172.20.0.5:10001/0] -- log(1 entries from seq 2037 at 2019-02-07 19:08:08.876844) v1 -- 0x5603d24d6b40 con 0x5603ce3aec00
2019-02-07 19:08:41.207 7fc28889f700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6802/909,v1:172.20.0.5:6803/909] conn(0x5603d184d800 msgr2=0x5603d1b34800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor
2019-02-07 19:08:41.207 7fc28889f700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6802/909,v1:172.20.0.5:6803/909] conn(0x5603d184d800 msgr2=0x5603d1b34800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
...
2019-02-07 19:08:41.219 7fc28a0a2700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/3665159472 conn(0x5603d184dc00 0x5603d2215700 :-1 s=READY pgs=16 cs=0 l=1).stop
2019-02-07 19:08:41.219 7fc28809e700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 msgr2=0x5603d140a800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close fi
2019-02-07 19:08:41.219 7fc28809e700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 msgr2=0x5603d140a800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.219 7fc28809e700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 0x5603d140a800 :-1 s=READY pgs=28 cs=0 l=1).handle_read_frame_length_and_tag read
2019-02-07 19:08:41.219 7fc28809e700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 0x5603d140a800 :-1 s=READY pgs=28 cs=0 l=1).stop
2019-02-07 19:08:41.224 7fc28809e700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 msgr2=0x5603d2883e00 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 47
2019-02-07 19:08:41.224 7fc28809e700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 msgr2=0x5603d2883e00 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.224 7fc28809e700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 0x5603d2883e00 :-1 s=READY pgs=10 cs=0 l=1).handle_read_frame_length_and_tag read frame length and tag failed r=-1 ((1) Oper
2019-02-07 19:08:41.224 7fc28809e700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 0x5603d2883e00 :-1 s=READY pgs=10 cs=0 l=1).stop
2019-02-07 19:08:41.224 7fc28809e700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] reap_dead start
2019-02-07 19:08:41.226 7fc28889f700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 msgr2=0x5603d2885200 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 49
2019-02-07 19:08:41.226 7fc28889f700 1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 msgr2=0x5603d2885200 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.226 7fc28889f700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 0x5603d2885200 :-1 s=READY pgs=10 cs=0 l=1).handle_read_frame_length_and_tag read frame length and tag failed r=-1 ((1) Oper
2019-02-07 19:08:41.226 7fc28889f700 1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 0x5603d2885200 :-1 s=READY pgs=10 cs=0 l=1).stop
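With no trace dump available, one way to narrow down where the process stopped is to look at the last entry each thread wrote. The following is only a rough sketch: it assumes each log entry starts with a date, a time, a hex thread id and a numeric debug level, as in the excerpt above, and the function name `last_entry_per_thread` is made up for illustration.

```python
import re

# Assumed entry layout (taken from the excerpt above, not from any Ceph spec):
#   <date> <time> <thread id in hex> <debug level> <message>
ENTRY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+) +(?P<level>-?\d+) (?P<msg>.*)$"
)


def last_entry_per_thread(lines):
    """Return {thread_id: (timestamp, message)} for the last entry of each thread.

    Lines that do not match the assumed layout (for example "...") are skipped.
    """
    last = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            last[match.group("thread")] = (match.group("ts"), match.group("msg"))
    return last


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python last_entries.py mgr.x.log
    with open(sys.argv[1], errors="replace") as log:
        for thread, (ts, msg) in last_entry_per_thread(log).items():
            print(thread, ts, msg)
```

Threads whose last entry is much older than the end of the log, or that end on a messenger line such as "reap_dead start", are candidates for a closer look. This does not replace a backtrace.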
Updated by Ernesto Puerta about 5 years ago
Another occurrence of this one. No trace dump either. Same "reap_dead start" and v2 mon connection messages.
Updated by Kefu Chai about 5 years ago
Ernesto, how did this crash the mgr? And if it did, I think it would be a bug in msgr, right?
Updated by Ernesto Puerta about 5 years ago
Hey Kefu, I'm not familiar with msgr2 issues, so I'm not sure whether that last trace points to a problem there. When I found that the mgr was not running, no trace dump had been printed to the mgr log, and the last meaningful messages there were related to v2 connection state changes. As you may know, there have been lots of crashes around the mgr (Sage recently merged a PR that has probably fixed all of these). Is there any ongoing v2 issue related to the above traces?
Updated by Ernesto Puerta about 5 years ago
- Status changed from New to Closed
- Priority changed from High to Normal
- Severity changed from 2 - major to 3 - minor
Closing this as this hasn't happened again.