Bug #38228

closed

Bug #38094: mgr: crash list

mgr: NO TRACE DUMP - "reap_dead start"

Added by Ernesto Puerta about 5 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Environment: vstart, CentOS 7 (container).

Last messages in ceph-mgr log:

...
2019-02-07 19:08:11.506 7fc272a4b700  4 mgr.server handle_report from 0x5603d19ebc00 osd,2           
2019-02-07 19:08:11.506 7fc272a4b700 20 mgr.server handle_report updating existing DaemonState for osd,2
2019-02-07 19:08:11.506 7fc272a4b700 20 mgr update loading 0 new types, 0 old types, had 234 types, got 782 bytes of data
2019-02-07 19:08:11.506 7fc272a4b700 10 mgr.server handle_report daemon_health_metrics [SLOW_OPS(0|(0,0)),PENDING_CREATING_PGS(0|(0,0))]
2019-02-07 19:08:11.507 7fc272a4b700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] <== osd.2 v2:172.20.0.5:6818/1723 191 ==== pg_stats(9 pgs tid 0 v 0) v2 ==== 6333+0+0 (3818040124 0 0) 0x5603d1aba000 con 0x5603d19ebc00
2019-02-07 19:08:13.344 7fc285098700  1 -- v2:172.20.0.5:0/3190 <== mon.0 v2:172.20.0.5:10000/0 3729 ==== mgrdigest v1 ==== 1614+0+0 (887795601 0 0) 0x5603d24d9b00 con 0x5603ce3aec00
2019-02-07 19:08:18.391 7fc284096700 10 monclient: tick                                              
2019-02-07 19:08:18.391 7fc284096700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-02-07 19:07:48.392750)
2019-02-07 19:08:18.391 7fc284096700 10 log_client  log_queue is 1 last_log 2037 sent 2036 num 1 unsent 1 sending 1
2019-02-07 19:08:18.391 7fc284096700 10 log_client  will send 2019-02-07 19:08:08.876844 mgr.x (mgr.34343) 2037 : cluster [DBG] pgmap v2032: 48 pgs: 48 active+clean; 6.4 KiB data, 207 MiB used, 27 GiB / 30 GiB avail
2019-02-07 19:08:18.391 7fc284096700 10 monclient: _send_mon_message to mon.a at v2:172.20.0.5:10000/0
2019-02-07 19:08:18.391 7fc284096700  1 -- v2:172.20.0.5:0/3190 --> [v2:172.20.0.5:10000/0,v1:172.20.0.5:10001/0] -- log(1 entries from seq 2037 at 2019-02-07 19:08:08.876844) v1 -- 0x5603d24d6b40 con 0x5603ce3aec00
2019-02-07 19:08:41.207 7fc28889f700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6802/909,v1:172.20.0.5:6803/909] conn(0x5603d184d800 msgr2=0x5603d1b34800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 
2019-02-07 19:08:41.207 7fc28889f700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6802/909,v1:172.20.0.5:6803/909] conn(0x5603d184d800 msgr2=0x5603d1b34800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
...
2019-02-07 19:08:41.219 7fc28a0a2700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/3665159472 conn(0x5603d184dc00 0x5603d2215700 :-1 s=READY pgs=16 cs=0 l=1).stop
2019-02-07 19:08:41.219 7fc28809e700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 msgr2=0x5603d140a800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close fi
2019-02-07 19:08:41.219 7fc28809e700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 msgr2=0x5603d140a800 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.219 7fc28809e700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 0x5603d140a800 :-1 s=READY pgs=28 cs=0 l=1).handle_read_frame_length_and_tag read 
2019-02-07 19:08:41.219 7fc28809e700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> [v2:172.20.0.5:6828/1707578525,v1:172.20.0.5:6829/1707578525] conn(0x5603d19ea800 0x5603d140a800 :-1 s=READY pgs=28 cs=0 l=1).stop
2019-02-07 19:08:41.224 7fc28809e700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 msgr2=0x5603d2883e00 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 47
2019-02-07 19:08:41.224 7fc28809e700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 msgr2=0x5603d2883e00 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.224 7fc28809e700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 0x5603d2883e00 :-1 s=READY pgs=10 cs=0 l=1).handle_read_frame_length_and_tag read frame length and tag failed r=-1 ((1) Oper
2019-02-07 19:08:41.224 7fc28809e700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/196 conn(0x5603d2670000 0x5603d2883e00 :-1 s=READY pgs=10 cs=0 l=1).stop
2019-02-07 19:08:41.224 7fc28809e700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] reap_dead start
2019-02-07 19:08:41.226 7fc28889f700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 msgr2=0x5603d2885200 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 49
2019-02-07 19:08:41.226 7fc28889f700  1 -- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 msgr2=0x5603d2885200 :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-02-07 19:08:41.226 7fc28889f700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 0x5603d2885200 :-1 s=READY pgs=10 cs=0 l=1).handle_read_frame_length_and_tag read frame length and tag failed r=-1 ((1) Oper
2019-02-07 19:08:41.226 7fc28889f700  1 --2- [v2:172.20.0.5:6800/3190,v1:172.20.0.5:6801/3190] >> v2:172.20.0.5:0/237 conn(0x5603d2669400 0x5603d2885200 :-1 s=READY pgs=10 cs=0 l=1).stop                                                                                                                                                                                                                                                                         
Actions #1

Updated by Ernesto Puerta about 5 years ago

Another occurrence of this one. No trace dump either. Same "reap_dead start" and v2 mon connection messages.

Actions #2

Updated by Kefu Chai about 5 years ago

Ernesto, how did this crash the mgr? And if it did, I think it should be a bug in msgr, right?

Actions #3

Updated by Ernesto Puerta about 5 years ago

Hey Kefu, I'm not familiar with msgr2 issues, so I'm not sure whether that last trace points to a problem there. When I found that the mgr was not running, no trace dump had been printed to the mgr log, and the last meaningful messages there were related to v2 connection state changes. As you may know, there have been lots of crashes around the mgr (Sage recently merged a PR that has probably fixed all of these). Is there any ongoing v2 issue related to the above traces?
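
For triaging a log that ends without a trace dump, it can help to see what each thread logged last. The following is only a rough sketch: it assumes every line follows the layout seen in the excerpt above (date, time, hex thread id, numeric level, message), and the names `parse_line` and `last_entry_per_thread` are hypothetical helpers, not part of Ceph.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional

# Layout assumed from the excerpt above:
#   <date> <time> <thread id (hex)> <level> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+"
    r"(?P<level>-?\d+)\s+"
    r"(?P<msg>.*)$"
)


class Entry(NamedTuple):
    ts: str
    thread: str
    level: int
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for lines that do not match (e.g. '...')."""
    m = LINE_RE.match(line.rstrip())
    if m is None:
        return None
    return Entry(m["ts"], m["thread"], int(m["level"]), m["msg"])


def last_entry_per_thread(lines: Iterable[str]) -> Dict[str, Entry]:
    """Map each thread id to the last entry it logged, in file order."""
    last: Dict[str, Entry] = {}
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            last[entry.thread] = entry
    return last
```

Called as `last_entry_per_thread(open(path))` on the mgr log (where `path` is whatever file the mgr was writing to), this gives one entry per thread, which makes it easier to see whether the messenger threads or some other thread was the last to write.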

Actions #4

Updated by Ernesto Puerta about 5 years ago

  • Status changed from New to Closed
  • Priority changed from High to Normal
  • Severity changed from 2 - major to 3 - minor

Closing this as this hasn't happened again.
