Bug #58379

closed

no active mgr after ~1 hour

Added by Nitzan Mordechai over 1 year ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Category: Monitor
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): MonClient
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After checking the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2106031
I was able to reproduce the issue on the main branch, running a regular vstart cluster with 3 OSDs, 1 mon, and 3 mgrs.
I also enabled some mgr modules: prometheus, cephadm, alerts.
I added the config ms_inject_socket_failures = 200 and let the cluster run for a while.
ceph -s reports that there is no active mgr, and the mgr logs show that the mgrbeacon was no longer sent after some time. The last send_beacon shows:

2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): get_auth_request con 0x55683022ec00 auth_method 0
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): get_auth_request method 2 preferred_modes [2,1]
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): _init_auth method 2
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): _init_auth already have auth, reseting
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more payload 9
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more payload_len 9
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more responding with 132 bytes
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_done global_id 4377 payload 293
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 -- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 msgr2=0x556837226680 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 --2- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 0x556837226680 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient: _finish_hunting -11
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 monclient: no mon sessions established

From that tick on, the monclient won't send any mgrbeacon:

2022-12-29T10:43:52.373+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:52.373+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377
2022-12-29T10:43:52.820+0000 7ffaf0582700 10 monclient: tick
2022-12-29T10:43:52.820+0000 7ffaf0582700 10 monclient: _check_auth_tickets
2022-12-29T10:43:53.212+0000 7ffaefd81700  1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x556836d4ea80 con 0x556837074000
2022-12-29T10:43:54.212+0000 7ffaefd81700  1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x5568314ddc00 con 0x556837074000
2022-12-29T10:43:54.385+0000 7ffaedd7d700 10 mgr tick tick
2022-12-29T10:43:54.385+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:54.385+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377
2022-12-29T10:43:55.213+0000 7ffaefd81700  1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x5568372ef500 con 0x556837074000
2022-12-29T10:43:55.821+0000 7ffaf0582700 10 monclient: tick
2022-12-29T10:43:55.821+0000 7ffaf0582700 10 monclient: _check_auth_tickets
2022-12-29T10:43:56.213+0000 7ffaefd81700  1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x556836fae000 con 0x556837074000
2022-12-29T10:43:56.398+0000 7ffaedd7d700 10 mgr tick tick
2022-12-29T10:43:56.398+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:56.398+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377


Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #61740: pacific: no active mgr after ~1 hour (Resolved, Radoslaw Zarzynski)
Copied to RADOS - Backport #61741: quincy: no active mgr after ~1 hour (Resolved, Radoslaw Zarzynski)
Copied to RADOS - Backport #61742: reef: no active mgr after ~1 hour (Resolved, Radoslaw Zarzynski)
Actions #1

Updated by Nitzan Mordechai over 1 year ago

When the following messages appear:

2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more responding with 132 bytes
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_done global_id 4377 payload 293
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 -- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 msgr2=0x556837226680 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 --2- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 0x556837226680 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient: _finish_hunting -11
2022-12-29T10:43:51.921+0000 7ffaf6d8f700  1 monclient: no mon sessions established

that means that handle_auth_done returned -11 (EAGAIN), but we are also closing the connection, so active_con stays unassigned.
The next tick checks whether we are hunting or have an active_con. Since the session is closed, we do nothing and no longer send_beacon, so the mon will consider the mgr inactive.
My suggestion is to check at tick time whether we are opened, and if not, reopen the session:

  if (!_opened()) {
    ldout(cct, 10) << __func__ << " not opened." << dendl;
    _reopen_session();
  }
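
To illustrate why the check above helps, here is a minimal standalone sketch of the state machine described in this comment. It is a toy model, not Ceph code: ToyMonClient, its members, and the recovers() helper are hypothetical names made up for this example, and the assumption that a retry succeeds immediately is a simplification.

```cpp
// Toy model only: ToyMonClient and everything in it are hypothetical
// stand-ins for illustration, not the real Ceph MonClient API.
#include <cassert>
#include <optional>

struct ToyMonClient {
  bool hunting = false;           // true while trying to (re)connect to a mon
  std::optional<int> active_con;  // id of the established mon connection
  int pending_cons = 0;           // connections still being tried
  int reopen_count = 0;           // how many times a session was (re)opened

  // Rough equivalent of "opened": at least one connection exists.
  bool opened() const { return active_con.has_value() || pending_cons > 0; }

  void reopen_session() {
    active_con.reset();
    pending_cons = 1;
    hunting = true;
    ++reopen_count;
  }

  // Models the failure in the log: auth ends with an error, the connection
  // is marked down, and hunting finishes without an active connection.
  void fail_hunting() {
    pending_cons = 0;
    active_con.reset();
    hunting = false;
  }

  // Models a successful hunt.
  void finish_hunting_ok(int con_id) {
    pending_cons = 0;
    active_con = con_id;
    hunting = false;
  }

  // Periodic tick. Without the fix it only acts while hunting or while an
  // active connection exists, so a closed session stays closed forever.
  void tick(bool with_fix) {
    if (with_fix && !opened()) {
      reopen_session();  // the suggested check
      return;
    }
    if (hunting) {
      // would keep trying other mons here
    } else if (active_con) {
      // would send keepalives / check auth tickets here
    }
    // neither hunting nor connected: nothing happens
  }

  bool can_send_beacon() const { return active_con.has_value(); }
};

// Returns true if the client regains a mon session within max_ticks.
bool recovers(bool with_fix, int max_ticks) {
  ToyMonClient mc;
  mc.reopen_session();
  mc.fail_hunting();  // the state seen in the log above
  for (int i = 0; i < max_ticks; ++i) {
    mc.tick(with_fix);
    if (mc.hunting) mc.finish_hunting_ok(1);  // assume the retry succeeds
    if (mc.can_send_beacon()) return true;
  }
  return false;
}

// Small self-check that can be called from main().
void self_check() {
  assert(recovers(true, 5));      // with the check, the session is reopened
  assert(!recovers(false, 100));  // without it, the client never recovers
}
```

In this model, the client without the check stays in the "not hunting, no active_con" state no matter how many ticks pass, which matches the behaviour in the logs where tick and _check_auth_tickets keep running but no mon session is re-established.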
Actions #2

Updated by Nitzan Mordechai over 1 year ago

  • Status changed from New to In Progress
Actions #3

Updated by Nitzan Mordechai over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 49783
Actions #4

Updated by Vikhyat Umrao about 1 year ago

  • Backport set to pacific, quincy
Actions #5

Updated by Radoslaw Zarzynski about 1 year ago

Review-in-progress.

Actions #6

Updated by Nitzan Mordechai 12 months ago

  • Pull request ID changed from 49783 to 51424
Actions #7

Updated by Radoslaw Zarzynski 10 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Radoslaw Zarzynski 10 months ago

  • Backport changed from pacific, quincy to pacific,quincy,reef
Actions #9

Updated by Backport Bot 10 months ago

Actions #10

Updated by Backport Bot 10 months ago

Actions #11

Updated by Backport Bot 10 months ago

Actions #12

Updated by Backport Bot 10 months ago

  • Tags set to backport_processed
Actions #13

Updated by Neha Ojha 5 months ago

  • Status changed from Pending Backport to Resolved