Bug #58379
closed
no active mgr after ~1 hour
Description
After checking the BZ https://bugzilla.redhat.com/show_bug.cgi?id=2106031, I was able to recreate the issue on the main branch by running a regular vstart cluster with 3 OSDs, 1 mon and 3 mgrs.
I also enabled some mgr modules: prometheus, cephadm, alerts.
I added the config ms_inject_socket_failures = 200 and let the cluster run for a while.
ceph -s reports that there is no active mgr, and the mgr logs show that the mgrbeacon was no longer sent after some time. The last send_beacon shows:
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): get_auth_request con 0x55683022ec00 auth_method 0
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): get_auth_request method 2 preferred_modes [2,1]
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): _init_auth method 2
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): _init_auth already have auth, reseting
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more payload 9
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more payload_len 9
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more responding with 132 bytes
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_done global_id 4377 payload 293
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 -- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 msgr2=0x556837226680 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 --2- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 0x556837226680 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient: _finish_hunting -11
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 monclient: no mon sessions established
From that tick on, the monclient won't send any mgrbeacon:
2022-12-29T10:43:52.373+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:52.373+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377
2022-12-29T10:43:52.820+0000 7ffaf0582700 10 monclient: tick
2022-12-29T10:43:52.820+0000 7ffaf0582700 10 monclient: _check_auth_tickets
2022-12-29T10:43:53.212+0000 7ffaefd81700 1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x556836d4ea80 con 0x556837074000
2022-12-29T10:43:54.212+0000 7ffaefd81700 1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x5568314ddc00 con 0x556837074000
2022-12-29T10:43:54.385+0000 7ffaedd7d700 10 mgr tick tick
2022-12-29T10:43:54.385+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:54.385+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377
2022-12-29T10:43:55.213+0000 7ffaefd81700 1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x5568372ef500 con 0x556837074000
2022-12-29T10:43:55.821+0000 7ffaf0582700 10 monclient: tick
2022-12-29T10:43:55.821+0000 7ffaf0582700 10 monclient: _check_auth_tickets
2022-12-29T10:43:56.213+0000 7ffaefd81700 1 -- 172.21.5.153:0/2365183 --> [v2:172.21.5.153:6800/2365221,v1:172.21.5.153:6801/2365221] -- mgrreport(unknown.y +0-0 packed 54) v9 -- 0x556836fae000 con 0x556837074000
2022-12-29T10:43:56.398+0000 7ffaedd7d700 10 mgr tick tick
2022-12-29T10:43:56.398+0000 7ffaedd7d700 20 mgr send_beacon standby
2022-12-29T10:43:56.398+0000 7ffaedd7d700 10 mgr send_beacon sending beacon as gid 4377
Updated by Nitzan Mordechai over 1 year ago
When the log shows these messages:
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_reply_more responding with 132 bytes
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient(hunting): handle_auth_done global_id 4377 payload 293
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 -- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 msgr2=0x556837226680 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 --2- 172.21.5.153:0/2365183 >> [v2:172.21.5.153:40181/0,v1:172.21.5.153:40182/0] conn(0x55683022ec00 0x556837226680 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).stop
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 10 monclient: _finish_hunting -11
2022-12-29T10:43:51.921+0000 7ffaf6d8f700 1 monclient: no mon sessions established
it means that handle_auth_done returned -11 (EAGAIN), but we also close the connection, so active_con stays unassigned.
The next tick checks whether we are hunting or have an active_con. Since the session is closed, neither is true, so the tick does nothing and no further beacons are sent; the mon then treats the mgr as inactive.
My suggestion is to check at tick time whether we are opened and, if not, reopen the session:
if (!_opened()) {
  ldout(cct, 10) << __func__ << " not opened." << dendl;
  _reopen_session();
}
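To make the stuck state easier to see, here is a minimal, self-contained sketch of the behaviour described above. It is a toy model, not Ceph's MonClient: ToyMonClient, apply_fix, beacons_delivered and simulate are hypothetical names invented for illustration, and the model assumes that a hunting attempt started by a tick succeeds before the next beacon is due.

```cpp
#include <cassert>

// Toy model of the state described in this report. NOT Ceph code:
// every name below is hypothetical and only mirrors the report's wording.
struct ToyMonClient {
  bool hunting = false;      // currently trying to reach a monitor
  bool active_con = false;   // an established monitor session exists
  bool apply_fix = false;    // enable the proposed "reopen if not opened" check
  int beacons_delivered = 0; // beacons that actually reached a monitor

  // Assumed meaning of "opened": either hunting or holding a session.
  bool opened() const { return hunting || active_con; }

  void reopen_session() {
    active_con = false;
    hunting = true;
  }

  // Hunting ends with an error (e.g. -EAGAIN) and the connection is closed,
  // so the client is neither hunting nor connected.
  void finish_hunting_with_error() {
    hunting = false;
    active_con = false;
  }

  void finish_hunting_ok() {
    hunting = false;
    active_con = true;
  }

  void tick() {
    // Proposed check: recover when no session is open at tick time.
    if (apply_fix && !opened()) {
      reopen_session();
    }
    // Without the check, a client that is neither hunting nor connected
    // has nothing to do here and stays stuck.
  }

  // A beacon only reaches the monitor when a session exists.
  void send_beacon() {
    if (active_con) {
      ++beacons_delivered;
    }
  }
};

// Runs `ticks` tick/beacon rounds after a failed hunt and returns the
// number of beacons delivered.
int simulate(bool apply_fix, int ticks) {
  ToyMonClient mc;
  mc.apply_fix = apply_fix;
  mc.reopen_session();
  mc.finish_hunting_with_error(); // the situation seen in the log above
  for (int i = 0; i < ticks; ++i) {
    mc.tick();
    if (mc.hunting) {
      mc.finish_hunting_ok(); // assumption: the retry succeeds
    }
    mc.send_beacon();
  }
  return mc.beacons_delivered;
}
```

In this model, simulate(false, 5) delivers no beacons because nothing ever reopens the session, while simulate(true, 5) recovers on the first tick and delivers a beacon on every round.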
Updated by Nitzan Mordechai over 1 year ago
- Status changed from New to In Progress
Updated by Nitzan Mordechai over 1 year ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 49783
Updated by Nitzan Mordechai 12 months ago
- Pull request ID changed from 49783 to 51424
Updated by Radoslaw Zarzynski 10 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Radoslaw Zarzynski 10 months ago
- Backport changed from pacific, quincy to pacific,quincy,reef
Updated by Backport Bot 10 months ago
- Copied to Backport #61740: pacific: no active mgr after ~1 hour added
Updated by Backport Bot 10 months ago
- Copied to Backport #61741: quincy: no active mgr after ~1 hour added
Updated by Backport Bot 10 months ago
- Copied to Backport #61742: reef: no active mgr after ~1 hour added