Bug #10930
closed
mon: map_cache can become inaccurate if osd does not receive the osdmaps
Added by Sage Weil about 9 years ago.
Updated about 8 years ago.
Description
We update map_cache when we send the map, but the OSD session may reset before the OSD receives it. After that, the OSD never gets the map.
Observed this preventing an OSD from booting (it was stuck asking for maps) on firefly 0.80.8-25-g86d3835.
Add a debug option to probabilistically (or every 5th time?) drop the message and force the monclient to reconnect.
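A minimal sketch of what such a debug hook could look like, covering both the "every 5th time" and the probabilistic variant. DropInjector and should_drop() are hypothetical names for illustration only; they are not existing Ceph classes or config options, and the real option would have to be wired into the messenger or monitor send path.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical fault injector (illustration only, not a Ceph class).
class DropInjector {
public:
  // every_n == 0 disables the deterministic mode.
  // probability is the chance of dropping any single message, in [0, 1].
  DropInjector(uint64_t every_n, double probability, uint64_t seed = 0)
    : every_n_(every_n), dist_(probability), rng_(seed) {
    assert(probability >= 0.0 && probability <= 1.0);
  }

  // Returns true when the caller should drop this message (and reset the
  // session) instead of delivering it.
  bool should_drop() {
    ++count_;
    if (every_n_ != 0 && count_ % every_n_ == 0)
      return true;          // deterministic: every Nth message
    return dist_(rng_);     // probabilistic: Bernoulli trial
  }

private:
  uint64_t every_n_;
  uint64_t count_ = 0;
  std::bernoulli_distribution dist_;
  std::mt19937_64 rng_;
};
```

With DropInjector(5, 0.0) exactly every 5th message is dropped, which makes the lost-map case reproducible; a nonzero probability gives the randomized variant.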
- Assignee changed from Sage Weil to Kefu Chai
- Regression set to No
Observed this preventing an osd from booting (it was stuck asking for maps).
If an OSD is booting and the monclient session gets reset, so that either the get_osdmap message or its reply is dropped, the mon client should call reopen_session() and re-authorize itself. OSD::ms_handle_connect() is then called when the monc reconnects to the monitor; there, the OSD calls start_boot() because it is_booting(). In start_boot(), the OSD sends another query to the monitor for osdmaps, so the osdmap will be received anyway.
So I think I must be missing something here. Sam and Sage, could you shed some light on it? Thanks!
Kefu Chai wrote:
Observed this preventing an osd from booting (it was stuck asking for maps).
If an OSD is booting and the monclient session gets reset, so that either the get_osdmap message or its reply is dropped, the mon client should call reopen_session() and re-authorize itself. OSD::ms_handle_connect() is then called when the monc reconnects to the monitor; there, the OSD calls start_boot() because it is_booting(). In start_boot(), the OSD sends another query to the monitor for osdmaps, so the osdmap will be received anyway.
... except that the osd_epoch cache on the mon will still say that the OSD already got the OSDMap (which it didn't), so the subscribe will be ignored. See check_subs() -> send_incremental().
I think the fix is to destroy the osd_epoch map<> and move that information into the MonSession, so that when the connection is lost the state gets cleaned up.
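The difference between the two approaches can be sketched with a simplified model. This is an illustration under stated assumptions, not the monitor's actual code: MonState, Session, send_with_global_cache() and send_with_session_state() are hypothetical stand-ins, and each function only returns how many epochs would be sent rather than sending anything.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using epoch_t = uint32_t;

// Hypothetical, heavily reduced monitor state.
struct MonState {
  epoch_t current = 0;               // newest OSDMap epoch on the mon
  std::map<int, epoch_t> osd_epoch;  // global cache: survives session resets
};

// Hypothetical per-connection state; a new connection gets a fresh one.
struct Session {
  epoch_t sent_epoch = 0;            // newest epoch sent on this connection
};

// Problematic variant: trusts the global cache, which is updated at send
// time, before delivery is known. Returns the number of epochs sent.
epoch_t send_with_global_cache(MonState& mon, int osd, epoch_t first) {
  auto it = mon.osd_epoch.find(osd);
  if (it != mon.osd_epoch.end() && first <= it->second)
    first = it->second + 1;          // assume the OSD already has these
  if (first > mon.current)
    return 0;                        // nothing "new": the request is ignored
  mon.osd_epoch[osd] = mon.current;
  return mon.current - first + 1;
}

// Proposed direction: keep the same bookkeeping in the session, so it is
// discarded together with the connection.
epoch_t send_with_session_state(const MonState& mon, Session& s, epoch_t first) {
  if (first <= s.sent_epoch)
    first = s.sent_epoch + 1;
  if (first > mon.current)
    return 0;
  s.sent_epoch = mon.current;
  return mon.current - first + 1;
}

// Scenario from this ticket: the first reply is lost, then the OSD
// reconnects and asks for the same range again.
void lost_reply_scenario() {
  MonState mon;
  mon.current = 10;

  assert(send_with_global_cache(mon, 3, 5) == 6u);  // sent, never delivered
  assert(send_with_global_cache(mon, 3, 5) == 0u);  // resubscribe ignored

  Session before_reset, after_reset;
  assert(send_with_session_state(mon, before_reset, 5) == 6u);  // lost again
  assert(send_with_session_state(mon, after_reset, 5) == 6u);   // resent
}
```

In the global-cache variant the second request returns 0 because the cache still claims the OSD has epoch 10; with per-session state the reconnect starts from a clean Session and the maps are sent again.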
thanks sage, i got it now =)
- Status changed from New to In Progress
- Status changed from In Progress to Fix Under Review
- Status changed from Fix Under Review to Pending Backport
- Backport set to hammer,firefly
- Backport changed from hammer,firefly to hammer
Dropping firefly backport because it is non-trivial and the cost-benefit analysis is unclear (firefly is getting long in the tooth).
- Status changed from Pending Backport to Resolved