Bug #10930

mon: map_cache can become inaccurate if osd does not receive the osdmaps

Added by Sage Weil almost 8 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We update map_cache when we send the map, but the OSD session may reset before the OSD receives it. After that, the OSD never gets the map.

Observed this preventing an OSD from booting (it was stuck asking for maps) on firefly 0.80.8-25-g86d3835.
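The failure mode described above can be illustrated with a toy model. This is only a sketch, not Ceph code: `ToyMonitor`, `ToyOSD`, `handle_subscribe`, and the `session_alive` flag are hypothetical names, and the only behavior taken from the report is that the cache is updated at send time regardless of delivery.

```python
class ToyOSD:
    """Hypothetical stand-in for an OSD; tracks the newest epoch it received."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.epoch = 0


class ToyMonitor:
    """Hypothetical stand-in for the monitor side of the bug."""

    def __init__(self, current_epoch):
        self.current_epoch = current_epoch
        # OSD id -> last epoch the monitor believes it sent to that OSD.
        self.map_cache = {}

    def handle_subscribe(self, osd, session_alive):
        """Return True if the monitor attempted to send maps."""
        last_sent = self.map_cache.get(osd.osd_id, 0)
        if last_sent >= self.current_epoch:
            # The monitor thinks the OSD is up to date and ignores the request.
            return False
        # The cache is updated when the map is sent, not when it is received.
        self.map_cache[osd.osd_id] = self.current_epoch
        if session_alive:
            osd.epoch = self.current_epoch
        return True


mon = ToyMonitor(current_epoch=10)
osd = ToyOSD(osd_id=3)

# The session resets while the map is in flight, so the OSD never sees it.
first_try = mon.handle_subscribe(osd, session_alive=False)
# The OSD reconnects and asks again, but the stale cache suppresses the send.
retry = mon.handle_subscribe(osd, session_alive=True)
```

In this model the OSD stays at epoch 0 however often it asks, which matches the "stuck asking for maps" symptom.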


Related issues

Copied to Ceph - Backport #12835: mon: map_cache can become inaccurate if osd does not receive the osdmaps Resolved

Associated revisions

Revision c05753ea (diff)
Added by Kefu Chai over 7 years ago

mon: track osd_epoch in MonSession

  • remove osd_epoch<osd, epoch> from OSDMonitor
  • add osd_epoch to MonSession to track the latest osdmap epoch
    OSDMonitor sends to a mon client
  • do not remove osd_epoch entries if an OSD is down, or
    max_osd > osd_id

Fixes: #10930
Signed-off-by: Kefu Chai <>

Revision cc7da674 (diff)
Added by Kefu Chai about 7 years ago

mon: track osd_epoch in MonSession

  • remove osd_epoch<osd, epoch> from OSDMonitor
  • add osd_epoch to MonSession to track the latest osdmap epoch
    OSDMonitor sends to a mon client
  • do not remove osd_epoch entries if an OSD is down, or
    max_osd > osd_id

Fixes: #10930
Signed-off-by: Kefu Chai <>
(cherry picked from commit c05753eacc26e90b2e3b56e641a71bffd5b39bd0)

History

#1 Updated by Samuel Just almost 8 years ago

Add a debug option to probabilistically (or every 5th time?) drop the message and force the monclient to reconnect.
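A minimal sketch of what such a drop decision could look like, covering both variants mentioned (a probability and "every Nth message"). `DropInjector` and its parameters are hypothetical and do not correspond to any existing Ceph option. The sketch only models the decision to drop, not the forced reconnect.

```python
import random


class DropInjector:
    """Hypothetical fault injector deciding whether to drop a message."""

    def __init__(self, probability=0.0, every_nth=0, seed=None):
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be within [0, 1]")
        if every_nth < 0:
            raise ValueError("every_nth must be non-negative")
        self.probability = probability
        self.every_nth = every_nth  # 0 disables the deterministic mode
        self.seen = 0
        self.rng = random.Random(seed)

    def should_drop(self):
        """Count the message and report whether it should be dropped."""
        self.seen += 1
        if self.every_nth and self.seen % self.every_nth == 0:
            return True
        # random() returns a value in [0.0, 1.0), so 0.0 never drops
        # and 1.0 always drops.
        return self.rng.random() < self.probability


injector = DropInjector(every_nth=5)
decisions = [injector.should_drop() for _ in range(10)]
```

The deterministic mode makes a test reproducible, while the probabilistic mode with a fixed `seed` gives varied but repeatable drop patterns.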

#2 Updated by Sage Weil over 7 years ago

  • Assignee changed from Sage Weil to Kefu Chai
  • Regression set to No

#3 Updated by Kefu Chai over 7 years ago

Observed this preventing an osd from booting (it was stuck asking for maps).

If an OSD is booting and the monclient session gets reset, so that either the get_osdmap message or its reply is dropped, the mon client should call reopen_session() and re-authorize itself. OSD::ms_handle_connect() is called when the monc reconnects to the monitor, and there the OSD will start_boot() since it is_booting(). In start_boot(), the OSD sends another query to the monitor for osdmaps, so the osdmap will be received anyway.

So I think I must be missing something here. Sam and Sage, could you shed some light on it? Thanks!

#4 Updated by Sage Weil over 7 years ago

Kefu Chai wrote:

Observed this preventing an osd from booting (it was stuck asking for maps).

If an OSD is booting and the monclient session gets reset, so that either the get_osdmap message or its reply is dropped, the mon client should call reopen_session() and re-authorize itself. OSD::ms_handle_connect() is called when the monc reconnects to the monitor, and there the OSD will start_boot() since it is_booting(). In start_boot(), the OSD sends another query to the monitor for osdmaps, so the osdmap will be received anyway.

... except that the osd_epoch cache on the mon will still say that the OSD already got the OSDMap (which it didn't), so the subscribe will be ignored. See check_subs() -> send_incremental().

I think the fix is to destroy the osd_epoch map<> and move that information into the MonSession, so that when the connection is lost the state gets cleaned up.
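The idea of tying the epoch to the session can be sketched as follows. This is a toy model under stated assumptions, not the actual patch: `ToySession`, `ToyMonitor`, the connection ids, and the method names are hypothetical. Only the `osd_epoch` field name and the principle that losing the connection discards the state come from the discussion.

```python
class ToySession:
    """Hypothetical per-connection state; discarded when the connection is lost."""

    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.osd_epoch = 0  # latest epoch sent over this session


class ToyMonitor:
    """Hypothetical monitor that keeps no per-OSD epoch map of its own."""

    def __init__(self, current_epoch):
        self.current_epoch = current_epoch
        self.sessions = {}  # connection id -> ToySession

    def open_session(self, conn_id, osd_id):
        self.sessions[conn_id] = ToySession(osd_id)

    def reset_session(self, conn_id):
        # Dropping the session also drops the record of what was sent.
        self.sessions.pop(conn_id, None)

    def handle_subscribe(self, conn_id):
        """Return the epochs sent in response to a subscribe."""
        session = self.sessions[conn_id]
        if session.osd_epoch >= self.current_epoch:
            return []
        epochs = list(range(session.osd_epoch + 1, self.current_epoch + 1))
        session.osd_epoch = self.current_epoch
        return epochs


mon = ToyMonitor(current_epoch=3)
mon.open_session("conn-1", osd_id=7)
lost = mon.handle_subscribe("conn-1")    # sent, but assume it never arrives
mon.reset_session("conn-1")              # connection lost, state cleaned up
mon.open_session("conn-2", osd_id=7)     # the OSD reconnects
resent = mon.handle_subscribe("conn-2")  # the fresh session triggers a resend
```

Because the new session starts with `osd_epoch` at 0, the second subscribe is answered in full instead of being ignored.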

#5 Updated by Kefu Chai over 7 years ago

thanks sage, i got it now =)

#6 Updated by Kefu Chai over 7 years ago

  • Status changed from New to In Progress

#7 Updated by Kefu Chai over 7 years ago

  • Status changed from In Progress to Fix Under Review

#8 Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer,firefly

#9 Updated by Nathan Cutler over 7 years ago

  • Backport changed from hammer,firefly to hammer

Dropping firefly backport because it is non-trivial and the cost-benefit analysis is unclear (firefly is getting long in the tooth).

#10 Updated by Loïc Dachary almost 7 years ago

  • Status changed from Pending Backport to Resolved
