Bug #10930

closed

mon: map_cache can become inaccurate if osd does not receive the osdmaps

Added by Sage Weil about 9 years ago. Updated about 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We update map_cache when we send the map, but the osd session may reset before the osd receives it. After that, the mon believes the osd already has the map, so the osd never gets it.

Observed this preventing an osd from booting (it was stuck asking for maps), on firefly 0.80.8-25-g86d3835.
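The failure mode described above can be illustrated with a small, self-contained sketch. This is not Ceph code: `CachingMon`, `ToyOsd`, `sent_epoch`, and `maybe_send()` are hypothetical names invented for illustration, and the `link_up` flag simply models whether the session survives long enough to deliver the message.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using epoch_t = std::uint32_t;

// Hypothetical stand-in for an OSD: only tracks the newest map epoch it holds.
struct ToyOsd {
  int id = 0;
  epoch_t have = 0;
};

// Hypothetical stand-in for the monitor. The per-OSD cache lives in the
// monitor itself, so it outlives any individual session.
struct CachingMon {
  epoch_t latest = 0;                 // newest osdmap epoch the mon knows
  std::map<int, epoch_t> sent_epoch;  // OSD id -> last epoch we *sent*

  // Returns true if a map was put on the wire for this OSD.
  bool maybe_send(ToyOsd& osd, bool link_up) {
    auto it = sent_epoch.find(osd.id);
    if (it != sent_epoch.end() && it->second >= latest)
      return false;                   // cache claims the OSD is up to date
    sent_epoch[osd.id] = latest;      // recorded at send time, not on receipt
    if (link_up)
      osd.have = latest;              // delivery only if the session survived
    return true;
  }
};
```

In this toy model, one lost message is enough to wedge the OSD: the first call records the epoch as sent, and every later request is skipped even though the OSD still holds epoch 0.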


Related issues 1 (0 open, 1 closed)

Copied to Ceph - Backport #12835: mon: map_cache can become inaccurate if osd does not receive the osdmaps (Resolved, Kefu Chai)

Actions #1

Updated by Samuel Just about 9 years ago

Add a debug option to probabilistically (or every 5th time?) drop the message and force the monclient to reconnect.
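A minimal sketch of what such a fault injector could look like, covering both the probabilistic and the every-Nth variants. The class name `DropInjector` and its parameters are hypothetical and do not correspond to any existing Ceph config option; the caller would be expected to drop the message and reset the session whenever `should_drop()` returns true.

```cpp
#include <cassert>
#include <random>

// Hypothetical fault injector, for illustration only.
class DropInjector {
 public:
  // probability: chance in [0, 1] of dropping any given message.
  // every_n:     additionally drop every Nth message; 0 disables this rule.
  // seed:        fixed seed so that test runs are reproducible.
  DropInjector(double probability, unsigned every_n, unsigned seed = 0)
      : probability_(probability), every_n_(every_n), rng_(seed) {}

  // Returns true if the caller should drop the current message.
  bool should_drop() {
    ++count_;
    if (every_n_ != 0 && count_ % every_n_ == 0)
      return true;
    return std::bernoulli_distribution(probability_)(rng_);
  }

 private:
  double probability_;
  unsigned every_n_;
  unsigned long count_ = 0;  // messages seen so far
  std::mt19937 rng_;
};
```

With `DropInjector(0.0, 5)` only every 5th message is dropped, which makes a test deterministic; a non-zero probability gives the randomized variant.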

Actions #2

Updated by Sage Weil almost 9 years ago

  • Assignee changed from Sage Weil to Kefu Chai
  • Regression set to No
Actions #3

Updated by Kefu Chai almost 9 years ago

Observed this preventing an osd from booting (it was stuck asking for maps).

If an OSD is booting but the monclient session gets reset, and either the get_osdmap message or its reply is dropped, the mon client should try to reopen_session() and re-authorize itself. OSD::ms_handle_connect() will then get called when the monc reconnects to the monitor, and there the OSD will start_boot() since it is_booting(). In start_boot(), the OSD will send another query to the monitor for osdmaps, so the osdmap will be received anyway.

So I think I must be missing something here. Sam and Sage, could you shed some light on it? Thanks!

Actions #4

Updated by Sage Weil almost 9 years ago

Kefu Chai wrote:

Observed this preventing an osd from booting (it was stuck asking for maps).

If an OSD is booting but the monclient session gets reset, and either the get_osdmap message or its reply is dropped, the mon client should try to reopen_session() and re-authorize itself. OSD::ms_handle_connect() will then get called when the monc reconnects to the monitor, and there the OSD will start_boot() since it is_booting(). In start_boot(), the OSD will send another query to the monitor for osdmaps, so the osdmap will be received anyway.

... except that the osd_epoch cache on the mon will still say that the osd already got the OSDMap (which it didn't), so the subscribe will be ignored. See check_subs() -> send_incremental().

I think the fix is to destroy the osd_epoch map<> and move that information into the MonSession, so that when the connection is lost the state gets cleaned up.
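The idea of tying the "last epoch sent" state to the session lifetime can be sketched as follows. This is a toy model under stated assumptions, not the actual patch: `SessionMon`, `ToySession`, `connect()`, `reset()`, and `maybe_send()` are hypothetical names, and `link_up` again models whether the message is delivered.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using epoch_t = std::uint32_t;

// Hypothetical session object; it owns the "last epoch sent" state.
struct ToySession {
  epoch_t sent_epoch = 0;
};

// Hypothetical monitor that keeps no per-OSD epoch cache of its own.
struct SessionMon {
  epoch_t latest = 0;
  std::map<int, ToySession> sessions;  // OSD id -> live session

  // A new connection always starts with a fresh session.
  void connect(int osd) { sessions[osd] = ToySession(); }

  // Losing the connection discards the session and its state with it.
  void reset(int osd) { sessions.erase(osd); }

  // Returns true if a map was put on the wire. osd_have is updated only
  // when the message is actually delivered (link_up == true).
  bool maybe_send(int osd, bool link_up, epoch_t& osd_have) {
    auto it = sessions.find(osd);
    if (it == sessions.end())
      return false;                    // no session to send on
    if (it->second.sent_epoch >= latest)
      return false;                    // this session already sent it
    it->second.sent_epoch = latest;
    if (link_up)
      osd_have = latest;
    return true;
  }
};
```

Because the stale epoch disappears together with the session, a reconnecting OSD starts from a clean slate and its next request is answered.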

Actions #5

Updated by Kefu Chai over 8 years ago

thanks sage, i got it now =)

Actions #6

Updated by Kefu Chai over 8 years ago

  • Status changed from New to In Progress
Actions #7

Updated by Kefu Chai over 8 years ago

  • Status changed from In Progress to Fix Under Review
Actions #8

Updated by Kefu Chai over 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to hammer,firefly
Actions #9

Updated by Nathan Cutler over 8 years ago

  • Backport changed from hammer,firefly to hammer

Dropping firefly backport because it is non-trivial and the cost-benefit analysis is unclear (firefly is getting long in the tooth).

Actions #10

Updated by Loïc Dachary about 8 years ago

  • Status changed from Pending Backport to Resolved