Bug #4543

mon: corrupted store if monitor dies mid-sync

Added by Joao Eduardo Luis about 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee: Joao Eduardo Luis
Category: Monitor
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This came to my attention after spending some time figuring out what rzerre's (@ #ceph) issue with the monitor was.

It's still just a suspicion, and I will update the bug with further information once I sort this out, but it looks like there's a fair chance of store corruption if the monitor dies, crashes or is shut down in the middle of a sync.

More details to follow asap.

Details:

If a sync fails, then the next time the monitor inits we wipe out the store's contents for all services and Paxos, and restart the sync. If this second sync is aborted, or the monitor is killed or crashes, then the services (and Paxos) will either have no data at all (worst case scenario) or end up with a partial state (the most likely scenario when the user intervenes).

As a side effect, we may eventually end up without any of the monmap's versions -- and without a monmap we will be unable to start, let alone find the rest of the cluster to sync from the next time.
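
The failure mode above can be illustrated with a toy model. This is only a sketch, not Ceph's store or sync code: the store is assumed to be a plain dict keyed by (prefix, key), and the names `wipe_for_sync`, `apply_chunk` and `can_start` are hypothetical.

```python
# Toy model of the failure mode; not Ceph's actual store or sync code.
# All names here (wipe_for_sync, apply_chunk, can_start) are hypothetical.

def wipe_for_sync(store):
    """Drop every service's and Paxos's keys, as done before a restarted sync."""
    store.clear()

def apply_chunk(store, chunk):
    """Apply one chunk of synced key/value pairs received from a peer."""
    store.update(chunk)

def can_start(store):
    """Assume a monitor needs at least one monmap version to start."""
    return any(prefix == "monmap" for prefix, _ in store)

# A healthy store before the sync begins.
store = {
    ("monmap", "1"): b"monmap-v1",
    ("paxos", "10"): b"txn-10",
    ("osdmap", "5"): b"osdmap-v5",
}

wipe_for_sync(store)
# Only part of the state arrives before the monitor is killed.
apply_chunk(store, {("paxos", "11"): b"txn-11"})

# On restart the store holds partial state and no monmap.
startable_after_crash = can_start(store)
```

In this model the store is left with a single Paxos entry and no monmap, which is the partial state that prevents the monitor from starting.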

There are two possible solutions:

  • Take a more conservative approach to how we sync: instead of cleaning up the whole store, keep whatever we have, apply whatever the other monitors send us, and at the end check that everything is okay and clean up whatever we no longer need. This has the drawback of adding extra work to the sync, and depending on the checks we make we may fall behind yet again (leading to a new sync and so forth).
  • Or we can back up the monmap before starting a sync. This approach would be the most straightforward: back up the monmap, wipe the store, sync, and once the sync is successful get rid of the backed-up monmap (as we would then have synchronized the monmaps from the cluster).
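
The second option could look roughly like the sketch below. It makes several assumptions: the store is modeled as a plain dict keyed by (prefix, key), and `start_sync`, `finish_sync`, `monmap_on_init` and the `mon_sync` backup key are hypothetical names chosen for illustration. The fix proposed in wip-4543 may be structured differently.

```python
# Sketch of "back up the monmap, wipe, sync, drop the backup".
# Hypothetical names throughout; the actual fix may be structured differently.

BACKUP_KEY = ("mon_sync", "latest_monmap")  # assumed location for the backup

def latest_monmap(store):
    """Return the newest committed monmap, or None if there is none."""
    versions = [int(k) for (p, k) in store if p == "monmap"]
    return store[("monmap", str(max(versions)))] if versions else None

def start_sync(store):
    """Save the newest monmap, then wipe everything except the backup."""
    monmap = latest_monmap(store)
    if monmap is not None:
        # Keep an older backup if a previous sync already wiped the monmaps.
        store[BACKUP_KEY] = monmap
    for key in [k for k in store if k != BACKUP_KEY]:
        del store[key]

def finish_sync(store):
    """The sync completed, so the synced monmaps supersede the backup."""
    store.pop(BACKUP_KEY, None)

def monmap_on_init(store):
    """Prefer a committed monmap; fall back to the backup after a failed sync."""
    return latest_monmap(store) or store.get(BACKUP_KEY)

store = {
    ("monmap", "1"): b"monmap-v1",
    ("monmap", "2"): b"monmap-v2",
    ("paxos", "10"): b"txn-10",
}
start_sync(store)
store[("paxos", "11")] = b"txn-11"   # partial sync, then the monitor dies
recovered = monmap_on_init(store)    # the backup is still available
```

In a real store the backup and the wipe would presumably need to be applied atomically (for example in a single transaction), so that a crash between the two steps cannot lose the monmap; the dict model does not capture that.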

Associated revisions

Revision c200cdb0
Added by Gregory Farnum almost 11 years ago

Merge pull request #225 from ceph/wip-4543

Fixes #4543

Reviewed-by: Greg Farnum <>

History

#1 Updated by Sage Weil almost 11 years ago

  • Priority changed from High to Urgent

#2 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Need More Info

#3 Updated by Ian Colle almost 11 years ago

Joao - could you please provide the additional info you promised "asap"?

#4 Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)
  • Status changed from Need More Info to 4

Updated the original description with further details.

#5 Updated by Joao Eduardo Luis almost 11 years ago

  • Description updated (diff)

#6 Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from 4 to Fix Under Review

wip-4543 has a proposed fix -- haven't tested it yet.

#7 Updated by Greg Farnum almost 11 years ago

  • Status changed from Fix Under Review to In Progress

Comments on GitHub; this is one that we'll definitely need to test before merging.

#8 Updated by Joao Eduardo Luis almost 11 years ago

  • Status changed from In Progress to Fix Under Review

Revised version and comments on github.

#9 Updated by Greg Farnum almost 11 years ago

New comments; should be quick to address; have you tested it?

#10 Updated by Greg Farnum almost 11 years ago

  • Status changed from Fix Under Review to Need More Info

#11 Updated by Greg Farnum almost 11 years ago

  • Status changed from Need More Info to In Progress

Whoops, wrong one before.

#12 Updated by Greg Farnum almost 11 years ago

  • Status changed from In Progress to Resolved

commit: c200cdb08108ae901c4c6f3625d55da707a38e5a
