Bug #4543

Updated by Joao Eduardo Luis about 11 years ago

This came to my attention after spending some time figuring out what rzerre's (@ #ceph) issue with the monitor was. 

 It's still just a suspicion, and I will update the bug with further information once I sort this out, but it looks like there's a fair chance of store corruption if the monitor dies, crashes or is shut down in the middle of a sync. 

 -More details to follow asap.- 

 Details: 

 If a sync fails, then the next time the monitor inits we wipe out the store's contents for all services and Paxos, and then restart the sync. If this second sync is aborted, or the monitor is killed or crashes, then the services (and Paxos) will either have no data at all (the worst case) or end up with a partial state (the most likely scenario when the user intervenes). 

 As a side effect, we may eventually end up without any of the monmap's versions -- and without a monmap we will be unable to start, let alone find the rest of the cluster to sync from next time. 
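
 To make the failure mode concrete, here is a minimal sketch in Python. It is not the monitor's actual code: the store is modelled as a plain dict, and the key names (`monmap:1`, `paxos:10`, ...), `wipe_and_sync` and `SyncInterrupted` are all made up for illustration.

```python
class SyncInterrupted(Exception):
    """Stands in for the monitor being killed or crashing mid-sync."""


def wipe_and_sync(store, chunks, fail_after=None):
    """Wipe the whole store, then apply chunks received from the cluster.

    `fail_after` simulates the monitor dying after that many chunks.
    """
    store.clear()  # everything goes, including every monmap version
    for i, (key, value) in enumerate(chunks):
        if fail_after is not None and i == fail_after:
            raise SyncInterrupted(f"died after {i} chunk(s)")
        store[key] = value


# Hypothetical store contents before the sync.
store = {
    "monmap:1": "mon.a,mon.b,mon.c",
    "paxos:10": "old",
    "osdmap:5": "old",
}
# Hypothetical data the rest of the cluster would send; monmap comes last.
remote = [
    ("paxos:11", "new"),
    ("osdmap:6", "new"),
    ("monmap:2", "mon.a,mon.b,mon.c"),
]

try:
    wipe_and_sync(store, remote, fail_after=1)
except SyncInterrupted:
    pass

# Partial state, and no monmap left to find the cluster with.
has_monmap = any(key.startswith("monmap:") for key in store)
```

 After the interrupted run the store holds only the first chunk, and `has_monmap` is false, which is the "unable to start" case described above.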

 There are two possible solutions: 
  * Take a more conservative approach to how we sync: instead of cleaning up the whole store, keep whatever we have, add whatever they send us, and at the end check that everything is okay and clean up whatever we no longer need. The drawback is that this makes the sync more expensive, and depending on the checks we make we may fall behind yet again (leading to a new sync and so forth). 

  * Or we can back up the monmap before starting a sync. This approach would be the most straightforward: back up the monmap, wipe the store, sync, and once the sync is successful get rid of the backed-up monmap (as we would then have synchronized the monmaps from the cluster).
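
 The second option can be sketched roughly as follows, again in Python with the store modelled as a dict. This is only an illustration of the idea, not a proposed patch: `BACKUP_KEY`, the `monmap:<version>` key layout and all function names are assumptions, and a real store would need the backup write and the wipe to be applied atomically.

```python
BACKUP_KEY = "mon_sync:backup_monmap"  # hypothetical key, not a real Ceph name


class SyncInterrupted(Exception):
    """Stands in for the monitor being killed or crashing mid-sync."""


def latest_monmap(store):
    """Return the highest-versioned monmap in the store, or None."""
    versions = [k for k in store if k.startswith("monmap:")]
    if not versions:
        return None
    return store[max(versions, key=lambda k: int(k.split(":")[1]))]


def sync_with_backup(store, chunks, fail_after=None):
    # 1. Back up the monmap, reusing an earlier backup if a previous
    #    sync was interrupted before it received a monmap.
    monmap = latest_monmap(store)
    if monmap is None:
        monmap = store.get(BACKUP_KEY)
    if monmap is not None:
        store[BACKUP_KEY] = monmap
    # 2. Wipe everything except the backup.
    for key in [k for k in store if k != BACKUP_KEY]:
        del store[key]
    # 3. Sync; `fail_after` simulates dying after that many chunks.
    for i, (key, value) in enumerate(chunks):
        if fail_after is not None and i == fail_after:
            raise SyncInterrupted(f"died after {i} chunk(s)")
        store[key] = value
    # 4. Sync succeeded, so the backup is no longer needed.
    store.pop(BACKUP_KEY, None)


def monmap_for_startup(store):
    """Prefer a synced monmap; fall back to the backup."""
    return latest_monmap(store) or store.get(BACKUP_KEY)


store = {"monmap:1": "mon.a,mon.b,mon.c", "paxos:10": "old"}
remote = [("paxos:11", "new"), ("monmap:2", "mon.a,mon.b,mon.c,mon.d")]

try:
    sync_with_backup(store, remote, fail_after=1)  # dies before the monmap arrives
except SyncInterrupted:
    pass
recovered = monmap_for_startup(store)  # falls back to the backed-up monmap

sync_with_backup(store, remote)  # the retry runs to completion
final = monmap_for_startup(store)
```

 In this sketch the interrupted run still leaves a usable monmap behind (`recovered`), and the successful retry replaces it with the cluster's version and drops the backup.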
