Fix #3188
osd: close read hole
Description
a client and a now-marked-down osd with an old map may continue to service reads even after a new primary has taken over, so those reads can return stale data.
the solution probably goes something like this:
- if the primary does not hear from the replicas (via the heartbeats) in a heartbeat_grace period, it will stop servicing reads.
- any new primary already contacts 'up' osds, but ignores down osds. modify this behavior to also probe 'down' osds to make sure they are down. if that is successful, go active immediately. if not, wait until the heartbeat_grace period has expired to be sure the old primary is no longer servicing reads.
- as a refinement of the above, we go active immediately and only delay writes until the timer expires; reads are of course safe as no data has changed.
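A minimal sketch of the read-gating idea on the primary. All names here (PrimaryReadGate, HEARTBEAT_GRACE, the clock plumbing) are illustrative stand-ins, not the actual Ceph implementation:

```python
import time

HEARTBEAT_GRACE = 20.0  # seconds; hypothetical stand-in for the heartbeat_grace config

class PrimaryReadGate:
    """Track the most recent heartbeat ack from each replica and refuse
    to service reads once any replica has been silent past the grace period."""

    def __init__(self, replicas, grace=HEARTBEAT_GRACE, clock=time.monotonic):
        self.clock = clock
        self.grace = grace
        now = clock()
        self.last_ack = {r: now for r in replicas}

    def on_hb_ack(self, replica):
        # Called when a heartbeat ack arrives from a replica.
        self.last_ack[replica] = self.clock()

    def reads_allowed(self):
        # Stop servicing reads if *any* replica hasn't acked within grace.
        now = self.clock()
        return all(now - t < self.grace for t in self.last_ack.values())
```

The gate is conservative: a single silent replica blocks all reads on the primary, which is exactly what lets a new primary bound how long the old one could still have been serving reads.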
Updated by Ian Colle over 11 years ago
- Tracker changed from Bug to Feature
- Priority changed from High to Normal
Updated by Ian Colle almost 11 years ago
- Story points set to 13.00
Updated by Sage Weil almost 11 years ago
- Status changed from New to 12
pushed wip-osd-readhole with some old incomplete work on this. here's a brain dump of where my thinking is/was on this.
the basic idea is that the primary will stop servicing reads if it hasn't gotten a ping ack from its replicas in heartbeat_interval seconds. if the pg mapping changes but no osds go down, this is not really a problem; the peering messages that get exchanged ensure the peer has the latest map. but in the case that an osd goes down, we want to know that the peer osd saw that map, or, failing that, the best upper bound on when it could have received its last replica ack.
1. block reads after heartbeat_interval seconds without a replica ack.
2. after peering, wait heartbeat_interval seconds after our upper bound on when the last interval ended. initially assume this is now. that is technically sufficient to close the hole, but will introduce long delays each time a new peering interval starts.
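Step 2 reduces to a single comparison; with no better bound, the upper bound on when the last interval ended defaults to "now", which is what imposes the full heartbeat_interval delay at every new interval. The function name and parameters below are illustrative:

```python
def may_serve_reads(now, interval_end_upper_bound, heartbeat_interval):
    """After peering, the new primary may serve reads only once
    heartbeat_interval seconds have passed since the latest moment the
    previous interval could still have been servicing reads.

    interval_end_upper_bound: best known bound on when the last interval
    ended; absent better information, set this to the time peering finished.
    """
    return now >= interval_end_upper_bound + heartbeat_interval
```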
down osds are the culprit. There are 3 cases to consider:
1. normal osd failure -- the last ack sent to the failed osd by old replicas needs to be communicated to the new primary.
- keep track of when last hb was acked for all hb peers
- share that with the primary during peering.
- make primary use that information to build a tighter upper bound on the last ack the old primary could have received.
2. osd marks itself down -- new primaries need to know that it knew it was going down.
- mark this in the osdmap somehow?
3. 'ceph osd down NNN' or wrongly marked down.
- after we are (wrongly) marked down, keep answering pings on the old hb interface for heartbeat_interval seconds.
- in hb acks, share our min(osdmap epoch) across pgs
- make replicas share the old primary's min_pg_epoch_consumed value with new primaries?
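The bounds in cases 1 and 3 can be sketched as follows. last_ack_sent and min_pg_epoch_consumed are hypothetical names for the values described above, not actual Ceph data structures:

```python
def old_primary_read_cutoff(last_ack_sent, heartbeat_interval):
    """Case 1: upper bound on when the failed primary last serviced a read.

    last_ack_sent maps replica -> timestamp of the last hb ack that replica
    sent to the old primary (gathered during peering). since the old primary
    blocks reads once *any* replica's ack is older than heartbeat_interval,
    it must have stopped by the earliest (last ack + interval) across replicas.
    """
    return min(last_ack_sent.values()) + heartbeat_interval


def old_primary_saw_down_map(min_pg_epoch_consumed, down_epoch):
    """Case 3: if every pg on the (wrongly) down osd has consumed the osdmap
    epoch that marked it down, it is no longer acting as primary, and the
    new primary can go active without waiting out the grace period.

    min_pg_epoch_consumed maps pg -> last osdmap epoch that pg consumed,
    as reported by the old primary in its hb acks.
    """
    return min(min_pg_epoch_consumed.values()) >= down_epoch
```

Both helpers feed the same decision: either go active immediately because the old primary provably stopped, or wait until the cutoff plus heartbeat_interval has passed.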
unfortunately there are lots of different threads here, all aiming to build a tighter upper bound on when the down osd must have stopped processing reads.
Updated by Samuel Just almost 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Samuel Just