Fix #3188
osd: close read hole
Description
a client and a now-marked-down osd with an old map may continue to service reads even after a new primary has taken over, so those reads can return stale data.
the solution probably goes something like this:
- if the primary does not hear from the replicas (via the heartbeats) in a heartbeat_grace period, it will stop servicing reads.
- any new primary already contacts 'up' osds, but ignores down osds. modify this behavior to also probe 'down' osds to make sure they are down. if that is successful, go active immediately. if not, wait until the heartbeat_grace period has expired to be sure the old primary is no longer servicing reads.
- as a refinement of the above, we go active immediately and only delay writes until the timer expires; reads are of course safe as no data has changed.
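A minimal sketch of the read-gating idea on the primary. All names here (PrimaryReadGate, HEARTBEAT_GRACE, the clock plumbing) are illustrative stand-ins, not the actual Ceph implementation:

```python
import time

HEARTBEAT_GRACE = 20.0  # seconds; hypothetical stand-in for the heartbeat_grace config

class PrimaryReadGate:
    """Track the most recent heartbeat ack from each replica and refuse
    to service reads once any replica has been silent past the grace period."""

    def __init__(self, replicas, grace=HEARTBEAT_GRACE, clock=time.monotonic):
        self.clock = clock
        self.grace = grace
        now = clock()
        self.last_ack = {r: now for r in replicas}

    def on_hb_ack(self, replica):
        # Called when a heartbeat ack arrives from a replica.
        self.last_ack[replica] = self.clock()

    def reads_allowed(self):
        # Stop servicing reads if *any* replica hasn't acked within grace.
        now = self.clock()
        return all(now - t < self.grace for t in self.last_ack.values())
```

The gate is conservative: a single silent replica blocks all reads on the primary, which is exactly what lets a new primary bound how long the old one could still have been serving reads.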
Updated by Ian Colle over 11 years ago
- Tracker changed from Bug to Feature
- Priority changed from High to Normal
Updated by Ian Colle almost 11 years ago
- Story points set to 13.00
Updated by Sage Weil almost 11 years ago
- Status changed from New to 12
pushed wip-osd-readhole with some old incomplete work on this. here's a brain dump of where my thinking is/was on this.
the basic idea is that the primary will stop servicing reads if it hasn't gotten a ping ack from its replicas in heartbeat_interval seconds. if the pg mapping changes but no osds go down, this is not really a problem; the peering messages that get exchanged ensure the peer has the latest map. but in the case that an osd goes down, we want to know that the peer osd saw that map, or, failing that, the best upper bound on when it could have received its last replica ack.
1. block reads after heartbeat_interval seconds without a replica ack.
2. after peering, wait heartbeat_interval seconds after our upper bound on when the last interval ended. initially assume this is now. that is technically sufficient to close the hole, but will introduce long delays each time a new peering interval starts.
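Step 2 reduces to a single comparison; with no better bound, the upper bound on when the last interval ended defaults to "now", which is what imposes the full heartbeat_interval delay at every new interval. The function name and parameters below are illustrative:

```python
def may_serve_reads(now, interval_end_upper_bound, heartbeat_interval):
    """After peering, the new primary may serve reads only once
    heartbeat_interval seconds have passed since the latest moment the
    previous interval could still have been servicing reads.

    interval_end_upper_bound: best known bound on when the last interval
    ended; absent better information, set this to the time peering finished.
    """
    return now >= interval_end_upper_bound + heartbeat_interval
```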
down osds are the culprit. There are 3 cases to consider:
1. normal osd failure -- the last ack sent to the failed osd by old replicas needs to be communicated to the new primary.
- keep track of when last hb was acked for all hb peers
- share that with the primary during peering.
- make primary use that information to build a tighter upper bound on the last ack the old primary could have received.
2. osd marks itself down -- new primaries need to know that it knew it was going down.
- mark this in the osdmap somehow?
3. 'ceph osd down NNN' or wrongly marked down.
- after we are (wrongly) marked down, keep answering pings on the old hb interface for heartbeat_interval seconds.
- in hb acks, share our min(osdmap epoch) across pgs
- make replicas share the old primary's min_pg_epoch_consumed value with new primaries?
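The bounds in cases 1 and 3 can be sketched as follows. last_ack_sent and min_pg_epoch_consumed are hypothetical names for the values described above, not actual Ceph data structures:

```python
def old_primary_read_cutoff(last_ack_sent, heartbeat_interval):
    """Case 1: upper bound on when the failed primary last serviced a read.

    last_ack_sent maps replica -> timestamp of the last hb ack that replica
    sent to the old primary (gathered during peering). since the old primary
    blocks reads once *any* replica's ack is older than heartbeat_interval,
    it must have stopped by the earliest (last ack + interval) across replicas.
    """
    return min(last_ack_sent.values()) + heartbeat_interval


def old_primary_saw_down_map(min_pg_epoch_consumed, down_epoch):
    """Case 3: if every pg on the (wrongly) down osd has consumed the osdmap
    epoch that marked it down, it is no longer acting as primary, and the
    new primary can go active without waiting out the grace period.

    min_pg_epoch_consumed maps pg -> last osdmap epoch that pg consumed,
    as reported by the old primary in its hb acks.
    """
    return min(min_pg_epoch_consumed.values()) >= down_epoch
```

Both helpers feed the same decision: either go active immediately because the old primary provably stopped, or wait until the cutoff plus heartbeat_interval has passed.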
unfortunately there are lots of different threads here, all aiming to build a tighter upper bound on when the down osd must have stopped processing reads.
Updated by Samuel Just almost 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Samuel Just