Bug #8305: objecter, osd: pool overlay change should trigger op resend - Ceph - Ceph

Added by Sage Weil almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Category: OSD
Source: Q/A
Severity: 3 - minor

Description

If the client is sending ops a, b, c, d and receives a map that changes the overlay, ordering can break. For example:

- get map, overlay = cache
- ...
- send a to cache
- get map, overlay = none
- send b, c, d to base
- get reply b, c, d
- get reply a (redirect)

Instead, the OSD should discard ops sent before the last overlay change, and the client should resend them.
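
The trace above can be replayed with a toy model. This is only an illustrative sketch, not Ceph code: `SentOp` and `completion_order` are hypothetical names, and the model simply assumes that an op sent to a tier that is no longer the overlay gets a redirect and therefore completes after the ops that went straight to the current target.

```cpp
#include <string>
#include <vector>

// Hypothetical record of an op and the pool the client sent it to.
struct SentOp {
  std::string name;    // "a", "b", ...
  std::string target;  // pool chosen from the map the client had at send time
};

// Toy model (assumption): ops sent to the pool that is still the correct
// target complete first, in send order; ops sent elsewhere are redirected
// and complete afterwards.
std::vector<std::string> completion_order(const std::vector<SentOp>& sent,
                                          const std::string& current_target) {
  std::vector<std::string> direct;
  std::vector<std::string> redirected;
  for (const SentOp& op : sent) {
    if (op.target == current_target) {
      direct.push_back(op.name);
    } else {
      redirected.push_back(op.name);
    }
  }
  direct.insert(direct.end(), redirected.begin(), redirected.end());
  return direct;
}
```

Feeding it a sent to cache and b, c, d sent to base, with base as the current target, yields the completion order b, c, d, a, which is the reordering this bug describes.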

Updated by Sage Weil almost 10 years ago

I think cache mode changes will cause similar problems. Let's add a pg_pool_t epoch_t that indicates the last policy change (whether it is the overlay or cache_mode or whatever) and we will resend (client) or discard (server) based on that.
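
A minimal sketch of the client half of this idea, assuming the client remembers the map epoch at which each in-flight op was sent. The names `InflightOp`, `ops_to_resend`, and `last_policy_change` are hypothetical and are not taken from the Ceph source; `epoch_t` is declared locally as a 32-bit unsigned stand-in.

```cpp
#include <cstdint>
#include <vector>

using epoch_t = std::uint32_t;  // assumption: local stand-in for Ceph's epoch_t

// Hypothetical record of an op the client has sent but not yet completed.
struct InflightOp {
  int tid;             // illustrative transaction id
  epoch_t sent_epoch;  // map epoch the client had when it sent the op
};

// Hypothetical client-side check: collect the tids of ops that were sent
// before the pool's last policy change, in their original order, so that
// they can be resent.
std::vector<int> ops_to_resend(const std::vector<InflightOp>& inflight,
                               epoch_t last_policy_change) {
  std::vector<int> resend;
  for (const InflightOp& op : inflight) {
    if (op.sent_epoch < last_policy_change) {
      resend.push_back(op.tid);
    }
  }
  return resend;
}
```

The server would apply the same comparison to decide which ops to discard, so both sides agree on which ops are stale.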

Updated by Greg Farnum almost 10 years ago

I don't think this lets us handle arbitrary changes in the overlay system. Consider two clients a and b, a cache OSD, and a backing OSD. You can still get consistency issues if b and backing OSD see the overlay change before a and cache OSD do, while IO is in-progress.

Updated by Sage Weil almost 10 years ago

Greg Farnum wrote:

> I don't think this lets us handle arbitrary changes in the overlay system. Consider two clients a and b, a cache OSD, and a backing OSD. You can still get consistency issues if b and backing OSD see the overlay change before a and cache OSD do, while IO is in-progress.

If you are talking about going from overlay=cache with cache mode forward to no overlay, I think it is fine because the cache should be empty.

On the other hand, if we are going from no overlay to overlay=cache, the first write into the cache will trigger a promote, which will ensure the base OSD knows about the overlay change, so no read can occur after a write.

There are surely other combinations we haven't considered, but for now I'm primarily worried about the target use cases of adding and removing a writeback cache...

Updated by Sage Weil almost 10 years ago

Assignee set to Sage Weil

Updated by Sage Weil almost 10 years ago

Sage Weil wrote:

> I think cache mode changes will cause similar problems. Let's add a pg_pool_t epoch_t that indicates the last policy change (whether it is the overlay or cache_mode or whatever) and we will resend (client) or discard (server) based on that.

Perhaps even simpler (and more flexible) would be:

epoch_t last_force_interval;   ///< force a new interval from this epoch (including resent ops)

This makes the OSD and Objecter logic simple and reusable for other purposes.

Updated by Sage Weil almost 10 years ago

Status changed from New to In Progress

Updated by Sage Weil almost 10 years ago

Discussed in standup and decided on alternate approach:

epoch_t last_force_op_resend;  ///< last epoch in which we force clients to resend ops.

and a matching feature bit, so that old clients that don't understand this don't get their ops discarded.
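
The server-side rule implied here might look like the sketch below. Only the name `last_force_op_resend` comes from this thread. The feature-bit constant, the `ClientOp` struct, and the `can_discard_op` helper are hypothetical stand-ins chosen for illustration, and the bit value is arbitrary.

```cpp
#include <cstdint>

using epoch_t = std::uint32_t;  // assumption: local stand-in for Ceph's epoch_t

// Hypothetical feature bit advertised by clients that know how to resend.
constexpr std::uint64_t FEATURE_POOL_RESEND = 1ULL << 0;

// Hypothetical view of an incoming op as seen by the OSD.
struct ClientOp {
  epoch_t map_epoch;              // map epoch the client had when sending
  std::uint64_t client_features;  // feature bits negotiated on the connection
};

// Discard only if the op predates the forced-resend epoch AND the client
// advertises the feature, i.e. it is expected to resend on its own. Ops
// from old clients are never discarded by this rule.
inline bool can_discard_op(const ClientOp& op, epoch_t last_force_op_resend) {
  const bool client_will_resend =
      (op.client_features & FEATURE_POOL_RESEND) != 0;
  return client_will_resend && op.map_epoch < last_force_op_resend;
}
```

Gating on the feature bit is what keeps the change backward compatible: an old client would never resend, so dropping its op would leave the op waiting forever.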

Updated by Sage Weil almost 10 years ago

Status changed from In Progress to 7

Updated by Sage Weil almost 10 years ago

Status changed from 7 to Fix Under Review

Updated by Sage Weil almost 10 years ago

Status changed from Fix Under Review to 7

Updated by Sage Weil almost 10 years ago

Status changed from 7 to Pending Backport

Updated by Sage Weil almost 10 years ago

Status changed from Pending Backport to Resolved
