Bug #19380

open

only sort of a bug: it's possible to get an unfound object without losing min_size objects due to destructive updates

Added by Samuel Just about 7 years ago. Updated almost 7 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Peering
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Fundamentally, ReplicatedBackend does destructive updates. That makes the following sequence possible. Assume that the primary and the two replicas see an interval change with logs in the following state:

P : ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar)
R1: ... 10'10(foo) 10'11(bar)
R2: ... 10'10(foo) 10'11(bar)
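
To make the starting state concrete, here is a minimal Python sketch. The (version, object) pairs and variable names are illustrative only, not Ceph's actual pg log types, and the epoch is dropped since every entry here is in epoch 10.

# Hypothetical, heavily simplified model of the divergent logs above.
P_LOG  = [(10, "foo"), (11, "bar"), (12, "foo"), (13, "bar")]
R1_LOG = [(10, "foo"), (11, "bar")]
R2_LOG = [(10, "foo"), (11, "bar")]

# 10'12 and 10'13 were applied in place on the primary (destructive updates),
# so the primary no longer holds the contents that 10'10 and 10'11 wrote to
# foo and bar.
print("primary last_update:", P_LOG[-1][0])   # 13
print("replica last_update:", R1_LOG[-1][0])  # 11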

After peering, we'll end up with

P : ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar)
R1: ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar) [missing: foo(10'12), bar(10'13)]
R2: ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar) [missing: foo(10'12), bar(10'13)]
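
Continuing in the same illustrative representation: peering treats the longest log as authoritative, extends the shorter replica logs, and records every object touched by the new entries in that replica's missing set, which is exactly the bracketed state above. The catch_up helper below is hypothetical, not the real peering code.

# Hypothetical sketch of the peering step: extend a replica whose log is a
# strict prefix of the authoritative log, and record what it is now missing.
AUTH_LOG = [(10, "foo"), (11, "bar"), (12, "foo"), (13, "bar")]  # old primary's log
R1_LOG   = [(10, "foo"), (11, "bar")]                            # replica's log

def catch_up(auth_log, replica_log):
    """Return (new_log, missing) where missing maps object -> newest version
    the replica still needs. Assumes replica_log is a prefix of auth_log."""
    missing = {}
    for version, obj in auth_log[len(replica_log):]:
        missing[obj] = version
    return list(auth_log), missing

new_log, r1_missing = catch_up(AUTH_LOG, R1_LOG)
print(r1_missing)   # {'foo': 12, 'bar': 13} -- the bracketed missing set above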

If the primary fails at this point before doing recovery, we'll end up with

R1: ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar) [missing: foo(10'12), bar(10'13)]
R2: ... 10'10(foo) 10'11(bar) 10'12(foo) 10'13(bar) [missing: foo(10'12), bar(10'13)]

where (let's say) R1 is now primary. Since all available replicas are missing foo and bar, those two objects are now unfound even though we've only lost one replica. Worse, they really are unfound because we don't know that the failed primary didn't serve reads on those objects immediately after going active, but before doing recovery.
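
A sketch of why that makes the objects unfound: an object is unfound when every OSD we could still recover it from is itself missing that object. (Real Ceph consults peer_missing and might_have_unfound; the helper below is only an illustration.)

# Hypothetical illustration: with the old primary gone, the surviving acting
# set is R1 and R2, and both are missing foo and bar, so there is no source
# left to recover either object from.
acting = ["R1", "R2"]
missing = {
    "R1": {"foo": 12, "bar": 13},
    "R2": {"foo": 12, "bar": 13},
}

def unfound_objects(acting, missing):
    needed = set()
    for osd in acting:
        needed |= set(missing[osd])
    # Unfound: no member of the acting set has a usable copy.
    return {obj for obj in needed if all(obj in missing[osd] for osd in acting)}

print(unfound_objects(acting, missing))   # {'foo', 'bar'}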

On some level, this is part of the tradeoff of doing destructive updates. ECBackend doesn't suffer from this problem since we always choose the shortest log during peering and all replicas roll back to that point (no destructive updates, so rollback is always possible). You'd be tempted to suggest that ReplicatedBackend simply use the shortest log instead of the longest one, but that's actually just as bad since log entries cannot generally be rolled back without reading that copy of the object from another osd.
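
The contrast can be sketched side by side (again purely illustrative): rolling back to the shortest log leaves nothing missing, while rolling forward to the longest log leaves the shorter replicas missing whatever the extra entries touched.

# Hypothetical comparison of the two strategies on the divergence above.
SHORT = [(10, "foo"), (11, "bar")]
LONG  = [(10, "foo"), (11, "bar"), (12, "foo"), (13, "bar")]

def ec_style(logs):
    # Roll every log back to the shortest last_update; ECBackend can do this
    # because its entries keep enough state to be rolled back. Nothing is missing.
    target = min(len(log) for log in logs)
    return [log[:target] for log in logs], [{} for _ in logs]

def replicated_style(logs):
    # Extend every log to the longest one; shorter replicas mark the objects
    # touched by the extra entries as missing (the newer contents were written
    # in place, so there is nothing to roll back to locally).
    auth = max(logs, key=len)
    missing = [{obj: v for v, obj in auth[len(log):]} for log in logs]
    return [list(auth) for _ in logs], missing

print(ec_style([LONG, SHORT, SHORT])[1])          # [{}, {}, {}]
print(replicated_style([LONG, SHORT, SHORT])[1])  # [{}, {'foo': 12, 'bar': 13}, {'foo': 12, 'bar': 13}]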

However, there are a few things that could be done to make it work better without going all the way over to how ECBackend does things:
1) Choose the most common last_update among acting set replicas if any value is shared by more than one replica (above, choose 10'11 as last_update and recover the primary instead of choosing 10'13 and recovering both replicas); see the sketch after this list.
2) Reduce the set of cases where this can happen by allowing replicas to finish consuming writes from the previous interval before peering in the new one if the primary didn't change. This would be a peering protocol change and would require some thought.
3) Explicitly remember the oldest log we must honor. We can then choose to delay reads on objects whose most recent update is past that point until all log entries older than the newest entry on that object have at least min_size copies. This would be a pretty drastic change and would require a lot of thought as to how that local must-honor version is maintained.
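
As a sketch of option (1), with a hypothetical helper rather than the actual peering code, the acting set's last_update values could be tallied and a repeated value preferred, falling back to the newest value when nothing repeats:

from collections import Counter

def pick_last_update(last_updates):
    """Prefer a last_update shared by more than one acting-set member; otherwise
    keep the current behaviour of taking the newest one. Ties between different
    repeated values are ignored here for simplicity."""
    value, count = Counter(last_updates).most_common(1)[0]
    return value if count > 1 else max(last_updates)

# P is at 10'13, R1 and R2 are at 10'11 (epoch dropped, versions only).
print(pick_last_update([13, 11, 11]))   # 11 -> recover only the primary
print(pick_last_update([13, 12, 11]))   # 13 -> no repeat, behave as today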

#1

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Peering
  • Component(RADOS) OSD added