Bug #13704
Inconsistent vs Down PG behavior is counterintuitive
Status: Closed
Description
Had a discussion about this with João at SUSEcon. I pointed out some Ceph behavior that I merely found odd, but João qualified it as a bug, so here goes. I'm tagging João as the assignee here just to make sure he gets a notification.
- Map an object with ceph osd map <pool> <object> (a concrete end-to-end session is sketched after this list).
- Corrupt that object on the primary OSD, by overwriting it locally.
- Run rados -p <pool> get <object> -
- Observe that the call returns the corrupt object.
- Then, run ceph pg scrub <pgid> on that PG.
- Note that the overall health state turns to HEALTH_WARN.
- Run rados -p <pool> get <object> - again.
- Observe that the call still returns the corrupt object, even though Ceph at this point already knows that it's corrupt.
- Run ceph pg repair <pgid>.
- Observe the cluster state reverting to HEALTH_OK.
- Run rados -p <pool> get <object> - a third time.
- Observe that the call now returns the repaired object.
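To make that concrete, here is roughly what the session looks like end to end. This is only a sketch: the pool name (rbd), object name (testobj), PG id (0.7), acting set ([1,0,2]), and the on-disk path (a FileStore OSD under /var/lib/ceph/osd/) are illustrative assumptions, not values from an actual run.

echo "original data" | rados -p rbd put testobj -
ceph osd map rbd testobj                 # say: object 'testobj' -> pg 0.7 -> acting [1,0,2]

# On the primary's host (osd.1 here), overwrite the object file in place.
# With FileStore, each object is a plain file under current/<pgid>_head/.
OBJ_FILE=$(find /var/lib/ceph/osd/ceph-1/current/0.7_head/ -name 'testobj*')
echo "garbage" > "$OBJ_FILE"

rados -p rbd get testobj -               # returns the corrupt data

ceph pg scrub 0.7                        # scrub flags the PG inconsistent
ceph health                              # HEALTH_WARN (1 pgs inconsistent)
rados -p rbd get testobj -               # still returns the corrupt data

ceph pg repair 0.7
ceph health                              # HEALTH_OK again
rados -p rbd get testobj -               # now returns the repaired object

The middle section is the point: even with the PG flagged inconsistent, the read path happily serves the known-bad primary copy.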
Compare this behavior to what happens when a PG goes down (again, a concrete session is sketched after the list):
- Map an object with ceph osd map <pool> <object>
- Shut down its primary OSD.
- Modify the object with rados -p <pool> put <object> - <<< whatever
- Shut down the object's other OSDs.
- Bring just the original primary OSD back online.
- Observe the cluster status changing to HEALTH_ERR and the PG being reported as Down.
- Try to retrieve the object with rados -p <pool> get <object> -
- Observe that the rados get call blocks.
- Bring one of the other OSDs back online.
- Observe that the rados get call now completes.
- Observe the cluster reverting to the HEALTH_OK state.
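Again as a sketch, under the same illustrative assumptions as above (pool rbd, object testobj, pg 0.7, acting set [1,0,2]), and additionally assuming systemd-managed OSDs (the unit names are examples):

ceph osd map rbd testobj                 # say: pg 0.7, acting [1,0,2], primary osd.1
systemctl stop ceph-osd@1                # shut down the primary
rados -p rbd put testobj - <<< whatever  # the write lands on osd.0 and osd.2
systemctl stop ceph-osd@0                # shut down the object's
systemctl stop ceph-osd@2                #   other OSDs
systemctl start ceph-osd@1               # bring back only the old primary

ceph health                              # HEALTH_ERR; pg 0.7 reported as down
rados -p rbd get testobj -               # blocks: osd.1 missed the write and
                                         # cannot peer without osd.0 or osd.2

systemctl start ceph-osd@0               # bring one of the others back online;
                                         # the blocked get completes and health
                                         # returns to HEALTH_OK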
While the sequence of events for a Down PG makes complete sense, the one for an Inconsistent PG does not. At least after a scrub has detected that the object is corrupt, Ceph should do one of the following, if the corrupt copy of the object is on the primary OSD:
- Block all I/O to the corrupt object(s), or
- Move the primary over to a different replica, just like it does when the OSD dies.
If, conversely, the corrupt copy is on a replica OSD, Ceph should probably prevent that replica from becoming primary until the PG is repaired, or (assuming the healthy primary fails and a corrupt replica takes over) block access to the corrupt object after promoting the replica to primary.
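As an aside, acting on either variant requires knowing which copy is the bad one. Releases newer than the one this report was filed against can report that directly after a scrub; a sketch, reusing the illustrative PG id from above:

ceph pg map 0.7                          # up/acting sets and the current primary
rados list-inconsistent-obj 0.7 --format=json-pretty
                                         # per-object, per-shard scrub findings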
I hope this makes sense. :) Happy to clarify if there are questions.