Bug #13704

closed

Inconsistent vs Down PG behavior is counterintuitive

Added by Florian Haas over 8 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had a discussion about this with João at SUSEcon. I pointed out some Ceph behavior that I merely found odd, but João qualified it as a bug, so here goes. I'm tagging João as the assignee here just to make sure he gets a notification.

  1. Map an object with ceph osd map <pool> <object>.
  2. Corrupt that object on the primary OSD by overwriting it locally.
  3. Run rados -p <pool> get <object> -.
  4. Observe that the call returns the corrupt object.
  5. Then, run ceph pg scrub <pgid> on that PG.
  6. Note that the overall health state turns to HEALTH_WARN.
  7. Run rados -p <pool> get <object> - again.
  8. Observe that the call still returns the corrupt object, even though Ceph at this point already knows that it's corrupt.
  9. Run ceph pg repair <pgid>.
  10. Observe the cluster state reverting to HEALTH_OK.
  11. Run rados -p <pool> get <object> - the third time.
  12. Observe that the call now returns the repaired object.
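
The same sequence, consolidated as a shell sketch (bash). The pool and object names below are illustrative assumptions rather than values from this report, and the PG id and the primary OSD's on-disk object path have to be filled in by hand:

  POOL=rbd
  OBJ=testobj
  PGID=<pgid>                        # take this from the "ceph osd map" output

  ceph osd map "$POOL" "$OBJ"        # step 1: shows the PG and its acting set
  # step 2 (manual): overwrite the object's file in the primary OSD's local
  # store, e.g. somewhere under /var/lib/ceph/osd/ceph-<id>/ (path varies)
  rados -p "$POOL" get "$OBJ" -      # steps 3-4: returns the corrupt object
  ceph pg scrub "$PGID"              # step 5: scrub detects the inconsistency
  ceph health                        # step 6: HEALTH_WARN, PG inconsistent
  rados -p "$POOL" get "$OBJ" -      # steps 7-8: still the corrupt object
  ceph pg repair "$PGID"             # step 9: repair from a good copy
  ceph health                        # step 10: back to HEALTH_OK
  rados -p "$POOL" get "$OBJ" -      # steps 11-12: now the repaired object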

Compare this behavior to what happens when a PG goes down:

  1. Map an object with ceph osd map <pool> <object>.
  2. Shut down its primary OSD.
  3. Modify the object with rados -p <pool> put <object> - <<< whatever
  4. Shut down the object's other OSDs.
  5. Bring just the original primary OSD back online.
  6. Observe the cluster status changing to HEALTH_ERR and the PG being reported as Down.
  7. Try to retrieve the object with rados -p <pool> get <object> -.
  8. Observe that the rados get call blocks.
  9. Bring one of the other OSDs back online.
  10. Observe that the rados get call now completes.
  11. Observe the cluster reverting to the HEALTH_OK state.
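
The Down-PG sequence as an equivalent shell sketch (bash). The pool and object names and the OSD ids are illustrative assumptions; it presumes three replicas on OSDs 1 (primary), 0 and 2, managed by systemd, so adjust to your cluster and init system:

  POOL=rbd
  OBJ=testobj

  ceph osd map "$POOL" "$OBJ"                   # step 1: note the acting set, e.g. [1,0,2]
  systemctl stop ceph-osd@1                     # step 2: stop the primary
  rados -p "$POOL" put "$OBJ" - <<< whatever    # step 3: write while the old primary is down
  systemctl stop ceph-osd@0 ceph-osd@2          # step 4: stop the remaining OSDs
  systemctl start ceph-osd@1                    # step 5: bring only the old primary back
  ceph health                                   # step 6: HEALTH_ERR, PG reported down
  rados -p "$POOL" get "$OBJ" -                 # steps 7-8: this call blocks
  # from another shell:
  systemctl start ceph-osd@0                    # step 9: bring back an OSD that saw the write
  # steps 10-11: the blocked get now completes and health returns to HEALTH_OK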

While the sequence of events for a Down PG makes complete sense, the one for an Inconsistent PG does not. Once a scrub has detected that an object is corrupt, and the corrupt copy is on the primary OSD, Ceph should at least do one of the following:

  • Block all I/O to the corrupt object(s), or
  • Move the primary role over to a different replica, just like it does when the primary OSD dies.

If, conversely, the corrupt copy is on a replica OSD, Ceph should probably prevent that replica from becoming primary until the PG is repaired, or, should the healthy primary fail and a corrupt replica take over, block access to the corrupt object after the promotion.
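
Either way, an operator can at least locate the problem by hand today; a hedged sketch, with the PG id as a placeholder and the last command an assumption (list-inconsistent-obj postdates the release this was filed against):

  ceph health detail               # lists the PG(s) flagged inconsistent by the scrub
  ceph pg map <pgid>               # up/acting sets; the first OSD listed is the primary
  ceph pg <pgid> query             # detailed PG state, including scrub information
  # on releases that have it, this shows per-shard errors, i.e. which copy is bad:
  rados list-inconsistent-obj <pgid> --format=json-pretty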

I hope this makes sense. :) Happy to clarify if there are questions.
