Bug #13704

Inconsistent vs Down PG behavior is counterintuitive

Added by Florian Haas over 8 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Had a discussion about this with João at SUSEcon. I pointed out some Ceph behavior that I merely found odd, but João considered it a bug, so here goes. I'm tagging João as the assignee here just to make sure he gets a notification.

  1. Map an object with ceph osd map <pool> <object>.
  2. Corrupt that object in the primary OSD, by overwriting it locally.
  3. Run rados -p <pool> get <object> -.
  4. Observe that the call returns the corrupt object.
  5. Then, run ceph pg scrub <pgid> on that PG.
  6. Note that the overall health state turns to HEALTH_WARN.
  7. Run rados -p <pool> get <object> - again.
  8. Observe that the call still returns the corrupt object, even though Ceph at this point already knows that it's corrupt.
  9. Run ceph pg repair <pgid>.
  10. Observe the cluster state reverting to HEALTH_OK.
  11. Run rados -p <pool> get <object> - a third time.
  12. Observe that the call now returns the repaired object.
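
For reference, the sequence above roughly corresponds to the shell session below. This is only a sketch: all names and ids are hypothetical, and it assumes a small test cluster with a FileStore OSD using the default data path, root access on the OSD host, and osd.2 acting as primary for the object.

  # Hypothetical names/ids; substitute the values for your own cluster.
  POOL=testpool
  OBJ=testobj
  PGID=1.2f        # pgid reported by 'ceph osd map' below
  OSD=2            # primary OSD id reported by 'ceph osd map' below

  # 1. Map the object; note the pgid and the primary (first) OSD in the acting set.
  ceph osd map $POOL $OBJ

  # 2. On the primary OSD host, overwrite the object's on-disk copy. Stopping
  #    the OSD first avoids its caches masking the change.
  systemctl stop ceph-osd@$OSD
  OBJ_FILE=$(find /var/lib/ceph/osd/ceph-$OSD/current -name "${OBJ}*" -type f | head -n1)
  echo garbage > "$OBJ_FILE"
  systemctl start ceph-osd@$OSD

  # 3-4. The read succeeds and returns the corrupt data.
  rados -p $POOL get $OBJ -

  # 5-6. Scrub the PG; health goes to HEALTH_WARN and the PG is flagged
  #      inconsistent. (If the corrupted copy kept its original size, a
  #      'ceph pg deep-scrub' may be needed instead.)
  ceph pg scrub $PGID
  ceph health detail

  # 7-8. The corrupt object is still returned.
  rados -p $POOL get $OBJ -

  # 9-12. Repair the PG; health returns to HEALTH_OK and the read now returns
  #       the repaired copy.
  ceph pg repair $PGID
  rados -p $POOL get $OBJ -
  ceph health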

Compare this behavior to what happens when a PG goes down:

  1. Map an object with ceph osd map <pool> <object>.
  2. Shut down its primary OSD.
  3. Modify the object with rados -p <pool> put <object> - <<< whatever
  4. Shut down the object's other OSDs.
  5. Bring just the original primary OSD back online.
  6. Observe the cluster status changing to HEALTH_ERR and the PG being reported as Down.
  7. Try to retrieve the object with rados -p <pool> get <object> -.
  8. Observe that the rados get call blocks.
  9. Bring one of the other OSDs back online.
  10. Observe that the rados get call now completes.
  11. Observe the cluster reverting to the HEALTH_OK state.
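
Similarly, a sketch of the down-PG sequence, reusing the hypothetical names from above and assuming a systemd deployment where OSDs run as ceph-osd@<id> units, with osd.2 the primary and osd.0/osd.1 the replicas for this object:

  # 1. Confirm the mapping and the acting set.
  ceph osd map $POOL $OBJ

  # 2-3. Stop the primary, then write a newer version of the object; the write
  #      is handled by the surviving OSDs.
  systemctl stop ceph-osd@2
  rados -p $POOL put $OBJ - <<< whatever

  # 4-6. Stop the OSDs holding the new version, then bring only the original
  #      primary back. It has only the stale copy, so the PG is reported Down
  #      and health drops to HEALTH_ERR.
  systemctl stop ceph-osd@0 ceph-osd@1
  systemctl start ceph-osd@2
  ceph health detail

  # 7-8. This read blocks instead of returning stale data.
  rados -p $POOL get $OBJ -

  # 9-11. Once an OSD with the newer copy is back, the blocked read completes
  #       and health returns to HEALTH_OK.
  systemctl start ceph-osd@0
  ceph health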

While the sequence of events for a Down PG makes complete sense, the one for an Inconsistent PG does not. At least after a scrub has detected the object to be corrupt, Ceph should do one of the following, if the corrupt copy of the object is on the Primary OSD:

  • Block all I/O to the corrupt object(s), or
  • Move the Primary over to a different replica, just like it does when the OSD dies.

If, conversely, the corrupt copy is on a replica OSD, Ceph should probably prevent that replica from becoming primary until the PG is repaired, or (assuming the healthy primary fails and a corrupt replica takes over) block access to the corrupt object after promoting the replica to primary.

I hope this makes sense. :) Happy to clarify if there are questions.

#1

Updated by Sage Weil about 7 years ago

  • Status changed from New to Closed

The down vs. corrupted cases are different internally (hence the different behavior). EC pools now handle corrupt local copies correctly. For replicated pools, FileStore doesn't know whether a copy is corrupt, and the scrub state isn't propagated in a way that would block reads. BlueStore, however, will return a checksum error for bitrot. This is one of the last remaining todos for Luminous. Closing out this bug since we have a Trello card to track it (and we'll forget to update this :).
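
For reference, the scrub state mentioned here can be inspected from the command line on Jewel and later releases, even though reads of the affected object are not blocked. A minimal sketch, reusing the hypothetical pool and pgid from the description:

  # PGs in the pool that scrub has flagged as inconsistent.
  rados list-inconsistent-pg $POOL

  # Per-object scrub findings for one of those PGs (which shards are bad and why).
  rados list-inconsistent-obj $PGID --format=json-pretty

  # The inconsistent PG is also named in the health output.
  ceph health detail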

#2

Updated by Florian Haas about 7 years ago

It still doesn't make an awful lot of sense to me why corrupt objects would still be returned even after they have been detected as corrupt (i.e. post-scrub, pre-repair). ¯\_(ツ)_/¯
