Bug #13704
Inconsistent vs Down PG behavior is counterintuitive
Status: Closed
Description
Had a discussion about this with João at SUSEcon. I pointed out some Ceph behavior that I merely found odd, but João qualified it as a bug, so here goes. I'm tagging João as the assignee here just to make sure he gets a notification.
- Map an object with `ceph osd map <pool> <object>`.
- Corrupt that object in the primary OSD, by overwriting it locally.
- Run `rados -p <pool> get <object> -`.
- Observe that the call returns the corrupt object.
- Then, run `ceph pg scrub <pgid>` on that PG.
- Note that the overall health state turns to `HEALTH_WARN`.
- Run `rados -p <pool> get <object> -` again.
- Observe that the call still returns the corrupt object, even though Ceph at this point already knows that it is corrupt.
- Run `ceph pg repair <pgid>`.
- Observe the cluster state reverting to `HEALTH_OK`.
- Run `rados -p <pool> get <object> -` a third time.
- Observe that the call now returns the repaired object.
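The walkthrough above relies on knowing which OSD is the acting primary for the object. A minimal sketch of a helper that pulls the primary's id out of `ceph osd map` output — the helper name and the exact output line format are assumptions (based on typical releases; check yours), not anything from this report:

```shell
#!/bin/sh
# Hypothetical helper: extract the acting-primary OSD id from a
# `ceph osd map <pool> <object>` output line. The sample line below is
# an assumed format; verify against your release's actual output.
primary_osd() {
  sed -n 's/.*acting (\[[0-9,]*\], *p\([0-9][0-9]*\)).*/\1/p'
}

# Canned example line instead of a live cluster:
sample="osdmap e123 pool 'rbd' (0) object 'foo' -> pg 0.7fc1f406 (0.6) -> up ([2,1,0], p2) acting ([2,1,0], p2)"
echo "$sample" | primary_osd   # prints the acting-primary id, 2 here
```

With the primary's id in hand, the "corrupt it locally" step means overwriting the object's on-disk copy under that OSD's data directory.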
Compare this behavior to what happens when a PG goes down:
- Map an object with `ceph osd map <pool> <object>`.
- Shut down its primary OSD.
- Modify the object with `rados -p <pool> put <object> - <<< whatever`.
- Shut down the object's other OSDs.
- Bring just the original primary OSD back online.
- Observe the cluster status changing to `HEALTH_ERR` and the PG being reported as down.
- Try to retrieve the object with `rados -p <pool> get <object> -`.
- Observe that the `rados get` call blocks.
- Bring one of the other OSDs back online.
- Observe that the `rados get` call now completes.
- Observe the cluster reverting to the `HEALTH_OK` state.
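The asymmetry the two walkthroughs expose can be stated compactly: against a down PG the client blocks, while against an inconsistent PG reads are still served. A hypothetical sketch (the function and the behavior mapping are mine, summarizing the observations above — this is not Ceph code) keyed on the PG state string as reported by `ceph pg dump`:

```shell
#!/bin/sh
# Hypothetical classifier (not Ceph code): given a PG state string,
# say what a client read against that PG experiences, per the
# walkthroughs in this report.
client_behavior() {
  case "$1" in
    *down*)         echo "blocks until the PG is available again" ;;
    *inconsistent*) echo "served, possibly returning a corrupt copy" ;;
    *)              echo "served normally" ;;
  esac
}

client_behavior "down+peering"
client_behavior "active+clean+inconsistent"
client_behavior "active+clean"
```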
While the sequence of events for a down PG makes complete sense, the one for an inconsistent PG does not. At least after a scrub has detected that an object is corrupt, Ceph should be doing one of the following things, if the corrupt copy of the object is on the primary OSD:
- Block all I/O to the corrupt object(s), or
- Move the Primary over to a different replica, just like it does when the OSD dies.
If, conversely, the corrupt copy is on a replica OSD, Ceph should probably prevent that replica from becoming primary until the PG is repaired, or (assuming the healthy primary fails and a corrupt replica takes over) block access to the corrupt object after promoting the replica to primary.
I hope this makes sense. :) Happy to clarify if there are questions.
Updated by Sage Weil about 7 years ago
- Status changed from New to Closed
The down and corrupted cases are different internally (hence the different behavior). EC pools now handle corrupt local copies correctly. For replicated pools, filestore doesn't know whether an object is corrupt, and the scrub state is only propagated that way. BlueStore, however, will return a csum error for bitrot. This is one of the last remaining todos for Luminous. Closing out this bug since we have a Trello card to track it (and we'll forget to update this :).
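The BlueStore point can be illustrated in miniature: if a checksum is stored alongside the data at write time, a read can detect bitrot on its own, without waiting for a scrub. A toy sketch — plain files and `sha256sum` standing in for BlueStore's per-block checksums; none of this is Ceph code:

```shell
#!/bin/sh
# Toy model of checksum-on-read (what BlueStore does per block, done here
# per file with sha256sum). Not Ceph code.
set -e
dir=$(mktemp -d)
printf 'original object data' > "$dir/obj"
sha256sum "$dir/obj" | awk '{print $1}' > "$dir/obj.csum"   # recorded at write time

printf 'bitrot!!!' > "$dir/obj"                             # corrupt the on-disk copy

# A csum-verifying read: compare the stored checksum with the current one.
stored=$(cat "$dir/obj.csum")
actual=$(sha256sum "$dir/obj" | awk '{print $1}')
if [ "$stored" = "$actual" ]; then
  echo "read ok"
else
  echo "csum mismatch: fail the read instead of returning corrupt data"
fi
rm -rf "$dir"
```

Filestore has no such per-read check, which is why a scrub is the first point at which the corruption becomes visible there.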
Updated by Florian Haas about 7 years ago
It still doesn't make an awful lot of sense to me why corrupt objects would still be returned even after they have been detected as corrupt (i.e. post-scrub, pre-repair). ¯\_(ツ)_/¯