Bug #13704
Inconsistent vs Down PG behavior is counterintuitive
Status: Closed
Description
Had a discussion about this with João at SUSEcon. I pointed out some Ceph behavior that I merely found odd, but João qualified it as a bug, so here goes. I'm tagging João as the assignee here just to make sure he gets a notification.
- Map an object with ceph osd map <pool> <object> (a concrete end-to-end session is sketched after this list).
- Corrupt that object on the primary OSD, by overwriting it locally.
- Run rados -p <pool> get <object> -
- Observe that the call returns the corrupt object.
- Then, run ceph pg scrub <pgid> on that PG.
- Note that the overall health state turns to HEALTH_WARN.
- Run rados -p <pool> get <object> - again.
- Observe that the call still returns the corrupt object, even though Ceph at this point already knows that it's corrupt.
- Run ceph pg repair <pgid>.
- Observe the cluster state reverting to HEALTH_OK.
- Run rados -p <pool> get <object> - a third time.
- Observe that the call now returns the repaired object.
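To make that concrete, here is roughly what the session looks like end to end. This is only a sketch: the pool name (rbd), object name (testobj), PG id (0.7), acting set ([1,0,2]), and the on-disk path (a FileStore OSD under /var/lib/ceph/osd/) are illustrative assumptions, not values from an actual run.

echo "original data" | rados -p rbd put testobj -
ceph osd map rbd testobj                 # say: object 'testobj' -> pg 0.7 -> acting [1,0,2]

# On the primary's host (osd.1 here), overwrite the object file in place.
# With FileStore, each object is a plain file under current/<pgid>_head/.
OBJ_FILE=$(find /var/lib/ceph/osd/ceph-1/current/0.7_head/ -name 'testobj*')
echo "garbage" > "$OBJ_FILE"

rados -p rbd get testobj -               # returns the corrupt data

ceph pg scrub 0.7                        # scrub flags the PG inconsistent
ceph health                              # HEALTH_WARN (1 pgs inconsistent)
rados -p rbd get testobj -               # still returns the corrupt data

ceph pg repair 0.7
ceph health                              # HEALTH_OK again
rados -p rbd get testobj -               # now returns the repaired object

The middle section is the point: even with the PG flagged inconsistent, the read path happily serves the known-bad primary copy.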
Compare this behavior to what happens when a PG goes down (again, a concrete session is sketched after the list):
- Map an object with ceph osd map <pool> <object>
- Shut down its primary OSD.
- Modify the object with rados -p <pool> put <object> - <<< whatever
- Shut down the object's other OSDs.
- Bring just the original primary OSD back online.
- Observe the cluster status changing to HEALTH_ERR and the PG being reported as Down.
- Try to retrieve the object with rados -p <pool> get <object> -
- Observe that the rados get call blocks.
- Bring one of the other OSDs back online.
- Observe that the rados get call now completes.
- Observe the cluster reverting to the HEALTH_OK state.
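Again as a sketch, under the same illustrative assumptions as above (pool rbd, object testobj, pg 0.7, acting set [1,0,2]), and additionally assuming systemd-managed OSDs (the unit names are examples):

ceph osd map rbd testobj                 # say: pg 0.7, acting [1,0,2], primary osd.1
systemctl stop ceph-osd@1                # shut down the primary
rados -p rbd put testobj - <<< whatever  # the write lands on osd.0 and osd.2
systemctl stop ceph-osd@0                # shut down the object's
systemctl stop ceph-osd@2                #   other OSDs
systemctl start ceph-osd@1               # bring back only the old primary

ceph health                              # HEALTH_ERR; pg 0.7 reported as down
rados -p rbd get testobj -               # blocks: osd.1 missed the write and
                                         # cannot peer without osd.0 or osd.2

systemctl start ceph-osd@0               # bring one of the others back online;
                                         # the blocked get completes and health
                                         # returns to HEALTH_OK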
While the sequence of events for a Down PG makes complete sense, the one for an Inconsistent PG does not. At least after a scrub has detected that the object is corrupt, Ceph should do one of the following, if the corrupt copy of the object is on the primary OSD:
- Block all I/O to the corrupt object(s), or
- Move the primary over to a different replica, just like it does when the OSD dies.
If, conversely, the corrupt copy is on a replica OSD, Ceph should probably prevent that replica from becoming primary until the PG is repaired, or (assuming the healthy primary fails and a corrupt replica takes over) block access to the corrupt object after promoting the replica to primary.
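As an aside, acting on either variant requires knowing which copy is the bad one. Releases newer than the one this report was filed against can report that directly after a scrub; a sketch, reusing the illustrative PG id from above:

ceph pg map 0.7                          # up/acting sets and the current primary
rados list-inconsistent-obj 0.7 --format=json-pretty
                                         # per-object, per-shard scrub findings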
I hope this makes sense. :) Happy to clarify if there are questions.