Improvements to auto repair
We should allow auto repair for BlueStore pools, since BlueStore has built-in checksums. Currently, auto repair is limited to erasure-coded pools.
To trigger an auto repair when a regular scrub detects errors, any such errors should immediately schedule a deep-scrub.
Add a new PG state flag "failed_repair", set when a repair can't fix all errors. This may be tricky to implement because a PG repair ends as a recovery operation.
Set failed_repair if a primary repair triggered by a client read fails.
Add a count of the number of repaired objects to the PG stats and OSD stats.
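To make the proposed flow concrete, here is a rough sketch in Python of how the pieces above might fit together. It is a toy model only, not Ceph code: the class PGModel, its methods, and its field names are all hypothetical, and it ignores the recovery-operation complication mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class PGModel:
    """Toy model of a placement group (hypothetical, not Ceph's PG class)."""
    states: set = field(default_factory=lambda: {"active", "clean"})
    deep_scrub_scheduled: bool = False
    objects_repaired: int = 0  # proposed counter for PG/OSD stats

    def on_regular_scrub(self, objects_with_errors: int) -> None:
        # Proposal: any error found by a regular scrub immediately
        # schedules a deep-scrub, which is where auto repair can run.
        if objects_with_errors > 0:
            self.deep_scrub_scheduled = True

    def on_repair_finished(self, objects_with_errors: int,
                           objects_fixed: int) -> None:
        # The deep-scrub/repair pass has run, so clear the pending flag.
        self.deep_scrub_scheduled = False
        # Proposal: keep a running count of repaired objects.
        self.objects_repaired += objects_fixed
        # Proposal: flag the PG when the repair could not fix everything.
        if objects_fixed < objects_with_errors:
            self.states.add("failed_repair")
        else:
            self.states.discard("failed_repair")


pg = PGModel()
pg.on_regular_scrub(objects_with_errors=2)
pg.on_repair_finished(objects_with_errors=2, objects_fixed=1)
print(sorted(pg.states), pg.objects_repaired)
```

In this sketch a later, fully successful repair clears failed_repair again; whether the real flag should behave that way is an open design question, not something decided in this ticket.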
#3 Updated by David Zafman almost 3 years ago
I don't think we need to set "failed_repair" if the primary can't recover itself on a read error. We are already setting the "recovery_unfound" PG state.
For example, if a primary read gets an EIO and we are also unable to read another replica, this is the resulting PG state:
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.0 2 1 1 0 1 3138 0 0 2 2 active+recovery_unfound+degraded 2019-03-14 09:22:36.159974 11'2 16:27 [1,0] 1 [1,0] 1 0'0 2019-03-14 09:21:56.003113 0'0 2019-03-14 09:21:56.003113 0
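Since the row is hard to read against its header, here is a small Python sketch that pairs each column name with its value. It assumes the single-space layout exactly as pasted above, and that every column whose name ends in _STAMP holds a date and a time as two separate tokens. The helper parse_pg_row is hypothetical; for real scripting, the JSON output format is probably a better fit than parsing the plain-text table.

```python
HEADER = (
    "PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES "
    "OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED "
    "UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP "
    "LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN"
)
ROW = (
    "1.0 2 1 1 0 1 3138 0 0 2 2 active+recovery_unfound+degraded "
    "2019-03-14 09:22:36.159974 11'2 16:27 [1,0] 1 [1,0] 1 0'0 "
    "2019-03-14 09:21:56.003113 0'0 2019-03-14 09:21:56.003113 0"
)


def parse_pg_row(header: str, row: str) -> dict:
    """Map column names to values (hypothetical helper, not a Ceph API)."""
    names = header.split()
    tokens = row.split()
    values = []
    i = 0
    for name in names:
        if name.endswith("_STAMP"):
            # Assumption: timestamps are "date time", i.e. two tokens.
            values.append(tokens[i] + " " + tokens[i + 1])
            i += 2
        else:
            values.append(tokens[i])
            i += 1
    return dict(zip(names, values))


pg = parse_pg_row(HEADER, ROW)
print(pg["STATE"].split("+"))
print("MISSING_ON_PRIMARY:", pg["MISSING_ON_PRIMARY"], "UNFOUND:", pg["UNFOUND"])
```

Read this way, the row shows one object missing on the primary and one unfound object, with recovery_unfound already present in the PG state.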