Bug #23267
scrub errors not cleared on replicas can cause inconsistent pg state when replica takes over primary
Description
The PG_STATE_INCONSISTENT flag is set based on num_scrub_errors. After scrub inconsistencies have been repaired, a pg query can show the non-primaries still reporting num_scrub_errors > 0 in their local object_stat_sum_t. If such a non-primary becomes primary, the inconsistent pg state can reappear until another scrub/deep-scrub clears num_scrub_errors.
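The failure mode can be illustrated with a toy model. This is only a sketch of the behaviour described above, not Ceph's code: the PgCopy class, the scrub_and_repair and reported_state helpers, and the clear_replicas switch are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PgCopy:
    """One OSD's local view of a pg (hypothetical, heavily simplified)."""
    osd: int
    num_scrub_errors: int = 0

    def is_inconsistent(self) -> bool:
        # In this model the inconsistent flag is derived purely
        # from the local counter, as the description states.
        return self.num_scrub_errors > 0


def scrub_and_repair(primary: PgCopy, replicas: List[PgCopy],
                     clear_replicas: bool) -> None:
    """Model a successful repair.

    clear_replicas=False mimics the reported bug: only the primary's
    counter is reset, so the replicas keep a stale value.
    """
    primary.num_scrub_errors = 0
    if clear_replicas:
        for replica in replicas:
            replica.num_scrub_errors = 0


def reported_state(acting_primary: PgCopy) -> str:
    """State string as seen from whichever copy is currently primary."""
    if acting_primary.is_inconsistent():
        return "active+clean+inconsistent"
    return "active+clean"
```

In this model, repairing with clear_replicas=False leaves reported_state(primary) clean, while calling reported_state on a former replica (that is, after it takes over as primary) shows the inconsistent state again.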
Related issues
History
#1 Updated by David Zafman about 6 years ago
- Status changed from New to 12
#2 Updated by David Zafman about 6 years ago
- Status changed from 12 to In Progress
- Priority changed from Normal to Urgent
#3 Updated by David Zafman about 6 years ago
I reproduced this by creating an inconsistent pg and then causing it to split.
I used a pool of size 2 with 1 pg and created an inconsistency in one object (it had 2 scrub errors: size and data_digest_mismatch).
After increasing pg_num/pgp_num to 8, only 1 of the resulting pgs had a non-zero num_scrub_errors, on a single replica, and the value was 1.
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.7 12 0 0 0 0 20048 101 101 active+clean 2018-03-21 19:46:17.366706 11'101 20:169 [1,2] 1 [1,2] 1 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.6 12 0 0 0 0 20048 101 101 active+clean 2018-03-21 19:46:15.176818 11'101 19:146 [1,0] 1 [1,0] 1 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.5 13 0 0 0 0 20048 101 101 active+clean 2018-03-21 19:46:18.414012 11'101 20:54 [2,0] 2 [2,0] 2 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.4 13 0 0 0 0 20049 101 101 active+clean 2018-03-21 19:46:15.181264 11'101 19:146 [1,0] 1 [1,0] 1 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.3 13 0 0 0 0 20049 101 101 active+clean 2018-03-21 19:46:18.724337 11'101 20:173 [1,2] 1 [1,2] 1 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.2 13 0 0 0 0 20049 101 101 active+clean 2018-03-21 19:46:16.187580 11'101 20:136 [0,1] 0 [0,1] 0 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.1 13 0 0 0 0 20049 101 101 active+clean 2018-03-21 19:46:19.378158 11'101 20:30 [2,0] 2 [2,0] 2 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
1.0 12 0 0 0 0 20048 101 101 active+clean 2018-03-21 19:46:15.174846 11'101 19:147 [1,0] 1 [1,0] 1 11'101 2018-03-21 19:43:22.172793 11'101 2018-03-21 19:43:22.172793 0
Interestingly, all the primaries have num_scrub_errors == 0, but a single post-split pg replica has a non-zero value.
1.0 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.1 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.2 "num_scrub_errors": 0, "num_scrub_errors": 1,
1.3 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.4 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.5 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.6 "num_scrub_errors": 0, "num_scrub_errors": 0,
1.7 "num_scrub_errors": 0, "num_scrub_errors": 0,
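A check like the one above can be scripted. The following is a rough sketch that assumes the pg query output has been saved as JSON text; it walks the whole document and reports every non-zero num_scrub_errors it finds, so it does not depend on the exact layout of the output. The function names and the SAMPLE document (including its nesting) are made up for illustration and are not taken from a real cluster.

```python
import json


def collect_scrub_errors(node, path=""):
    """Return (path, value) for every "num_scrub_errors" key under node."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == "num_scrub_errors":
                found.append((child, value))
            else:
                found.extend(collect_scrub_errors(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(collect_scrub_errors(item, f"{path}/{index}"))
    return found


def stale_counters(query_json):
    """Keep only the counters that are still non-zero."""
    document = json.loads(query_json)
    return [(p, v) for p, v in collect_scrub_errors(document) if v != 0]


# Hypothetical, heavily trimmed document: one local counter and one peer counter.
SAMPLE = json.dumps({
    "info": {"stats": {"stat_sum": {"num_scrub_errors": 0}}},
    "peer_info": [
        {"peer": "1", "stats": {"stat_sum": {"num_scrub_errors": 1}}},
    ],
})

if __name__ == "__main__":
    for path, value in stale_counters(SAMPLE):
        print(path, value)
```

Reporting the path alongside the value makes it easy to tell whether a stale counter belongs to the local copy or to a peer.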
#4 Updated by David Zafman almost 6 years ago
- Status changed from In Progress to Fix Under Review
- Backport set to luminous, jewel
#5 Updated by David Zafman almost 6 years ago
- Copied to Backport #23485: luminous: scrub errors not cleared on replicas can cause inconsistent pg state when replica takes over primary added
#6 Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #23486: jewel: scrub errors not cleared on replicas can cause inconsistent pg state when replica takes over primary added
#7 Updated by Greg Farnum almost 6 years ago
- Status changed from Fix Under Review to Pending Backport
#8 Updated by David Zafman almost 6 years ago
- Related to Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair added
#9 Updated by Yuri Weinstein almost 6 years ago
#10 Updated by David Zafman almost 6 years ago
- Status changed from Pending Backport to Resolved