Bug #23577

Inconsistent PG refusing to deep-scrub or repair

Added by David Turner almost 6 years ago. Updated almost 6 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
David Zafman
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is an issue brought over from the ceph-users mailing list, from a thread titled "Have an inconsistent PG, repair not working". Three of us each have an inconsistent PG that will not run a new deep-scrub or repair. One of the users is using Bluestore for all of their OSDs. Of the 11 OSDs for my PG, 9 are filestore and the other 2 are on bluestore (I am partway through a migration). Below is my initial response outlining the diagnostics I've performed while trying to resolve this on my own.

---
I'm running 12.2.2 and I have an EC PG with a scrub error. It has the same output for [1] rados list-inconsistent-obj as mentioned before. This is the [2] full health detail. This is the [3] excerpt from the log of the deep-scrub that marked the PG inconsistent. The scrub happened when the PG was starting up after I used ceph-objectstore-tool to split its filestore subfolders with a [4] script that I've used for the better part of a year without any side effects.

I have tried quite a few things to get this PG to deep-scrub or repair, to no avail. I set osd_max_scrubs to 0 on every OSD in the cluster, waited for all scrubs and deep-scrubs to finish, then raised osd_max_scrubs to 1 on the 11 OSDs for this PG before issuing a deep-scrub. The PG then sat for over an hour without deep-scrubbing. My current test is to set osd_max_scrubs to 1 on all OSDs, raise it to 4 on the OSDs for this PG, and then issue the repair, but again nothing happens. Each time I issue the deep-scrub or repair, the output correctly says 'instructing pg 145.2e3 on osd.234 to repair', but nothing shows up in the OSD's log and the PG state stays 'active+clean+inconsistent'.
---

[1] $ rados list-inconsistent-obj 145.2e3
No scrub information available for pg 145.2e3
error 2: (2) No such file or directory

[2] $ ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 145.2e3 is active+clean+inconsistent, acting [234,132,33,331,278,217,55,358,79,3,24]
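
Editorial sketch, not part of the original report: the acting set in the health detail line above is the list of OSDs that the later steps adjust. Assuming the line keeps the "pg <pgid> is <state>, acting [a,b,c]" shape shown here, a small helper (the name acting_osds is hypothetical) could pull the OSD ids out of `ceph health detail` output:

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` output on stdin and print
# the acting OSD ids of the given PG, one per line.
acting_osds() {
    awk -v pg="$1" '
        # Append "" to force a string comparison: a PG id such as 145.2e3
        # also looks like a floating-point number to awk.
        $1 == "pg" && ($2 "") == (pg "") {
            # Keep only the text between the square brackets.
            if (match($0, /\[[0-9,]+\]/)) {
                list = substr($0, RSTART + 1, RLENGTH - 2)
                n = split(list, osds, ",")
                for (i = 1; i <= n; i++) print osds[i]
            }
        }'
}

# Example using the line from this report instead of a live cluster.
printf '%s\n' 'pg 145.2e3 is active+clean+inconsistent, acting [234,132,33,331,278,217,55,358,79,3,24]' \
    | acting_osds 145.2e3
```

On a live cluster the input would presumably come from `ceph health detail | acting_osds 145.2e3`.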

[3] 2018-04-04 15:24:53.603380 7f54d1820700 0 log_channel(cluster) log [DBG] : 145.2e3 deep-scrub starts
2018-04-04 17:32:37.916853 7f54d1820700 -1 log_channel(cluster) log [ERR] : 145.2e3s0 deep-scrub 1 missing, 0 inconsistent objects
2018-04-04 17:32:37.916865 7f54d1820700 -1 log_channel(cluster) log [ERR] : 145.2e3 deep-scrub 1 errors

[4] https://gist.github.com/drakonstein/cb76c7696e65522ab0e699b7ea1ab1c4


Related issues

Related to RADOS - Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair Can't reproduce 04/06/2018

History

#1 Updated by David Turner almost 6 years ago

I have a second PG in the same cluster doing this exact same thing. One of its 11 copies is on Bluestore; the rest are on filestore. What information can I provide, or what testing can I do, to help figure this out?

#2 Updated by David Turner almost 6 years ago

I attempted to upload a log file with debug_osd = 20/20 for this with upload tag e6d4f641-3006-4ee9-86eb-359f569de6ed, but I'm uncertain if it was successful.

The log starts where I increased the debug_osd logging and ends where I reverted it, ~10 seconds after issuing the deep-scrub to PG 145.2e3. The order of operations was: set osd_max_scrubs=1 on all OSDs, set osd_max_scrubs=4 on the 11 OSDs belonging to this PG, then issue the deep-scrub to the PG. Hopefully that makes it easier to parse through the file.
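
Editorial sketch, not part of the original comment: the order of operations described above could be written out roughly as follows. The report does not say how osd_max_scrubs was changed; using `ceph tell ... injectargs` is an assumption here, and the function name scrub_plan is hypothetical. The script only prints the commands (a dry run) so that they can be reviewed before being run against a cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# PG id and acting set are taken from the health detail in this report.
PGID="145.2e3"
ACTING="234 132 33 331 278 217 55 358 79 3 24"

scrub_plan() {
    pgid="$1"; shift
    # Step 1: allow one concurrent scrub on every OSD in the cluster.
    # (injectargs is an assumed mechanism, not confirmed by the report.)
    echo "ceph tell osd.* injectargs '--osd_max_scrubs 1'"
    # Step 2: raise the limit to 4 on each OSD that holds this PG.
    for osd in "$@"; do
        echo "ceph tell osd.${osd} injectargs '--osd_max_scrubs 4'"
    done
    # Step 3: ask for a deep-scrub of the PG.
    echo "ceph pg deep-scrub ${pgid}"
}

# ACTING is left unquoted on purpose so each OSD id becomes its own argument.
scrub_plan "$PGID" $ACTING
```

Printing the plan first keeps the sketch safe to run anywhere; piping its output to `sh` would execute it, which should only be done after checking each line.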

#3 Updated by David Zafman almost 6 years ago

  • Assignee set to David Zafman

#4 Updated by David Zafman almost 6 years ago

  • Related to Bug #23576: osd: active+clean+inconsistent pg will not scrub or repair added

#5 Updated by David Turner almost 6 years ago

It took a month for our deep-scrub cycle to complete, but eventually scrubs started working on these PGs on their own.

#6 Updated by David Zafman almost 6 years ago

  • Status changed from New to Can't reproduce
