Documentation #61739

open

Improve "Repairing PG Inconsistencies" page

Added by Niklas Hambuechen 11 months ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: documentation
Target version: -
% Done: 0%

Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

My cluster currently has "1 scrub errors" and "1 pg inconsistent" so I'm reading the docs at

https://docs.ceph.com/en/quincy/rados/operations/pg-repair/

and hit a couple of roadblocks that make the docs confusing for an administrator. I'll list them here in the hope that somebody can improve them.

1. The suggested diagnosis commands do not produce the expected output

"To see a list of inconsistent PGs, run the following command:"

rados list-inconsistent-pg {pool}

OK, that prints

# rados list-inconsistent-pg mypool
["2.87"]

"To see a list of inconsistent RADOS objects, run the following command:"

rados list-inconsistent-obj {pgid}

# rados list-inconsistent-obj 2.87
No scrub information available for pg 2.87
error 2: (2) No such file or directory

Uh, what now, what does that mean?

Also, if there's no scrub information, how did it arrive at "1 scrub errors" in the first place?
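
For reference, here is a minimal sketch of what the page could suggest at this point. It assumes the message means that no scrub results are currently held for the PG, so a fresh deep scrub is needed before the objects can be listed; that reading is an assumption, not something the linked page states. `rescrub_and_list` and `SCRUB_POLL_SECONDS` are hypothetical names introduced for illustration; `ceph pg deep-scrub` and `rados list-inconsistent-obj` are the existing commands.

```shell
# Sketch: trigger a deep scrub on one PG, then list its inconsistent objects.
# Assumption: "No scrub information available" means there are no current
# scrub results for the PG, so a new deep scrub has to finish first.
rescrub_and_list() {
    pgid="$1"
    tries="${2:-30}"    # maximum number of listing attempts

    # Request a deep scrub; the command returns immediately and the scrub
    # itself runs in the background on the OSDs.
    ceph pg deep-scrub "$pgid" || return 1

    # Retry the listing until it succeeds or the attempts run out.
    while [ "$tries" -gt 0 ]; do
        if rados list-inconsistent-obj "$pgid" --format=json-pretty; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "${SCRUB_POLL_SECONDS:-60}"
    done
    return 1
}
```

With the PG from above, this would be called as `rescrub_and_list 2.87`. How long a deep scrub takes depends on the PG size and cluster load, so the polling interval and attempt count are guesses.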

2. "More Information on PG Repair" is unclear

I find the wording of https://docs.ceph.com/en/quincy/rados/operations/pg-repair/#more-information-on-pg-repair very difficult to follow.

For example,

"This material is relevant for Filestore, but not for BlueStore"

Which material? The text that follows? Or everything that I just read above? Unclear.

"Only one of the possible cases is consistent."

Which cases? Are there multiple cases being listed somewhere, but they aren't enumerated properly or some formatting got lost?

"If pg repair finds an inconsistent replicated pool, it marks the inconsistent copy as missing"

So far in the terminology, it was PGs that could be inconsistent. What is an "inconsistent pool"? One with at least one inconsistent PG?

What's "the inconsistent copy" referring to? A copy of the pool? Also, what does it mean for it to be missing, and what is the administrator, or Ceph, supposed to do then?

"In the case of replicated pools, recovery is beyond the scope of `pg repair`."

So what then is in the scope of `pg repair`? Only recovery of EC pools? Would be nice to say that first.

It would also be nice to say what one should do when one has a replicated pool.

"The whole object is not read while the checksum is recalculated."

Very unclear to me what this means. Is it read afterwards? Or does this mean the object "cannot be read" while checksum calculation has a lock on it?

I also find that the whole section doesn't explain what `pg repair` actually does. There are multiple issues with that:

"If pg repair finds an inconsistent PG, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy"

OK, so the digest is fixed, but what about the data itself? Is that also copied/corrected somehow?

Further, in my case, the disk apparently first gave some read error (presumably creating the inconsistency) and then died completely.

In this case, what does "overwrite the digest of the inconsistent copy" mean? The disk is gone so we cannot overwrite any digests on it.

I hope these questions can help a Ceph developer add more structure and clarity to this part of the documentation.
