Documentation #61739


Improve "Repairing PG Inconsistencies" page

Added by Niklas Hambuechen 11 months ago. Updated 10 months ago.

Status: New
Priority: Normal
Assignee: -
Category: documentation
Target version: -
% Done: 0%
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

My cluster currently has "1 scrub errors" and "1 pg inconsistent" so I'm reading the docs at

https://docs.ceph.com/en/quincy/rados/operations/pg-repair/

and hit a couple of roadblocks that I find make the docs confusing to the administrator. I'll list them here in the hope that somebody can improve them.
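
(For context, this is roughly how those messages show up, e.g. in `ceph health detail`; the acting set shown below is made up, the PG id is the one discussed in the rest of this ticket:)

ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 2.87 is active+clean+inconsistent, acting [4,1,7]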

1. Suggested commands for diagnosis do not have expected output

To see a list of inconsistent PGs, run the following command:

rados list-inconsistent-pg {pool}

OK, that prints

rados list-inconsistent-pg mypool
["2.87"]

To see a list of inconsistent RADOS objects, run the following command:

rados list-inconsistent-obj {pgid}

# rados list-inconsistent-obj 2.87        
No scrub information available for pg 2.87
error 2: (2) No such file or directory

Uh, what now, what does that mean?

Also, if there's no scrub information, how did it arrive at "1 scrub errors" in the first place?
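
(My guess is that one first has to get the PG deep-scrubbed again so that fresh scrub information exists, roughly like this, but the page doesn't say so:)

ceph pg deep-scrub 2.87
rados list-inconsistent-obj 2.87 --format=json-pretty

with the second command only run after the deep scrub has actually completed.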

2. "More Information on PG Repair" is unclear

I find the wording of https://docs.ceph.com/en/quincy/rados/operations/pg-repair/#more-information-on-pg-repair very difficult to follow.

For example,

This material is relevant for Filestore, but not for BlueStore

Which material? The text that follows? Or everything that I just read above? Unclear.

Only one of the possible cases is consistent.

Which cases? Are there multiple cases being listed somewhere, but they aren't enumerated properly or some formatting got lost?

If pg repair finds an inconsistent replicated pool, it marks the inconsistent copy as missing

So far in the terminology, it was PGs that could be inconsistent. What is an "inconsistent pool"? One with at least one inconsistent PG?

What's "the inconsistent copy" referring to? A copy of the pool? Also what does it mean for it to be missing / what is one, or Ceph, supposed to do then?

In the case of replicated pools, recovery is beyond the scope of `pg repair`.

So what then is in the scope of `pg repair`? Only recovery of EC pools? Would be nice to say that first.

It would also be nice to say what one should do when one has a replicated pool.

The whole object is not read while the checksum is recalculated.

Very unclear to me what this means. Is it read afterwards? Or does this mean the object "cannot be read" while checksum calculation has a lock on it?

I also find that the whole section doesn't explain what `pg repair` does; multiple issues with that:

If pg repair finds an inconsistent PG, it attempts to overwrite the digest of the inconsistent copy with the digest of the authoritative copy

OK, so the digest is fixed, but what about the data itself? Is that also copied/corrected somehow?

Further, in my case, apparently the disk first gave some read errors (supposedly creating the inconsistency) and then died completely.

In this case, what does "overwrite the digest of the inconsistent copy" mean? The disk is gone so we cannot overwrite any digests on it.

I hope these questions can help a Ceph developer to add more structure and clarity into this part of the documentation.

Actions #1

Updated by Niklas Hambuechen 11 months ago

Regarding

# rados list-inconsistent-obj 2.87       
No scrub information available for pg 2.87
error 2: (2) No such file or directory

this seems to be this bug, which was closed as "Can't reproduce" immediately 7 years ago, with many people reproducing it afterwards:

https://tracker.ceph.com/issues/15781

Could somebody reopen it?

Actions #2

Updated by Niklas Hambuechen 11 months ago

Some more trouble with the related page https://docs.ceph.com/en/quincy/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent

There is only one consistent state, but in the worst case, we could have different inconsistencies in multiple perspectives found in more than one objects.

Again this seems to refer to unintroduced terminology.

  • What's a "perspective" found in more than one object?

You can repair the inconsistent placement group by executing:

ceph pg repair {placement-group-ID}

Which overwrites the bad copies with the authoritative ones. In most cases, Ceph is able to choose authoritative copies from all available replicas using some predefined criteria. But this does not always work. For example, the stored data digest could be missing, and the calculated digest will be ignored when choosing the authoritative copies. So, please use the above command with caution.

First it tells to try `ceph pg repair`. But then it also says "use it with caution". It describes in which case caution is needed, but it does not explain how I can check whether this case is occurring in my cluster. Which commands to run, what logs to check?

Currently it's a bit "this command may help or screw you, but we don't tell you how to check which one it will be".
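
(What I would have expected here is at least a pointer to inspect the scrub results first and only then repair, roughly along these lines; the exact JSON fields to look at are my guess:)

rados list-inconsistent-obj 2.87 --format=json-pretty
ceph pg repair 2.87

where one checks the per-shard "errors" and digest fields in the first command's output to see which copy Ceph considers bad, and only runs the second command once satisfied that an authoritative copy exists.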

Actions #3

Updated by Niklas Hambuechen 11 months ago

It would also be nice to say what one should do when one has a replicated pool.

I found one suggested answer here:

https://ceph-users.ceph.narkive.com/2OSM6EuW/scrub-error-how-does-ceph-pg-repair-work#post7

In the case that an inconsistent PG is caused by a failed disk read, you don't need to run ceph pg repair at all.
Instead, since your drive is bad, stop the osd process, mark that osd out.
After backfilling has completed and the PG is re-scrubbed, you will find it is consistent again.
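
(In command form, my understanding of that advice is roughly the following, with osd.5 standing in for whatever OSD sits on the failed drive, on a plain systemd deployment; this is my sketch, not taken from the docs:)

systemctl stop ceph-osd@5
ceph osd out 5
ceph pg deep-scrub 2.87

where the deep scrub is only requested after backfill has finished.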

Actions #4

Updated by Niklas Hambuechen 11 months ago

In the same thread https://ceph-users.ceph.narkive.com/2OSM6EuW/scrub-error-how-does-ceph-pg-repair-work

somebody is asking another question that I think should be documented:

What impact does it have on the cluster when an OSD is in "inconsistent" state?

Does that mean that any IO to this pg is blocked until it is repaired?

There is no answer in the thread.

I would imagine that it should not have an impact given that there are other replicas of the data, but it would be good for the docs to explain it explicitly.
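
(My own impression, not confirmed by the docs, is that such a PG stays in a state like "active+clean+inconsistent" and keeps serving client IO, which one can check with something like:)

ceph pg 2.87 query | grep state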

From the post above, I think it also should be made super clear what Ceph is going to do by itself to get rid of the inconsistency, and when the user needs to do something.

Actions #5

Updated by Niklas Hambuechen 10 months ago

I have learned more since then.

The information in the "Repairing PG Inconsistencies" page is misleading.

In the thread

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GFUPBQAPKNKFTV3MPCFQZPE6J2H2NFYC/#X34ZMOUO37UNTUWSGO2PN2MMCHUTXWFU

I discovered that the correct way to fix a PG inconsistency after replacing a failed disk is to simply ensure that a scrub runs.

The issue there was that the `ceph pg deep-scrub` I invoked was never scheduled because Ceph always picked some other scrub to do first on the relevant OSD.
Increasing `osd_max_scrubs` beyond 1 made it possible to force the scrub to start.
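
(Concretely, what ended up working was roughly the following; exact syntax written down from memory, so double-check it:)

ceph config set osd osd_max_scrubs 2
ceph pg deep-scrub 2.87

and then watching `ceph -s` / `ceph health detail` until the deep scrub had run and the errors cleared.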

I conclude that most of the information online, including the Ceph docs, does not give the correct advice when recommending `ceph pg repair`.
Instead, the docs should make clear that a scrub will fix such issues without involvement of `ceph pg repair`.

I think the docs should be improved, because a disk failing and being replaced is an extremely common operation for a storage cluster.

I would usually make a PR to improve them, but unfortunately I still don't understand 90% of the rest of this docs section, so I do not feel equipped for it.
