Bug #51254
deep-scrub stat mismatch on last PG in pool
Status: open
Description
In the past few weeks, we have hit inconsistent PGs in deep-scrub a few times, always on the very last PG in the pool:
[root@popeye-mon-0-07 ~]# ceph --version
ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)
[root@popeye-mon-0-07 ~]# ceph pg ls inconsistent
PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP
1.1fff 66209 0 0 0 108207801181 0 0 3049 active+clean+inconsistent 20h 307780'5182385 307780:6211466 [780,1444,273]p780 [780,1444,273]p780 2021-06-15 15:57:11.670864 2021-06-13 19:42:47.051981
Pool 1 has 8192 PGs, so 1fff is exactly the last PG. This has been the case in every instance I've seen.
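The arithmetic behind "1fff is the last PG" is simple enough to verify directly (a throwaway sketch using the pg_num from this report, not anything from the Ceph code base):

```python
# Pool 1 has 8192 placement groups (from `ceph osd pool ls detail`).
pg_num = 8192
# PGs are numbered 0 .. pg_num - 1, and Ceph prints PG ids as <pool>.<hex>.
last_pg = pg_num - 1
pgid = f"1.{last_pg:x}"
print(pgid)  # -> 1.1fff
```

So any bug that singles out PG id pg_num - 1 would show up as 1.1fff on this pool.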
Repeating the deep-scrub also shows the same error:
[root@popeye-mon-0-07 ~]# ceph pg deep-scrub 1.1fff
2021-06-16 12:47:23.008 7fffd2d89700 0 log_channel(cluster) log [DBG] : 1.1fff deep-scrub starts
2021-06-16 13:17:27.430 7fffd2d89700 -1 log_channel(cluster) log [ERR] : 1.1fff deep-scrub : stat mismatch, got 66234/66232 objects, 16496/16495 clones, 66234/66232 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 16483/16482 whiteouts, 108311671045/108307476741 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-06-16 13:17:27.430 7fffd2d89700 -1 log_channel(cluster) log [ERR] : 1.1fff deep-scrub 1 errors
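The scrub error line packs ten counter pairs into one message, each formatted as scrubbed/stored. To see at a glance which counters actually disagree, the line can be pulled apart with a small script (a disposable parsing sketch written for this log format, not part of Ceph's tooling):

```python
import re

# The deep-scrub error line from above, verbatim minus the log prefix.
log_line = ("1.1fff deep-scrub : stat mismatch, got 66234/66232 objects, "
            "16496/16495 clones, 66234/66232 dirty, 0/0 omap, 0/0 pinned, "
            "0/0 hit_set_archive, 16483/16482 whiteouts, "
            "108311671045/108307476741 bytes, 0/0 manifest objects, "
            "0/0 hit_set_archive bytes.")

# Each field is "<scrubbed>/<stored> <counter name>"; report only the mismatches.
for got, stored, name in re.findall(r"(\d+)/(\d+) ([a-z_ ]+?)(?:,|\.)", log_line):
    if got != stored:
        print(f"{name}: scrubbed={got} stored={stored} "
              f"delta={int(got) - int(stored)}")
```

For this PG the mismatched counters are objects, clones, dirty, whiteouts, and bytes; notably the byte delta is exactly 4194304 (4 MiB), consistent with one unaccounted object plus its clone.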
Repairing the PG works correctly:
[root@popeye-mon-0-07 ~]# ceph pg repair 1.1fff
2021-06-16 13:59:14.404 7fffd2d89700 0 log_channel(cluster) log [DBG] : 1.1fff repair starts
2021-06-16 14:30:39.399 7fffd2d89700 -1 log_channel(cluster) log [ERR] : 1.1fff repair : stat mismatch, got 66247/66245 objects, 16496/16495 clones, 66247/66245 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 16483/16482 whiteouts, 108365838605/108361644301 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2021-06-16 14:30:39.399 7fffd2d89700 -1 log_channel(cluster) log [ERR] : 1.1fff repair 1 errors, 1 fixed
We have had 4 recorded instances of this so far - always the last PG in the pool.
Updated by Neha Ojha almost 3 years ago
- Category changed from Scrub/Repair to Tiering
It seems like you are using cache tiering, and similar bugs have been reported before. I don't understand why it would only affect the last PG in the pool, though.
Updated by Andras Pataki almost 3 years ago
We definitely do not use cache tiering on any of our clusters. On the cluster above, we do use snapshots (via CephFS): we create one daily snapshot and remove the oldest, so there are rolling snaptrims on PGs. I have also seen this on our other large cluster, which doesn't use snapshots. Each OSD is on a spinning disk with its db/wal on an NVMe partition.