Bug #12577

closed

Inconsistent PGs that ceph pg repair does not fix

Added by Andras Pataki almost 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: Hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Summary: I have inconsistent PGs that the 'ceph pg repair' command does not fix. The details are below. Any help would be appreciated.

# I am using ceph 0.94.2 on all machines:
~# ceph-osd -v
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

# Find the inconsistent PGs:
~# ceph pg dump | grep inconsistent
dumped all in format plain
2.439 4208 0 0 0 0 17279507143 3103 3103 active+clean+inconsistent 2015-08-03 14:49:17.292884 77323'2250145 77480:890566 [78,54] 78 [78,54] 78 77323'2250145 2015-08-03 14:49:17.292538 77323'2250145 2015-08-03 14:49:17.292538
2.8b9 4083 0 0 0 0 16669590823 3051 3051 active+clean+inconsistent 2015-08-03 14:46:05.140063 77323'2249886 77473:897325 [7,72] 7 [7,72] 7 77323'2249886 2015-08-03 14:22:47.834063 77323'2249886 2015-08-03 14:22:47.834063
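
For repeated checks, the grep above can be narrowed to print only the PG IDs. This is a minimal sketch: the function name list_inconsistent_pgs is hypothetical, and it assumes the plain-format dump shown above, with one PG per line and the PG ID in the first column.

```shell
# Print the IDs of all PGs whose line mentions "inconsistent".
# Reads plain-format `ceph pg dump` output on stdin, one PG per line.
list_inconsistent_pgs() {
    # Keep only lines whose first field looks like a PG ID (pool.hex),
    # which skips headers and summary lines.
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ && /inconsistent/ { print $1 }'
}

# Example (needs a running cluster):
#   ceph pg dump 2>/dev/null | list_inconsistent_pgs
```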

# Look at the first one:
~# ceph pg deep-scrub 2.439
instructing pg 2.439 on osd.78 to deep-scrub

# The logs of osd.78 show:
2015-08-03 15:16:34.409738 7f09ec04a700  0 log_channel(cluster) log [INF] : 2.439 deep-scrub starts
2015-08-03 15:16:51.364229 7f09ec04a700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.439 b029e439/10000022d93.00000f0c/head//2 on disk data digest 0xb3d78a6e != 0xa3944ad0
2015-08-03 15:16:52.763977 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 2.439 deep-scrub 1 errors
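
To pull the affected objects and the two digest values out of a larger OSD log, something like the following could be used. It is only a sketch: the function name scrub_digest_errors is hypothetical, and the pattern is derived solely from the error lines quoted in this report, so other Ceph versions may word the message differently.

```shell
# Extract "PG object on-disk-digest other-digest" from scrub/repair
# digest-mismatch errors. Reads an OSD log on stdin.
scrub_digest_errors() {
    sed -n -E 's/.*\[ERR\] : (deep-scrub|repair) ([0-9]+\.[0-9a-f]+) ([^ ]+) on disk data digest (0x[0-9a-f]+) != (0x[0-9a-f]+).*/\2 \3 \4 \5/p'
}

# Example, using the log attached to this report:
#   zcat osd-78.log.gz | scrub_digest_errors | sort -u
```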

# Finding the object in question:
~# find ~ceph/osd/ceph-78/current/2.439_head -name 10000022d93.00000f0c* -ls
21510412310 4100 -rw-r--r--   1 root     root      4194304 Jun 30 17:09 /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2
~# md5sum /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2
4e4523244deec051cfe53dd48489a5db  /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2

# The object on the replica OSD (osd.54):
~# find ~ceph/osd/ceph-54/current/2.439_head -name 10000022d93.00000f0c* -ls
6442614367 4100 -rw-r--r--   1 root     root      4194304 Jun 30 17:09 /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2
~# md5sum /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2
4e4523244deec051cfe53dd48489a5db  /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2

# They don't seem to be different.
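
The md5sum comparison above can also be done as a direct byte-for-byte check. The helper below is a sketch with a hypothetical name (same_content); it only confirms that two files on the local filesystem have identical contents and says nothing about the digests Ceph itself records.

```shell
# Report whether two files have identical contents.
# Usage: same_content FILE_A FILE_B
same_content() {
    if cmp -s -- "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}

# Example, using the two replica paths found above:
#   same_content \
#     /var/lib/ceph/osd/ceph-78/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2 \
#     /var/lib/ceph/osd/ceph-54/current/2.439_head/DIR_9/DIR_3/DIR_4/DIR_E/10000022d93.00000f0c__head_B029E439__2
```

If the two copies are identical, as they are here, the reported digest mismatch is presumably not a difference between the replicas themselves. That is an inference from the output above, not something the logs confirm.
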
# When I try repair:
~# ceph pg repair 2.439
instructing pg 2.439 on osd.78 to repair

# The osd.78 logs show:
2015-08-03 15:19:21.775933 7f09ec04a700  0 log_channel(cluster) log [INF] : 2.439 repair starts
2015-08-03 15:19:38.088673 7f09ec04a700 -1 log_channel(cluster) log [ERR] : repair 2.439 b029e439/10000022d93.00000f0c/head//2 on disk data digest 0xb3d78a6e != 0xa3944ad0
2015-08-03 15:19:39.958019 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 2.439 repair 1 errors, 0 fixed
2015-08-03 15:19:39.962406 7f09ec04a700  0 log_channel(cluster) log [INF] : 2.439 deep-scrub starts
2015-08-03 15:19:56.510874 7f09ec04a700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.439 b029e439/10000022d93.00000f0c/head//2 on disk data digest 0xb3d78a6e != 0xa3944ad0
2015-08-03 15:19:58.348083 7f09ec04a700 -1 log_channel(cluster) log [ERR] : 2.439 deep-scrub 1 errors

The inconsistency is not fixed.
I have also tried the following:
 * Stopped the primary OSD, removed the object from the filesystem, restarted the OSD, and issued a repair. This did not work: the repair reported one missing object but did not copy it from the replica.
 * Did the same on the replica (removed the file). The object was likewise not copied back from the primary during the repair.

Any help would be appreciated.

Thanks,

Andras
apataki@simonsfoundation.org

Files

osd-78.log.gz (982 KB), Andras Pataki, 08/03/2015 10:03 PM

Related issues: 1 (0 open, 1 closed)

Copied to Ceph - Backport #12583: Inconsistent PGs that ceph pg repair does not fix (Resolved, David Zafman, 08/03/2015)