Bug #5173 (closed): ceph scrub found missing pg object

Added by Ivan Kudryavtsev almost 11 years ago. Updated almost 11 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

I'm using ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca).
All data is replicated three times (pool size = 3).

This showed up in the log during the last day:

2013-05-26 07:25:45.627279 7f2279192700 0 log [ERR] : 2.df osd.35 missing 128ef5df/rb.0.3573.238e1f29.00000010d0cd/head//2
2013-05-26 07:25:45.627283 7f2279192700 0 log [ERR] : 2.df osd.11 missing 128ef5df/rb.0.3573.238e1f29.00000010d0cd/head//2
2013-05-26 07:38:23.290418 7f2279192700 0 log [ERR] : 2.df deep-scrub stat mismatch, got 8101/8102 objects, 0/0 clones, 9758007408/9758011504 bytes.
2013-05-26 07:38:23.290472 7f2279192700 0 log [ERR] : 2.df deep-scrub 1 missing, 0 inconsistent objects
2013-05-26 07:38:23.290476 7f2279192700 0 log [ERR] : 2.df deep-scrub 3 errors

ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.df is active+clean+inconsistent, acting [35,11,18]
1 scrub errors

As far as I can understand from the log, the object could not be found on 2 of the 3 nodes. How can this be? The probability of two OSDs failing simultaneously is very small, isn't it? And how do I fix it?
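
For reference, the generic sequence to re-check and then repair an inconsistent PG is roughly the following (just a sketch; the PG id is the one reported by ceph health detail above, and the repair step is what ended up being run, see the comments below):

# show which PGs are inconsistent and which OSDs are acting for them
ceph health detail

# optional: dump the full PG state
ceph pg 2.df query

# re-run the deep scrub to confirm the errors are still present
ceph pg deep-scrub 2.df

# ask the primary to repair the PG from the other replicas
ceph pg repair 2.df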

I went looking for the object file on the OSDs and found:

root@ceph-osd-3-1:/srv/ceph/osd35/current/2.df_head/DIR_F/DIR_D/DIR_5# ls rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2
rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2

root@ceph-osd-3-1:/srv/ceph/osd35/current/2.df_head/DIR_F/DIR_D/DIR_5# stat rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2
File: «rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2»
Size: 4096 Blocks: 16 IO Block: 4096 regular file
Device: 8d0h/2256d Inode: 2240333280 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-05-26 14:19:31.917155860 +0700
Modify: 2013-05-26 14:19:32.353155849 +0700
Change: 2013-05-26 14:19:32.369155851 +0700

root@ceph-osd-2-1:/srv/ceph/osd18/current/2.df_head/DIR_F/DIR_D/DIR_5/DIR_F# stat rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2
File: «rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2»
Size: 4096 Blocks: 16 IO Block: 4096 regular file
Device: fe08h/65032d Inode: 1373027321 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-04-06 19:55:14.751128701 +0700
Modify: 2013-04-06 19:55:14.751128701 +0700
Change: 2013-04-06 19:55:14.751128701 +0700

root@ceph-osd-1-1:/srv/ceph/osd11/current/2.df_head/DIR_F/DIR_D/DIR_5/DIR_F# stat rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2
File: «rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2»
Size: 4096 Blocks: 16 IO Block: 4096 regular file
Device: fe03h/65027d Inode: 1342187995 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2013-05-26 14:19:45.324109104 +0700
Modify: 2013-05-26 14:19:45.356109105 +0700
Change: 2013-05-26 14:19:45.360109105 +0700

It exists on all OSD devices. What's wrong?
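
Since the DIR_* nesting is not the same on every OSD (note the extra DIR_F on osd18 and osd11), the easiest way to locate the replica on a host is something like this (paths follow my layout above):

# find the on-disk file for the object reported by the scrub error;
# the filename ends with the hash from the log line (128ef5df -> 128EF5DF)
find /srv/ceph/osd35/current/2.df_head \
    -name 'rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2' \
    -exec stat {} \;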

Actions #1

Updated by Ivan Kudryavtsev almost 11 years ago

All copies have the same md5 sum:

620f0b67a91f7f74151bc5be745b7110
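
Checked roughly like this (hostnames and paths as above; find is used because the DIR_* nesting differs per OSD):

# compare checksums of the replica on each OSD host
for h in ceph-osd-3-1 ceph-osd-2-1 ceph-osd-1-1; do
    ssh "$h" "find /srv/ceph/osd*/current/2.df_head -name 'rb.0.3573.238e1f29.00000010d0cd__head_128EF5DF__2' | xargs md5sum"
done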
Actions #2

Updated by Ivan Kudryavtsev almost 11 years ago

Ran ceph pg repair 2.df

Finally, I unmounted all OSDs one by one, checked the XFS filesystems, and mounted them back with barriers enabled (they were mounted with nobarrier before).
After remounting, I ran the repair again and it worked; the PG was repaired.
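
Roughly the sequence per OSD (a sketch; the init command and the device name are just examples for my setup, adjust to yours):

# stop the OSD daemon (sysvinit here; use whatever your init system provides)
service ceph stop osd.35

# unmount and check XFS (xfs_repair needs the fs unmounted; -n is a read-only check)
umount /srv/ceph/osd35
xfs_repair -n /dev/sdX1    # /dev/sdX1 is a placeholder for the OSD data disk
xfs_repair /dev/sdX1       # only if the check found problems

# mount it back with barriers enabled (it was nobarrier before)
mount -t xfs -o barrier /dev/sdX1 /srv/ceph/osd35

# start the OSD again
service ceph start osd.35

# repeated for osd.11 and osd.18, then re-ran the repair
ceph pg repair 2.df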

Actions #3

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Can't reproduce