Bug #18178

closed

Unfound objects lost after OSD daemons restarted

Added by shawn y over 7 years ago. Updated about 6 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
David Zafman
Category:
Scrub/Repair
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce in both Hammer and Jewel (rough command sketches for the steps follow the list):
1. Create an EC pool; in my case it is named ecpool-01, with k=3, m=2;

2. Fill ecpool-01 with some data;

3. Go into any OSD's xxx/current/ data directory and pick any one object in the EC pool, such as object1717 from

/var/lib/ceph/osd/osd0/current/7.35s3_head/object1717__head_20CC4775__7_ffffffffffffffff_3

where 7 is ecpool-01's pool ID and 7.35 is the PG ID;

4. Look up the same object on all OSDs, such as:

# osd0
/var/lib/ceph/osd/osd0/current/7.35s3_head/object1717__head_20CC4775__7_ffffffffffffffff_3
# osd1
/var/lib/ceph/osd/osd1/current/7.35s1_head/object1717__head_20CC4775__7_ffffffffffffffff_1
# osd2
/var/lib/ceph/osd/osd2/current/7.35s4_head/object1717__head_20CC4775__7_ffffffffffffffff_4
# osd3
/var/lib/ceph/osd/osd3/current/7.35s0_head/object1717__head_20CC4775__7_ffffffffffffffff_0
# osd5
/var/lib/ceph/osd/osd5/current/7.35s2_head/object1717__head_20CC4775__7_ffffffffffffffff_2

5. Corrupt at least 3 of the object's shards, for example, empty shard #0, leave 111 bytes in shard #1 and 222 bytes in shard #2 (with k=3, m=2 only 2 intact shards remain, which is not enough to reconstruct the object);

6. Run a deep scrub on PG 7.35;

7. Observe that "pgs inconsistent" is correctly reported in "ceph status";

8. Run repair on PG 7.35;

9. Observe that 1 "unfound" object is correctly reported in "ceph status", and that all 3 corrupted shards still have their modified sizes, with nothing fixed or touched;

10. Restart the OSD daemons, as some have suggested as a fix, for example in http://tracker.ceph.com/issues/15006;

11. Observe that "ceph status" reports no errors and the cluster shows as healthy, but all 3 corrupted shards still have their modified sizes, with nothing fixed or touched;

12. Repeat the deep scrub and repair, and see the same errors come back.
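
For reference, a rough sketch of the commands behind steps 1 and 2 on a small test cluster (the profile name, PG count, payload and object names are placeholders of mine; only the pool name ecpool-01 and k=3, m=2 come from this report):

# Hammer/Jewel-era syntax; ruleset-failure-domain=osd is only an assumption,
# used here so all 5 shards can land on OSDs of a single test host.
ceph osd erasure-code-profile set ec32profile k=3 m=2 ruleset-failure-domain=osd
ceph osd pool create ecpool-01 64 64 erasure ec32profile

# Fill the pool with some data (step 2); names and sizes are arbitrary.
dd if=/dev/urandom of=/tmp/payload bs=4M count=1
for i in $(seq 1 2000); do rados -p ecpool-01 put "object$i" /tmp/payload; done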
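
Steps 3 and 4 amount to mapping an object to its PG and acting set, then locating the shard files on disk (a sketch; "ceph osd map" is a standard way to get the acting set, and the paths follow the layout shown in step 4):

# Which PG and OSDs hold object1717?
ceph osd map ecpool-01 object1717

# Look for the shard files under each OSD's data directory; the sN suffix in
# the PG directory name (7.35s0 ... 7.35s4) is the shard index.
for id in 0 1 2 3 5; do
    find /var/lib/ceph/osd/osd$id/current -path '*7.35s*_head*' -name 'object1717*'
done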
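
Steps 5 through 11 then look roughly like the following (shard-to-OSD placement is taken from step 4; the report does not say whether the OSDs were stopped while the files were modified, and the restart line assumes a systemd host, so Hammer-era sysvinit/upstart setups will differ):

# Corrupt three of the five shards (step 5): empty s0, 111 bytes in s1, 222 in s2.
truncate -s 0   /var/lib/ceph/osd/osd3/current/7.35s0_head/object1717__head_20CC4775__7_ffffffffffffffff_0
truncate -s 111 /var/lib/ceph/osd/osd1/current/7.35s1_head/object1717__head_20CC4775__7_ffffffffffffffff_1
truncate -s 222 /var/lib/ceph/osd/osd5/current/7.35s2_head/object1717__head_20CC4775__7_ffffffffffffffff_2

# Deep scrub, check, repair, check again (steps 6-9).
ceph pg deep-scrub 7.35
ceph health detail
ceph pg repair 7.35
ceph status

# Restart the OSD daemons (step 10), then re-check (step 11).
for id in 0 1 2 3 5; do systemctl restart ceph-osd@$id; done
ceph status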

The real issue here is that the daemon restart in step 10 should not wipe out the error states and messages.

A minor complaint is that the "unfound object" message is misleading. "Unfound" here actually means "Ceph cannot find CORRECT shards to repair the broken object". It has confused some users, who claimed that they could indeed find the (broken) objects.
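
A minimal way to see this confusion in practice, assuming PG 7.35 from the steps above: the PG reports the object as unfound even though its (corrupted) shard files are plainly still on disk.

# List the objects the PG considers unfound, then the cluster-wide detail.
ceph pg 7.35 list_unfound
ceph health detail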


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #18162: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! (Resolved, David Zafman, 12/06/2016)

