Bug #18178

closed

Unfound objects lost after OSD daemons restarted

Added by shawn y over 7 years ago. Updated about 6 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
David Zafman
Category:
Scrub/Repair
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce in both Hammer and Jewel (rough command sketches for the steps follow the list):
1. Create an EC pool; in my case it is named ecpool-01, with k=3, m=2;

2. Fill ecpool-01 with some data;

3. Go into any OSD's xxx/current/ data directory and pick any one object in the EC pool, such as object1717 from

/var/lib/ceph/osd/osd0/current/7.35s3_head/object1717__head_20CC4775__7_ffffffffffffffff_3

where 7 is ecpool-01's pool ID and 7.35 is the PG ID;

4. Look up the same object on all OSDs, such as:

# osd0
/var/lib/ceph/osd/osd0/current/7.35s3_head/object1717__head_20CC4775__7_ffffffffffffffff_3
# osd1
/var/lib/ceph/osd/osd1/current/7.35s1_head/object1717__head_20CC4775__7_ffffffffffffffff_1
# osd2
/var/lib/ceph/osd/osd2/current/7.35s4_head/object1717__head_20CC4775__7_ffffffffffffffff_4
# osd3
/var/lib/ceph/osd/osd3/current/7.35s0_head/object1717__head_20CC4775__7_ffffffffffffffff_0
# osd5
/var/lib/ceph/osd/osd5/current/7.35s2_head/object1717__head_20CC4775__7_ffffffffffffffff_2

5. Corrupt at least 3 of the object's shards, for example, empty shard #0, leave 111 bytes in shard #1 and 222 bytes in shard #2 (with k=3, m=2 only 2 intact shards remain, which is not enough to reconstruct the object);

6. Run a deep scrub on PG 7.35;

7. Observe that "pgs inconsistent" is correctly reported in "ceph status";

8. Run repair on PG 7.35;

9. Observe that 1 "unfound" object is correctly reported in "ceph status", and that all 3 corrupted shards still have their modified sizes, with nothing fixed or touched;

10. Restart the OSD daemons, as some have suggested as a fix, for example in http://tracker.ceph.com/issues/15006;

11. Observe that "ceph status" reports no errors and the cluster shows as healthy, but all 3 corrupted shards still have their modified sizes, with nothing fixed or touched;

12. Repeat the deep scrub and repair, and see the same errors come back.
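
For reference, a rough sketch of the commands behind steps 1 and 2 on a small test cluster (the profile name, PG count, payload and object names are placeholders of mine; only the pool name ecpool-01 and k=3, m=2 come from this report):

# Hammer/Jewel-era syntax; ruleset-failure-domain=osd is only an assumption,
# used here so all 5 shards can land on OSDs of a single test host.
ceph osd erasure-code-profile set ec32profile k=3 m=2 ruleset-failure-domain=osd
ceph osd pool create ecpool-01 64 64 erasure ec32profile

# Fill the pool with some data (step 2); names and sizes are arbitrary.
dd if=/dev/urandom of=/tmp/payload bs=4M count=1
for i in $(seq 1 2000); do rados -p ecpool-01 put "object$i" /tmp/payload; done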
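
Steps 3 and 4 amount to mapping an object to its PG and acting set, then locating the shard files on disk (a sketch; "ceph osd map" is a standard way to get the acting set, and the paths follow the layout shown in step 4):

# Which PG and OSDs hold object1717?
ceph osd map ecpool-01 object1717

# Look for the shard files under each OSD's data directory; the sN suffix in
# the PG directory name (7.35s0 ... 7.35s4) is the shard index.
for id in 0 1 2 3 5; do
    find /var/lib/ceph/osd/osd$id/current -path '*7.35s*_head*' -name 'object1717*'
done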
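
Steps 5 through 11 then look roughly like the following (shard-to-OSD placement is taken from step 4; the report does not say whether the OSDs were stopped while the files were modified, and the restart line assumes a systemd host, so Hammer-era sysvinit/upstart setups will differ):

# Corrupt three of the five shards (step 5): empty s0, 111 bytes in s1, 222 in s2.
truncate -s 0   /var/lib/ceph/osd/osd3/current/7.35s0_head/object1717__head_20CC4775__7_ffffffffffffffff_0
truncate -s 111 /var/lib/ceph/osd/osd1/current/7.35s1_head/object1717__head_20CC4775__7_ffffffffffffffff_1
truncate -s 222 /var/lib/ceph/osd/osd5/current/7.35s2_head/object1717__head_20CC4775__7_ffffffffffffffff_2

# Deep scrub, check, repair, check again (steps 6-9).
ceph pg deep-scrub 7.35
ceph health detail
ceph pg repair 7.35
ceph status

# Restart the OSD daemons (step 10), then re-check (step 11).
for id in 0 1 2 3 5; do systemctl restart ceph-osd@$id; done
ceph status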

The real issue here is that the daemon restart in step 10 should not wipe out the error states and messages.

A minor complaint is that the "unfound object" message is misleading. "Unfound" here actually means "Ceph cannot find CORRECT shards to repair the broken object". It has confused some users, who claimed that they could indeed find the (broken) objects.
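
A minimal way to see this confusion in practice, assuming PG 7.35 from the steps above: the PG reports the object as unfound even though its (corrupted) shard files are plainly still on disk.

# List the objects the PG considers unfound, then the cluster-wide detail.
ceph pg 7.35 list_unfound
ceph health detail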


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #18162: osd/ReplicatedPG.cc: recover_replicas: object added to missing set for backfill, but is not in recovering, error! (Resolved, David Zafman, 12/06/2016)

