Bug #50558: Data loss propagation after backfill - RADOS - Ceph

Bug #50558

closed

Data loss propagation after backfill

Added by Jin Hase about 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Backport: nautilus, octopus, pacific
Severity: 2 - major
Pull request ID: 41080

Description

Situation:
Data loss on one OSD has propagated to other OSDs. If backfill is performed while a shard is missing on the primary OSD, the corresponding shard is also missing on the OSD that the backfill targets.
With 4+2 erasure coding, if a single backfill copies to two OSDs, three shards end up missing (primary + two copies). That exceeds the two missing shards that 4+2 can tolerate, so data recovery becomes impossible.
Whether data is actually lost depends on the erasure coding settings and on the number of copies made during backfill.
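
The shard arithmetic above can be sketched as follows. This is only an illustration: the function names `ec_recoverable` and `missing_after_backfill` are hypothetical and are not part of Ceph, and the sketch assumes that the primary's missing shard is propagated to every backfill target, as described in this report.

```c
#include <stdbool.h>

/* Hypothetical helper, not Ceph code: an erasure-coded object with k data
 * shards and m coding shards can be rebuilt only while at most m shards
 * are missing (at least k shards survive). */
bool ec_recoverable(int k, int m, int missing_shards)
{
    if (k <= 0 || m < 0 || missing_shards < 0 || missing_shards > k + m)
        return false; /* reject nonsensical input */
    return missing_shards <= m;
}

/* Shards missing after one backfill, assuming (as reported here) that the
 * primary's missing shard is copied as "missing" to every backfill target. */
int missing_after_backfill(int backfill_targets)
{
    return 1 + backfill_targets; /* primary + each backfilled OSD */
}
```

For the 4+2 case in this report, `missing_after_backfill(2)` gives 3, and `ec_recoverable(4, 2, 3)` is false, whereas a backfill to a single OSD would leave 2 shards missing, which is still within what 4+2 tolerates.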

Environment:
- Ceph version: Master & Nautilus
- Erasure coding: 4+2
- Type: filestore

Steps to Reproduce:
1. Set up more than 6 OSDs (leaving some extra OSDs `out`).
2. Store some objects in the pool.
3. Delete a shard file from the primary OSD of the PG.
(In the customer environment, the shard on the primary OSD actually became unrecognized because of a medium error on that OSD. Running `rm` simulates this situation.)
e.g.) rm -f /var/lib/ceph/osd/ceph-7/current/1.0s0_head/<some file>.04.21.09\:55\:*
4. Trigger backfill in the PG.
In this case, I triggered backfill by changing an OSD from `out` to `in`.
e.g.) ceph osd in osd.5
5. `ceph -s` shows active+clean status, but the object is lost on both the primary and the backfilled OSDs.

Estimated Cause:
If the medium error caused XFS to return an incomplete readdir() result, Ceph does not currently try to cope with that. However, a readdir() error can be detected by checking the value of errno, so the Ceph code can be modified to handle it.
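
As a minimal sketch of the errno check mentioned above (this is not the actual Ceph change, and `count_dir_entries` is a hypothetical helper name): POSIX readdir() returns NULL both at the end of the directory and on error, so the caller has to clear errno before each call and inspect it afterwards. A medium error might surface here as an errno such as EIO, but that is an assumption about the environment.

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper, not Ceph code: count the entries in `path`.
 * Returns 0 on success and stores the count in *out_count.
 * Returns a negative errno value if opendir() or readdir() fails,
 * so an incomplete listing is never mistaken for a complete one. */
int count_dir_entries(const char *path, size_t *out_count)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -errno;

    size_t count = 0;
    int err = 0;
    for (;;) {
        errno = 0; /* readdir() returns NULL for end-of-directory and for errors */
        struct dirent *entry = readdir(dir);
        if (entry == NULL) {
            err = errno; /* 0 means a clean end of directory */
            break;
        }
        count++;
    }
    closedir(dir);

    if (err != 0)
        return -err; /* report the failure instead of a partial result */
    *out_count = count;
    return 0;
}
```

The point of the design is that a failed readdir() is reported to the caller rather than silently treated as "no more entries", which is the case that would make a shard look absent.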

Note:
The reproduction procedure above may not work for validating a code change. Because the `rm` command does not cause a readdir() error, I think systemtap is needed to force a readdir() error in order to simulate the medium error.

Related issues 3 (0 open — 3 closed)

Updated by Tomohiro Misono about 3 years ago

Updated by Kefu Chai about 3 years ago

Updated by Tomohiro Misono almost 3 years ago

Updated by Kefu Chai almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Backport Bot almost 3 years ago

Updated by Loïc Dachary almost 3 years ago