Bug #50558

Data loss propagation after backfill

Added by Jin Hase 3 months ago. Updated about 2 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
nautilus, octopus, pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Situation:
Data loss on one OSD propagates to other OSDs. If backfill runs while a shard is missing on the primary OSD, the corresponding shard is also missing on the OSD that is the backfill target.
With 4+2 erasure coding, if a single backfill copies to two OSDs, three shards are missing (primary + two copies), which makes data recovery impossible.
Whether the data becomes unrecoverable depends on the erasure coding settings and on the number of copies made during backfill.
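
The shard arithmetic above can be sketched as follows. This is only an illustrative model, not Ceph code: it assumes an MDS code (any k of the k+m shards are enough to rebuild the object), and `EcProfile`, `shards_lost` and `recoverable` are hypothetical names introduced for this example.

```cpp
// Illustrative model only (not Ceph code). Assumes an MDS erasure code:
// an object is recoverable as long as at least k of its k+m shards survive.
struct EcProfile {
  int k;  // data shards
  int m;  // coding shards
};

// Shards lost in the scenario above: the primary's shard plus one shard
// per OSD that was copied to during the backfill.
int shards_lost(int backfill_copies) {
  return 1 + backfill_copies;
}

// True if the surviving shards are still enough to rebuild the object.
bool recoverable(const EcProfile& profile, int lost) {
  int surviving = profile.k + profile.m - lost;
  return lost >= 0 && surviving >= profile.k;
}
```

With `EcProfile{4, 2}`, one backfill copy loses two shards and the object is still recoverable; two copies lose three shards, which exceeds m = 2.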

Environment:
- Ceph version: Master & Nautilus
- Erasure coding: 4+2
- Type: filestore

Steps to Reproduce:
1. Set up more than 6 OSDs (leaving some extra OSDs `out`).
2. Store some objects in the pool.
3. Delete a file from the primary OSD of the PG.
(In the customer environment, the shard on the primary OSD became unreadable because of a medium error on that OSD. To simulate this situation, run `rm`.)
e.g.) rm -f /var/lib/ceph/osd/ceph-7/current/1.0s0_head/<some file>.04.21.09\:55\:*
4. Trigger backfill in the PG.
In this case, I triggered backfill by changing an OSD from `out` to `in`.
e.g.) ceph osd in osd.5
5. `ceph -s` shows active+clean status, but the object is lost on both the primary and the backfilled OSDs.

Estimated Cause:
If the medium error we hit led to an incomplete readdir() result from XFS, Ceph currently does not try to cope with that. However, a readdir() error can be detected by checking the value of errno, so the Ceph code can be modified to handle it.
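
For reference, the errno check mentioned above follows the usual POSIX idiom: readdir() returns NULL both at end of directory and on error, so errno must be cleared before the call and inspected afterwards. The sketch below is not the Ceph filestore code; `list_dir_checked` is a hypothetical helper written only to show the pattern.

```cpp
#include <cerrno>
#include <dirent.h>
#include <string>
#include <vector>

// Hypothetical helper (not Ceph code): list the entries of `path`.
// Returns 0 on success, or a negative errno value if opendir() or
// readdir() fails. On failure `out` is left empty, so a caller can
// never mistake a partial listing for a complete one.
int list_dir_checked(const std::string& path, std::vector<std::string>* out) {
  out->clear();
  DIR* dir = ::opendir(path.c_str());
  if (dir == nullptr) {
    return -errno;
  }
  int ret = 0;
  while (true) {
    errno = 0;  // NULL means "end of directory" only if errno stays 0
    struct dirent* de = ::readdir(dir);
    if (de == nullptr) {
      if (errno != 0) {
        ret = -errno;  // genuine failure, e.g. EIO from a medium error
      }
      break;
    }
    out->push_back(de->d_name);
  }
  ::closedir(dir);
  if (ret != 0) {
    out->clear();  // discard the incomplete listing
  }
  return ret;
}
```

The design choice here is to fail the whole listing rather than return what was read so far, since an incomplete list is exactly what lets the loss propagate.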

Note:
Once the code is modified, the reproduction procedure above may not be suitable for validating the change. Because the `rm` command does not cause a readdir() error, I think it is necessary to use systemtap to force a readdir() error in order to simulate a medium error.


Related issues

Copied to RADOS - Backport #50701: nautilus: Data loss propagation after backfill Resolved
Copied to RADOS - Backport #50702: pacific: Data loss propagation after backfill Resolved
Copied to RADOS - Backport #50703: octopus: Data loss propagation after backfill Resolved

History

#1 Updated by Tomohiro Misono 3 months ago

Hi

I worked with hase-san and submitted a PR to handle readdir errors correctly in the filestore code: https://github.com/ceph/ceph/pull/41080
With this fix, I confirmed that a readdir error (injected manually with systemtap) during a backfill operation brings the primary OSD down, as expected.

Tomohiro

#2 Updated by Kefu Chai 3 months ago

  • Status changed from New to Fix Under Review
  • Backport set to nautilus, octopus, pacific
  • Pull request ID set to 41080

#3 Updated by Tomohiro Misono 3 months ago

For the record, the following is the sequence of data loss propagation when a readdir error happens on filestore during backfill:

(Assumption: the media on the primary is somehow broken and returns a readdir error for certain object entries)

1. The backfill operation starts on the primary (PrimaryLogPG::recover_backfill)
2. The primary prepares the list of objects to be backfilled (PrimaryLogPG::update_range & PrimaryLogPG::scan_range). When filestore is used, the primary accesses the filesystem. Before the fix, a readdir error was silently ignored, so the object list could be incomplete
3. Backfill proceeds based on the incomplete list
4. When backfill finishes and the PG state becomes clean, the primary sends PGRemove messages to the replaced OSD. The replaced OSD then deletes its files
5. As a result, objects that cannot be read on the primary are not copied to the backfill target OSD and are also removed from the replaced OSD
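
The sequence above can be modeled with a small toy simulation. This is a deliberately simplified sketch, not Ceph code: it tracks object names rather than EC shards, and `BackfillOutcome` and `simulate_backfill` are hypothetical names. The `scan_reported_error` flag stands in for the behaviour after the fix, where the readdir error is no longer ignored and backfill does not proceed.

```cpp
#include <set>
#include <string>

using ObjectSet = std::set<std::string>;

// Hypothetical result of one simulated backfill (not a Ceph type).
struct BackfillOutcome {
  bool completed;         // did backfill run to completion?
  ObjectSet on_target;    // objects on the backfill target afterwards
  ObjectSet on_replaced;  // objects still on the replaced OSD afterwards
};

// listed_by_primary:   what the primary's directory scan returned
// scan_reported_error: true if the scan surfaced the readdir error
// on_replaced_before:  objects held by the OSD that is being replaced
BackfillOutcome simulate_backfill(const ObjectSet& listed_by_primary,
                                  bool scan_reported_error,
                                  const ObjectSet& on_replaced_before) {
  if (scan_reported_error) {
    // Error is surfaced: backfill stops and the replaced OSD keeps
    // everything it had.
    return BackfillOutcome{false, ObjectSet(), on_replaced_before};
  }
  // Error is ignored: only the listed objects reach the target (step 3),
  // then the PG goes clean and the replaced OSD deletes its files (step 4).
  return BackfillOutcome{true, listed_by_primary, ObjectSet()};
}
```

If the primary's scan silently drops one object, that object ends up on neither the target nor the replaced OSD (step 5). If the scan reports the error instead, the replaced OSD still holds it.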

#4 Updated by Kefu Chai 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#5 Updated by Backport Bot 3 months ago

  • Copied to Backport #50701: nautilus: Data loss propagation after backfill added

#6 Updated by Backport Bot 3 months ago

  • Copied to Backport #50702: pacific: Data loss propagation after backfill added

#7 Updated by Backport Bot 3 months ago

  • Copied to Backport #50703: octopus: Data loss propagation after backfill added

#8 Updated by Loïc Dachary about 2 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
