Bug #14521: Failure on restart after repairing corrupted PG - Ceph - Ceph

closed

Added by Evgeniy Firsov about 8 years ago. Updated about 8 years ago.

Status: Won't Fix
Priority: Normal
Assignee: Evgeniy Firsov
Source: other
Tags: repair, meta, corruption, file exists
Severity: 3 - minor

Description

Test case:
1. Start 2 OSDs with 8 PGs and pool size = 2.
2. Run a write workload for some time.
3. Stop the workload.
4. rm -rf dev/osd0/current/0.2_head/*
5. ceph osd scrub 0
6. ceph pg repair 0.2
7. Restart the OSD.
8. Observe the error "error (17) File exists not handled on operation".

The root cause is that the "head" meta file is not restored by pg repair, so
all omap_get/setkeys operations fail for that PG.

On restart, load_pgs skips that PG because it can't read the metadata. Later, when
the OSD tries to recreate the PG, it hits the error from the test case, because
all the data files are already in place, restored by repair.
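
The sequence described above can be imitated with plain directories. This is only a toy sketch and does not call Ceph: the temporary path, the object file names, and the `pg_meta` file name are hypothetical stand-ins, and `mkdir` without `-p` is assumed here as a stand-in for the collection-create step. It shows why a PG that is skipped at load time for missing metadata can still collide with its own restored data directory. On Linux, the "File exists" condition is EEXIST, errno 17, which matches the number in the reported error.

```shell
#!/bin/sh
# Toy model: plain directories stand in for an OSD's PG collection.
# Nothing here talks to Ceph; all paths and file names are hypothetical.
work=$(mktemp -d)
pg_dir="$work/current/0.2_head"
meta="$pg_dir/pg_meta"   # hypothetical stand-in for the PG "head" meta file

# State after "pg repair": object files are back, the meta file is not.
mkdir -p "$pg_dir"
touch "$pg_dir/obj_a" "$pg_dir/obj_b"

# Startup pass: a PG whose metadata cannot be read is skipped, not loaded.
if [ -e "$meta" ]; then
    load_result="loaded"
else
    load_result="skipped"
fi

# Later pass: the PG is treated as new, but a plain mkdir (no -p) refuses
# to create a directory that already exists (EEXIST).
if mkdir "$pg_dir" 2>/dev/null; then
    create_result="created"
else
    create_result="exists"
fi

echo "load: $load_result, create: $create_result"
rm -rf "$work"
```

Running the script prints `load: skipped, create: exists`. The two passes disagree about whether the PG is present, which is the inconsistency the report describes.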

Updated by Sage Weil about 8 years ago

Status changed from New to Won't Fix

We don't plan to handle this level of damage. Recreate the OSD.

Updated by Evgeniy Firsov about 8 years ago

The problem is that the corruption is silent: scrub and repair don't report any errors. After a year of running, all replicas may be affected, so a short, planned downtime may turn into a disaster where no node can start and there are no replicas to recover from.

Leaving the fix here for reference: https://github.com/ceph/ceph/pull/7470
