Bug #14521

Failure on restart after repairing corrupted PG

Added by Evgeniy Firsov about 5 years ago. Updated about 5 years ago.

Status: Won't Fix
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: other
Tags: repair, meta, corruption, file exists
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Test case:
1. Start 2 OSDs with 8 PGs, pool size = 2.
2. Run a write workload for some time.
3. Stop the workload.
4. rm -rf dev/osd0/current/0.2_head/*
5. ceph osd scrub 0
6. ceph pg repair 0.2
7. Restart the OSD.
8. Get "error (17) File exists not handled on operation"

The root cause is that the "head" meta file is not restored by pg repair, so
all omap_get/setkeys calls fail for that PG.

On restart, load_pgs skips that PG because it cannot read its metadata. Later,
when the OSD tries to recreate the PG, it hits the error from the test case,
because all the data files are already in place, restored by repair.
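
The restart sequence described above can be illustrated with a toy model. This is a sketch, not Ceph code: `ToyStore`, `create_pg`, `load_pgs`, and `simulate_restart_failure` are hypothetical names, and the layout (a `meta` directory holding one file per PG next to `<pgid>_head` data directories) is an assumption made only for illustration. It shows how skipping a PG whose metadata is unreadable and then trying to create it again runs into errno 17 (EEXIST) when the data directory is still present.

```python
import errno
import os
import tempfile


class ToyStore:
    """Toy stand-in for an OSD object store (hypothetical, not Ceph's API)."""

    def __init__(self, root):
        self.root = root
        # Assumed layout: per-PG metadata lives in a separate "meta" directory.
        self.meta_dir = os.path.join(root, "meta")
        os.makedirs(self.meta_dir, exist_ok=True)

    def pg_dir(self, pgid):
        return os.path.join(self.root, f"{pgid}_head")

    def meta_path(self, pgid):
        return os.path.join(self.meta_dir, f"{pgid}_head")

    def create_pg(self, pgid):
        # os.mkdir raises FileExistsError (EEXIST) if the directory exists.
        os.mkdir(self.pg_dir(pgid))
        with open(self.meta_path(pgid), "w") as f:
            f.write("pg info\n")

    def load_pgs(self):
        """Return the PG ids whose metadata is readable; skip the rest."""
        loaded = []
        for name in sorted(os.listdir(self.root)):
            if not name.endswith("_head"):
                continue
            pgid = name[: -len("_head")]
            if os.path.isfile(self.meta_path(pgid)):
                loaded.append(pgid)
        return loaded


def simulate_restart_failure():
    """Model the state after repair: data directory present, metadata missing."""
    with tempfile.TemporaryDirectory() as root:
        store = ToyStore(root)
        store.create_pg("0.2")
        os.remove(store.meta_path("0.2"))  # metadata was not restored
        loaded = store.load_pgs()          # the PG is skipped on "restart"
        try:
            store.create_pg("0.2")         # later attempt to recreate the PG
        except OSError as exc:
            return loaded, exc.errno
        return loaded, None
```

In this model, `simulate_restart_failure()` returns an empty list of loaded PGs together with `errno.EEXIST`, mirroring the "File exists" error in step 8 of the test case.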

History

#1 Updated by Sage Weil about 5 years ago

  • Status changed from New to Won't Fix

We don't plan to handle this level of damage. Recreate the OSD.

#2 Updated by Evgeniy Firsov about 5 years ago

The problem is that the corruption is silent: scrub and repair don't report any errors. After a year of running, all replicas may be affected, so a short, planned downtime may turn into a disaster where no node can start and there are no replicas to recover from.

Leaving the fix here for reference: https://github.com/ceph/ceph/pull/7470
