Fix #6109
pg <pgid> mark_unfound_lost fails if a completely-gone OSD is still in the map
Description
The cluster on mira045 et al. had a bad disk on osd.25. The OSD was marked out and much of its data was extracted, but for some reason one PG (2.1b7) wouldn't recover. osd.25 was then taken down, and "mark_unfound_lost revert" was tried to repair the PG; it fails with
Error EINVAL: pg has 32 objects but we haven't probed all sources, not marking lost
apparently because the OSDMap still considers osd.25 a possible source, even though it is no longer in the CRUSH map and has in fact been "osd rm"ed.
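The refusal can be pictured with a deliberately simplified model of the primary's check. This is a sketch under assumptions, not Ceph's implementation: it assumes a possible source of unfound objects only counts as resolved once it has been probed or is marked lost in the OSD map. The Source class and can_mark_unfound_lost function are hypothetical names. Under that assumption a removed OSD such as osd.25 can never be probed and is not marked lost, so the check keeps failing with the message quoted above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Source:
    """One OSD that might hold unfound objects (hypothetical model)."""
    osd: int
    queried: bool = False  # the primary has already probed this OSD
    lost: bool = False     # the OSD map marks this OSD as lost


def can_mark_unfound_lost(num_unfound: int, sources: list[Source]) -> tuple[bool, str]:
    """Simplified model of the refusal; not Ceph's actual code."""
    # Any source that is neither probed nor lost might still hold the data.
    pending = [s.osd for s in sources if not (s.queried or s.lost)]
    if pending:
        return False, (f"pg has {num_unfound} objects but we haven't probed "
                       "all sources, not marking lost")
    return True, "ok"


# A removed OSD (like osd.25) can no longer be probed and is not marked lost,
# so in this model it stays pending forever.
ok, msg = can_mark_unfound_lost(32, [Source(3, queried=True), Source(25)])
print(ok, msg)
```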
History
#1 Updated by Dan Mick over 10 years ago
- Category set to OSDMap
- Assignee set to Samuel Just
- Source changed from other to Development
#2 Updated by Sage Weil over 10 years ago
- Priority changed from Normal to High
#3 Updated by Sage Weil over 10 years ago
- Target version set to v0.69
#4 Updated by Sage Weil over 10 years ago
- Tracker changed from Bug to Fix
#5 Updated by Sage Weil over 10 years ago
- Story points set to 3.00
#6 Updated by Ian Colle over 10 years ago
- Target version changed from v0.69 to v0.70
#7 Updated by Samuel Just over 10 years ago
- Target version deleted (v0.70)
#8 Updated by Samuel Just about 10 years ago
- Assignee deleted (Samuel Just)
#9 Updated by Loïc Dachary over 9 years ago
Is there a known workaround?
#10 Updated by Loïc Dachary over 9 years ago
Workaround suggested by Craig Lewis: recreate the OSDs that Ceph wants to probe. The recreated OSD doesn't need to have anything on it; it's probably better if it doesn't. Even "ceph osd lost 2" won't help: Ceph won't mark the data lost until it has exhausted all possibilities.
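To apply this workaround you first need to know which OSDs the PG is still waiting on. A minimal sketch, assuming the JSON printed by "ceph pg <pgid> query" has a recovery_state list whose entries may carry a might_have_unfound list of {"osd", "status"} pairs, as shown in the Ceph troubleshooting documentation; the exact layout and status strings may differ between releases. The function name and the sample input are hypothetical, not output from this cluster. It reports every OSD whose status is anything other than "already probed".

```python
import json


def osds_blocking_mark_lost(pg_query_json: str) -> list:
    """Return OSDs listed in might_have_unfound that were not yet probed.

    Assumes the 'ceph pg <pgid> query' layout described in the Ceph docs;
    field names and status strings may vary between releases.
    """
    query = json.loads(pg_query_json)
    blocking = []
    for state in query.get("recovery_state", []):
        for entry in state.get("might_have_unfound", []):
            if entry.get("status") != "already probed":
                # e.g. "osd is down" or "not queried": Ceph still wants this OSD
                blocking.append(str(entry["osd"]))
    return blocking


# Illustrative input only, not real output from this cluster.
sample = """{"recovery_state": [
  {"name": "Started/Primary/Active",
   "might_have_unfound": [
     {"osd": 3, "status": "already probed"},
     {"osd": 25, "status": "osd is down"}]}]}"""
print(osds_blocking_mark_lost(sample))  # → ['25']
```

In practice you would feed it the output of "ceph pg 2.1b7 query"; each OSD id it returns is a candidate to recreate (empty) so the primary can probe it.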
#11 Updated by Sébastien Han over 9 years ago
I'm having a similar issue: I have one unfound object that I can't delete. I'm also getting the "Error EINVAL: pg has 32 objects but we haven't probed all sources, not marking lost" message.
Everything runs on 0.80.5.
ceph pg 3.380 list_missing
{ "offset": { "oid": "",
      "key": "",
      "snapid": 0,
      "hash": 0,
      "max": 0,
      "pool": -1,
      "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
        { "oid": { "oid": "rbd_data.1982746cc8388.000000000000034c",
              "key": "",
              "snapid": -2,
              "hash": 959071104,
              "max": 0,
              "pool": 3,
              "namespace": ""},
          "need": "3459'1083816",
          "have": "3405'1083704",
          "locations": []}],
When I trigger "ceph pg 3.380 mark_unfound_lost revert", the OSDs responsible for this object crash.
osdmap e5797 pool 'vms' (3) object 'rbd_data.1982746cc8388.000000000000034c' -> pg 3.392a4380 (3.380) -> up ([15,5,10], p15) acting ([1,6,9], p1)
OSD dump:
http://pastebin.com/QkwyStZM
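The list_missing output above can be filtered to the objects that no OSD is known to hold. A minimal sketch, assuming the Firefly-era layout shown here (an "objects" list whose entries carry "oid", "need", "have" and "locations") and assuming an empty "locations" list means no copy is currently known. The function name is hypothetical, and the sample is the objects part of the output above, trimmed and closed off so it parses as JSON.

```python
import json


def unfound_without_locations(list_missing_json: str) -> list:
    """Return (object name, needed version, available version) tuples for
    objects with no known location; assumes the Firefly-era layout."""
    report = json.loads(list_missing_json)
    result = []
    for obj in report.get("objects", []):
        if not obj.get("locations"):
            result.append((obj["oid"]["oid"], obj["need"], obj["have"]))
    return result


# Trimmed from the output above and closed off so it is valid JSON.
sample = """{"num_missing": 1, "num_unfound": 1,
 "objects": [{"oid": {"oid": "rbd_data.1982746cc8388.000000000000034c",
                      "pool": 3},
              "need": "3459'1083816", "have": "3405'1083704",
              "locations": []}]}"""
for name, need, have in unfound_without_locations(sample):
    print(f"{name}: need {need}, have {have}")
```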
#12 Updated by shawn chen about 8 years ago
@Samuel Just, I have also run into this problem. Has it been solved?
#13 Updated by Patrick Donnelly about 5 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSDMap)
- Component(RADOS) Monitor added
#14 Updated by Raimund Sacherer about 1 year ago
Hello,
I just had a customer facing this same issue. For the record: at least since Luminous, marking the OSD lost works, and you can run mark_unfound_lost afterwards.
So I assume you can close this ticket now!
Thank you,
best regards
Raimund Sacherer
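The procedure from the last comment (mark the OSD lost, then run mark_unfound_lost) can be sketched as a small dry-run helper. This is a sketch, not an official tool: the helper names are hypothetical, it only prints the commands unless dry_run is turned off, and it assumes the OSD id is still present in the OSD map, since the comment does not say whether an already removed OSD can be marked lost. "ceph osd lost" requires the --yes-i-really-mean-it flag, and "revert" is the mode the original reporter tried; "delete" is the other mode mark_unfound_lost accepts.

```python
import shlex
import subprocess


def recovery_commands(pgid: str, lost_osds: list, mode: str = "revert") -> list:
    """Build the command sequence: mark each OSD lost, then the unfound objects."""
    if mode not in ("revert", "delete"):
        raise ValueError("mode must be 'revert' or 'delete'")
    cmds = [["ceph", "osd", "lost", str(osd), "--yes-i-really-mean-it"]
            for osd in lost_osds]
    cmds.append(["ceph", "pg", pgid, "mark_unfound_lost", mode])
    return cmds


def run(cmds: list, dry_run: bool = True) -> None:
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises on failure, so we stop before running
            # mark_unfound_lost if an OSD could not be marked lost.
            subprocess.run(cmd, check=True)


# Dry run for the PG and OSD from the original report: prints, executes nothing.
run(recovery_commands("2.1b7", [25]))
```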