Fix #6109

pg <pgid> mark_unfound_lost fails if a completely-gone OSD still in map

Added by Dan Mick over 10 years ago. Updated about 1 year ago.

Status: New
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The cluster on mira045 et al. had a bad disk on osd.25. The OSD was marked out and much of its data was extracted,
but for some reason one PG (2.1b7) wouldn't recover. osd.25 was then taken down and mark_unfound_lost revert was
tried as a repair; it fails with

Error EINVAL: pg has 32 objects but we haven't probed all sources, not marking lost

apparently because the OSDMap still considers osd.25 a possible source, even though it is no longer
in the CRUSH map and has in fact been removed with "osd rm".
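
One way to see which sources the PG still wants to probe is to inspect the output of ceph pg <pgid> query. The following Python sketch is an illustration and not part of the original report. It assumes the query JSON has a recovery_state list whose entries may carry a might_have_unfound array of {"osd", "status"} records, and that a status of "already probed" means the source has been checked. The function name unprobed_sources and the sample data are hypothetical.

```python
import json


def unprobed_sources(pg_query_json: str) -> list[tuple[str, str]]:
    """Return (osd, status) pairs for sources that are not marked as probed.

    Assumes the JSON layout described in the lead-in; adjust the keys if
    your Ceph release reports them differently.
    """
    data = json.loads(pg_query_json)
    pending = []
    for state in data.get("recovery_state", []):
        for src in state.get("might_have_unfound", []):
            if src.get("status") != "already probed":
                pending.append((str(src.get("osd")), src.get("status")))
    return pending


# Hypothetical sample shaped like the assumed `ceph pg 2.1b7 query` output.
sample = json.dumps({
    "recovery_state": [
        {"name": "Started/Primary/Active",
         "might_have_unfound": [
             {"osd": "12", "status": "already probed"},
             {"osd": "25", "status": "osd is down"},
         ]},
    ]
})

print(unprobed_sources(sample))
```

If an OSD that no longer exists shows up in this list, that would be consistent with the EINVAL above: the PG is still waiting on a source it can never probe.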

History

#1 Updated by Dan Mick over 10 years ago

  • Category set to OSDMap
  • Assignee set to Samuel Just
  • Source changed from other to Development

#2 Updated by Sage Weil over 10 years ago

  • Priority changed from Normal to High

#3 Updated by Sage Weil over 10 years ago

  • Target version set to v0.69

#4 Updated by Sage Weil over 10 years ago

  • Tracker changed from Bug to Fix

#5 Updated by Sage Weil over 10 years ago

  • Story points set to 3.00

#6 Updated by Ian Colle over 10 years ago

  • Target version changed from v0.69 to v0.70

#7 Updated by Samuel Just over 10 years ago

  • Target version deleted (v0.70)

#8 Updated by Samuel Just about 10 years ago

  • Assignee deleted (Samuel Just)

#9 Updated by Loïc Dachary over 9 years ago

Is there a known workaround?

#10 Updated by Loïc Dachary over 9 years ago

Workaround suggested by Craig Lewis: recreate the OSDs that Ceph wants to probe. They don't need to hold any data; it's probably better if they don't. Even ceph osd lost 2 won't help, because Ceph won't mark the data lost until it has exhausted all possibilities.

#11 Updated by Sébastien Han over 9 years ago

I'm having a similar issue: I have one unfound object that I can't delete. I'm also getting the "Error EINVAL: pg has 32 objects but we haven't probed all sources, not marking lost" message.

Everything runs on 0.80.5

ceph pg 3.380 list_missing

{ "offset": { "oid": "",
      "key": "",
      "snapid": 0,
      "hash": 0,
      "max": 0,
      "pool": -1,
      "namespace": ""},
  "num_missing": 1,
  "num_unfound": 1,
  "objects": [
        { "oid": { "oid": "rbd_data.1982746cc8388.000000000000034c",
              "key": "",
              "snapid": -2,
              "hash": 959071104,
              "max": 0,
              "pool": 3,
              "namespace": ""},
          "need": "3459'1083816",
          "have": "3405'1083704",
          "locations": []}],

When I trigger "ceph pg 3.380 mark_unfound_lost revert", the OSDs responsible for this object crash.

osdmap e5797 pool 'vms' (3) object 'rbd_data.1982746cc8388.000000000000034c' -> pg 3.392a4380 (3.380) -> up ([15,5,10], p15) acting ([1,6,9], p1)

OSD dump:
http://pastebin.com/QkwyStZM
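
The list_missing output in this comment can be summarized programmatically. The Python sketch below is illustrative only: it assumes a complete JSON document with the objects and locations fields shown in the paste (the paste itself is cut off, so the sample here is a trimmed, hypothetical stand-in), and the function name unfound_without_location is made up for this example.

```python
import json


def unfound_without_location(list_missing_json: str) -> list[str]:
    """Return the names of listed objects that report no known location."""
    data = json.loads(list_missing_json)
    return [
        obj["oid"]["oid"]
        for obj in data.get("objects", [])
        if not obj.get("locations")
    ]


# Trimmed, hypothetical sample built from the fields visible in the paste.
sample = json.dumps({
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {"oid": {"oid": "rbd_data.1982746cc8388.000000000000034c", "pool": 3},
         "need": "3459'1083816",
         "have": "3405'1083704",
         "locations": []},
    ],
})

print(unfound_without_location(sample))
```

An empty locations list, as in the paste, means no OSD is currently known to hold the needed version of the object.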

#12 Updated by shawn chen about 8 years ago

@Samuel Just, I also hit this problem. Has it been solved?

#13 Updated by Patrick Donnelly about 5 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSDMap)
  • Component(RADOS) Monitor added

#14 Updated by Raimund Sacherer about 1 year ago

Hello,

I just had a customer facing this same issue. For the record: at least since Luminous, marking the OSD lost works, and you can run mark_unfound_lost afterwards.

So I assume you can close this ticket now!

Thank you,
best regards
Raimund Sacherer
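
The sequence described in this comment (mark the OSD lost, then mark the unfound objects lost) could be scripted roughly as follows. This Python sketch only builds and prints the commands; it does not run them, since both are destructive. The helper name lost_then_mark_unfound is hypothetical, and the OSD id and PG id are taken from the ticket description purely as examples.

```python
import shlex


def lost_then_mark_unfound(osd_id: int, pgid: str, mode: str = "revert") -> list[list[str]]:
    """Build the two ceph invocations, in order, as argv lists.

    mode is the mark_unfound_lost action: "revert" or "delete".
    """
    if mode not in ("revert", "delete"):
        raise ValueError("mode must be 'revert' or 'delete'")
    return [
        # Declare the OSD permanently lost so the PG stops waiting to probe it.
        ["ceph", "osd", "lost", str(osd_id), "--yes-i-really-mean-it"],
        # Then give up on the unfound objects in the PG.
        ["ceph", "pg", pgid, "mark_unfound_lost", mode],
    ]


# Print the commands for review instead of executing them.
for argv in lost_then_mark_unfound(25, "2.1b7"):
    print(shlex.join(argv))
```

Reviewing the printed commands before running them by hand is deliberate: once an OSD is marked lost and objects are reverted or deleted, the data cannot be recovered from that source.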
