Backport #13060

osd: hammer: fail to start due to stray pgs after firefly->hammer upgrade

Added by Sage Weil over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: Sage Weil
Target version: v0.94.4
Release: hammer

Description

https://github.com/ceph/ceph/pull/5892

On Fri, 11 Sep 2015, Haomai Wang wrote:

On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil <> wrote:
On Fri, 11 Sep 2015, ?? wrote:

Thanks, Sage Weil:

1. I deleted some testing pools in the past, but it was a long time ago (maybe 2 months ago); during the recent upgrade I did not delete any pools.

2. For ceph osd dump, please see the attachment file ceph.osd.dump.log.

3. For 'debug osd = 20' and 'debug filestore = 20', please see the attachment file ceph.osd.5.log.tar.gz.

This one is failing on pool 54, which has been deleted. In this case you
can work around it by renaming current/54.* out of the way.
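
A minimal sketch of that workaround in Python, assuming the OSD is stopped first and that the PG directories sit directly under current/ with names starting with the pool id, as the current/54.* pattern above suggests. The helper name move_stray_pool_dirs and the backup directory are hypothetical; nothing here is part of Ceph.

```python
import shutil
from pathlib import Path


def move_stray_pool_dirs(current_dir, pool_id, backup_dir):
    """Move every entry named '<pool_id>.*' out of current_dir into backup_dir.

    Hypothetical helper: it only moves entries and never deletes anything,
    so the change can be undone by hand if needed.
    """
    current = Path(current_dir)
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)

    moved = []
    # The glob is anchored at the start of the name, so pool 54 matches
    # '54.0_head' but not '154.0_head' or '5.4_head'.
    for entry in sorted(current.glob(f"{pool_id}.*")):
        target = backup / entry.name
        shutil.move(str(entry), str(target))
        moved.append(entry.name)
    return moved
```

For the OSD in this thread the call might look like move_stray_pool_dirs("/ceph/data5/current", 54, "/ceph/data5/stray-pool-54"), where the backup path is an assumption. Keeping the backup on the same filesystem keeps the move a plain rename.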

4. I installed ceph-test, but the command prints an error:
ceph-kvstore-tool /ceph/data5/current/db list
Invalid argument: /ceph/data5/current/db: does not exist (create_if_missing is false)

Sorry, I should have said current/omap, not current/db. I'm still curious
to see the key dump. I'm not sure why the leveldb key for these pgs is
missing...

Yesterday I had a chat with wangrui, and the reason is that the "infos" object (legacy oid)
is missing. I'm not sure why it's missing.

Probably

https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908

Oh, I think I see what happened:

- the pg removal was aborted pre-hammer.  On pre-hammer, this means that
load_pgs skips it here:
https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L2121
- we upgrade to hammer.  We skip this pg (same reason), don't upgrade it,
but delete the legacy infos object
https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
- now we see this crash...

I think the fix is, in hammer, to bail out of peek_map_epoch if the infos
object isn't present, here

https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L2867

Probably we should restructure so we can return a 'fail' value
instead of a magic epoch_t meaning the same...
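
The idea of returning an explicit failure instead of a magic epoch can be modeled outside of Ceph. The Python sketch below is a simplified stand-in, not the code in OSD.cc or PG.cc: the dict-based store, its key names, and the function bodies are assumptions for illustration, and only the names peek_map_epoch and load_pgs are borrowed from the discussion above.

```python
from typing import Dict, List, Optional

# Toy stand-in for the OSD's on-disk state (an assumption of this sketch):
#   store["pg_epochs"]    maps pgid -> epoch for pgs already in the new format
#   store["legacy_infos"] maps pgid -> epoch, and is absent once the legacy
#                         infos object has been deleted
Store = Dict[str, Dict[str, int]]


def peek_map_epoch(store: Store, pgid: str) -> Optional[int]:
    """Return the pg's map epoch, or None if it cannot be determined."""
    upgraded = store.get("pg_epochs", {})
    if pgid in upgraded:
        return upgraded[pgid]
    legacy = store.get("legacy_infos")
    if legacy is None or pgid not in legacy:
        # Explicit failure value rather than a magic epoch number.
        return None
    return legacy[pgid]


def load_pgs(store: Store, pgids: List[str]) -> Dict[str, int]:
    """Load every pg whose epoch can be read; skip the rest."""
    loaded = {}
    for pgid in pgids:
        epoch = peek_map_epoch(store, pgid)
        if epoch is None:
            # Stray pg left over from an aborted removal: skip it.
            continue
        loaded[pgid] = epoch
    return loaded
```

Checking with `is None` rather than truthiness keeps any real epoch value, including 0, distinct from the failure case, which is the point of avoiding a magic number.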

Associated revisions

Revision f0c925e3 (diff)
Added by Sage Weil over 2 years ago

suites/rados/singleton-nomsgr/all/11429.yaml: double-hop and fix

- simplify this.. lots of extra cruft we don't need
- restart twice at hammer to ensure that we can still load pgs
post-upgrade
- do the same for the final version.

Fixes: #11429 (again, for ~infernalis)
Fixes: #13060
Signed-off-by: Sage Weil <>

History

#1 Updated by Sage Weil over 2 years ago

  • Status changed from Verified to Need Review

#2 Updated by Sage Weil over 2 years ago

reproduced the failure with a modified 11429.yaml... now verifying the fix.

#4 Updated by Sage Weil over 2 years ago

  • Status changed from Need Review to Resolved

#5 Updated by Loic Dachary over 2 years ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Assignee set to Sage Weil
  • Release hammer added

#6 Updated by Loic Dachary over 2 years ago

  • Target version set to v0.94.4
