Bug #24526
Mimic OSDs do not start after deleting some pools with size=1
Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After a series of test operations that created pools with size=min_size=1 and then deleted them, most OSDs fail to start with the following message:
-1> 2018-06-14 18:21:50.273 7f00cc19b1c0 -1 osd.0 2402 init missing pg_pool_t for deleted pool 16 for pg 16.42; please downgrade to luminous and allow pg deletion to complete before upgrading
This was not an upgrade from Luminous, so I think it is a bug: Mimic should not generate database states that lead to the creation of legacy PGs.
History
#1 Updated by Vitaliy Filippov almost 6 years ago
P.S.: This happened right after deleting a pool with size=1. Several OSDs died immediately, and the last error message before `please downgrade to luminous` was:
2018-06-14 18:17:19.388 7f69f667a700 -1 /root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: In function 'int OSDService::get_deleted_pool_pg_num(int64_t)' thread 7f69f667a700 time 2018-06-14 18:17:19.385362
/root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: 1430: FAILED assert(r >= 0)
 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f6a0e47d45f]
 2: (()+0x284627) [0x7f6a0e47d627]
 3: (()+0x3b539f) [0x55e072b5e39f]
 4: (OSDService::identify_split_children(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, spg_t, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xea) [0x55e072b5e48a]
 5: (OSDShard::identify_splits(std::shared_ptr<OSDMap const> const&, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xfb) [0x55e072b5ea1b]
 6: (OSD::consume_map()+0x215) [0x55e072b6afa5]
 7: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x640) [0x55e072b6c0f0]
 8: (C_OnMapCommit::finish(int)+0x17) [0x55e072bc6c87]
 9: (Context::complete(int)+0x9) [0x55e072b8b829]
 10: (Finisher::finisher_thread_entry()+0x12e) [0x7f6a0e47b9de]
 11: (()+0x7e25) [0x7f6a0afd7e25]
 12: (clone()+0x6d) [0x7f6a0a0c8bad]
#2 Updated by Vitaliy Filippov almost 6 years ago
I worked around this issue by patching the OSD code:
From a6c789276d1897a6f2426638939ffe0e7853d4bc Mon Sep 17 00:00:00 2001
From: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Thu, 14 Jun 2018 22:07:42 +0300
Subject: [PATCH 2/2] Ignore "pg without pool" errors

---
 src/osd/OSD.cc | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index b026401..d72082b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -1427,13 +1427,16 @@ int OSDService::get_deleted_pool_pg_num(int64_t pool)
   ghobject_t oid = OSD::make_final_pool_info_oid(pool);
   bufferlist bl;
   int r = store->read(meta_ch, oid, 0, 0, bl);
-  ceph_assert(r >= 0);
-  auto blp = bl.begin();
-  pg_pool_t pi;
-  ::decode(pi, blp);
-  deleted_pool_pg_nums[pool] = pi.get_pg_num();
-  dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
-  return pi.get_pg_num();
+  if (r >= 0) {
+    auto blp = bl.begin();
+    pg_pool_t pi;
+    ::decode(pi, blp);
+    deleted_pool_pg_nums[pool] = pi.get_pg_num();
+    dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
+  } else {
+    deleted_pool_pg_nums[pool] = 0;
+  }
+  return deleted_pool_pg_nums[pool];
 }
 
 OSDMapRef OSDService::_add_map(OSDMap *o)
@@ -2512,7 +2515,7 @@ int OSD::init()
              << pgid.pool() << " for pg " << pgid
              << "; please downgrade to luminous and allow "
              << "pg deletion to complete before upgrading" << dendl;
-        ceph_abort();
+//        ceph_abort();
       }
     }
   }
-- 
1.8.3.1
But of course I would be very glad if you also fixed it in Mimic itself :)
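For readers who want to see the fallback from the patch in isolation, here is a minimal standalone sketch. It is not Ceph code: `FinalPoolInfoStore`, `read_final_pg_num` and `DeletedPoolCache` are hypothetical stand-ins for the object store, the `store->read()` plus decode step, and the `deleted_pool_pg_nums` map. The early return on a cached value is also an assumption, since the hunk above only shows the tail of the function.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for the on-disk "final pool info" objects:
// deleted pool id -> pg_num recorded when the pool was removed.
using FinalPoolInfoStore = std::map<int64_t, unsigned>;

// Hypothetical helper that plays the role of store->read() + decode.
// An empty optional corresponds to the r < 0 case in the patch.
std::optional<unsigned> read_final_pg_num(const FinalPoolInfoStore& store,
                                          int64_t pool) {
  auto it = store.find(pool);
  if (it == store.end()) {
    return std::nullopt;
  }
  return it->second;
}

class DeletedPoolCache {
 public:
  explicit DeletedPoolCache(const FinalPoolInfoStore& store) : store_(store) {}

  // Mirrors the patched control flow: no assert on a missing record;
  // the pool is cached with pg_num 0 instead.
  unsigned get_deleted_pool_pg_num(int64_t pool) {
    auto cached = pg_nums_.find(pool);
    if (cached != pg_nums_.end()) {
      return cached->second;  // assumed cache hit path
    }
    unsigned pg_num = read_final_pg_num(store_, pool).value_or(0);
    pg_nums_[pool] = pg_num;
    return pg_num;
  }

 private:
  const FinalPoolInfoStore& store_;
  std::map<int64_t, unsigned> pg_nums_;
};
```

With a store that holds a record for pool 7 but none for pool 16, the lookup for pool 16 yields 0 rather than aborting. Whether every caller can safely handle a pg_num of 0 is not established by this report, so this only illustrates the workaround and should not be read as a recommended fix.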
#3 Updated by Igor Fedotov over 5 years ago
- Project changed from bluestore to RADOS