Bug #24526
open
Mimic OSDs do not start after deleting some pools with size=1
Added by Vitaliy Filippov almost 6 years ago.
Updated almost 6 years ago.
Description
After a series of test actions that involved creating pools with size=min_size=1 and then deleting them, most OSDs fail to start with the following message:
-1> 2018-06-14 18:21:50.273 7f00cc19b1c0 -1 osd.0 2402 init missing pg_pool_t for deleted pool 16 for pg 16.42; please downgrade to luminous and allow pg deletion to complete before upgrading
This was not an upgrade from Luminous, so I think it's a bug: Mimic should not generate database states that lead to the creation of legacy PGs.
P.S. This happened right after deleting a pool with size=1: several OSDs died immediately, and the last error message before `please downgrade to luminous` was:
2018-06-14 18:17:19.388 7f69f667a700 -1 /root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: In function 'int OSDService::get_deleted_pool_pg_num(int64_t)' thread 7f69f667a700 time 2018-06-14 18:17:19.385362
/root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: 1430: FAILED assert(r >= 0)
ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f6a0e47d45f]
2: (()+0x284627) [0x7f6a0e47d627]
3: (()+0x3b539f) [0x55e072b5e39f]
4: (OSDService::identify_split_children(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, spg_t, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xea) [0x55e072b5e48a]
5: (OSDShard::identify_splits(std::shared_ptr<OSDMap const> const&, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xfb) [0x55e072b5ea1b]
6: (OSD::consume_map()+0x215) [0x55e072b6afa5]
7: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x640) [0x55e072b6c0f0]
8: (C_OnMapCommit::finish(int)+0x17) [0x55e072bc6c87]
9: (Context::complete(int)+0x9) [0x55e072b8b829]
10: (Finisher::finisher_thread_entry()+0x12e) [0x7f6a0e47b9de]
11: (()+0x7e25) [0x7f6a0afd7e25]
12: (clone()+0x6d) [0x7f6a0a0c8bad]
I worked around this issue by monkey-patching the OSD code:
From a6c789276d1897a6f2426638939ffe0e7853d4bc Mon Sep 17 00:00:00 2001
From: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Thu, 14 Jun 2018 22:07:42 +0300
Subject: [PATCH 2/2] Ignore "pg without pool" errors
---
src/osd/OSD.cc | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index b026401..d72082b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -1427,13 +1427,16 @@ int OSDService::get_deleted_pool_pg_num(int64_t pool)
   ghobject_t oid = OSD::make_final_pool_info_oid(pool);
   bufferlist bl;
   int r = store->read(meta_ch, oid, 0, 0, bl);
-  ceph_assert(r >= 0);
-  auto blp = bl.begin();
-  pg_pool_t pi;
-  ::decode(pi, blp);
-  deleted_pool_pg_nums[pool] = pi.get_pg_num();
-  dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
-  return pi.get_pg_num();
+  if (r >= 0) {
+    auto blp = bl.begin();
+    pg_pool_t pi;
+    ::decode(pi, blp);
+    deleted_pool_pg_nums[pool] = pi.get_pg_num();
+    dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
+  } else {
+    deleted_pool_pg_nums[pool] = 0;
+  }
+  return deleted_pool_pg_nums[pool];
 }
 
 OSDMapRef OSDService::_add_map(OSDMap *o)
@@ -2512,7 +2515,7 @@ int OSD::init()
                << pgid.pool() << " for pg " << pgid
                << "; please downgrade to luminous and allow "
                << "pg deletion to complete before upgrading" << dendl;
-          ceph_abort();
+//          ceph_abort();
         }
       }
     }
--
1.8.3.1
But of course I would be very glad if you also fixed it in Mimic itself :)
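For readers without a Ceph tree at hand, the fallback in the first hunk of the patch can be sketched as a standalone program. This is only an illustration under stated assumptions: `DeletedPoolCache`, `Reader` and `fallback_demo` are hypothetical names, not Ceph APIs, and the store read is modeled as a callback that returns `std::nullopt` where the real `store->read` would return a negative value.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <utility>

// Hypothetical stand-in for the OSD's cache of pg_num values for deleted
// pools. None of these names exist in Ceph; they only model the patch.
class DeletedPoolCache {
public:
  // Returns the stored pg_num, or std::nullopt when the final pool info
  // object cannot be read (the r < 0 case in the patch).
  using Reader = std::function<std::optional<unsigned>(int64_t)>;

  explicit DeletedPoolCache(Reader reader) : reader_(std::move(reader)) {}

  unsigned get_pg_num(int64_t pool) {
    auto it = cache_.find(pool);
    if (it != cache_.end()) {
      return it->second;  // already resolved once, reuse the cached value
    }
    // Unpatched behaviour: assert that the read succeeded and crash otherwise.
    // Patched behaviour: treat a failed read as "pool had 0 PGs".
    unsigned pg_num = reader_(pool).value_or(0);
    cache_[pool] = pg_num;
    return pg_num;
  }

private:
  Reader reader_;
  std::map<int64_t, unsigned> cache_;
};

// Small self-check: pool 7 has stored info, pool 16 does not.
bool fallback_demo() {
  DeletedPoolCache cache([](int64_t pool) -> std::optional<unsigned> {
    if (pool == 7) {
      return 64u;
    }
    return std::nullopt;  // simulates the missing final pool info object
  });
  return cache.get_pg_num(7) == 64u && cache.get_pg_num(16) == 0u;
}
```

Returning 0 keeps the OSD running, but it also hides the fact that the pool info is missing, which is why this is a workaround rather than a proper fix.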
- Project changed from bluestore to RADOS