Bug #24526

Mimic OSDs do not start after deleting some pools with size=1

Added by Vitaliy Filippov almost 6 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After a series of test operations that created pools with size=min_size=1 and then deleted them, most OSDs fail to start with the following message:

-1> 2018-06-14 18:21:50.273 7f00cc19b1c0 -1 osd.0 2402 init missing pg_pool_t for deleted pool 16 for pg 16.42; please downgrade to luminous and allow pg deletion to complete before upgrading

This cluster was not upgraded from Luminous, so I think this is a bug: Mimic should not generate database states that lead to the creation of legacy PGs.

History

#1 Updated by Vitaliy Filippov almost 6 years ago

P.S.: This happened right after deleting a pool with size=1. Several OSDs died immediately, and the last error message before `please downgrade to luminous` was:

2018-06-14 18:17:19.388 7f69f667a700 -1 /root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: In function 'int OSDService::get_deleted_pool_pg_num(int64_t)' thread 7f69f667a700 time 2018-06-14 18:17:19.385362
/root/rpmbuild/BUILD/ceph-13.2.0/src/osd/OSD.cc: 1430: FAILED assert(r >= 0)

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f6a0e47d45f]
 2: (()+0x284627) [0x7f6a0e47d627]
 3: (()+0x3b539f) [0x55e072b5e39f]
 4: (OSDService::identify_split_children(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, spg_t, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xea) [0x55e072b5e48a]
 5: (OSDShard::identify_splits(std::shared_ptr<OSDMap const> const&, std::set<spg_t, std::less<spg_t>, std::allocator<spg_t> >*)+0xfb) [0x55e072b5ea1b]
 6: (OSD::consume_map()+0x215) [0x55e072b6afa5]
 7: (OSD::_committed_osd_maps(unsigned int, unsigned int, MOSDMap*)+0x640) [0x55e072b6c0f0]
 8: (C_OnMapCommit::finish(int)+0x17) [0x55e072bc6c87]
 9: (Context::complete(int)+0x9) [0x55e072b8b829]
 10: (Finisher::finisher_thread_entry()+0x12e) [0x7f6a0e47b9de]
 11: (()+0x7e25) [0x7f6a0afd7e25]
 12: (clone()+0x6d) [0x7f6a0a0c8bad]

#2 Updated by Vitaliy Filippov almost 6 years ago

I worked around this issue by monkey-patching the OSD code:

From a6c789276d1897a6f2426638939ffe0e7853d4bc Mon Sep 17 00:00:00 2001
From: Vitaliy Filippov <vitalif@yourcmc.ru>
Date: Thu, 14 Jun 2018 22:07:42 +0300
Subject: [PATCH 2/2] Ignore "pg without pool" errors

---
 src/osd/OSD.cc | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index b026401..d72082b 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -1427,13 +1427,16 @@ int OSDService::get_deleted_pool_pg_num(int64_t pool)
   ghobject_t oid = OSD::make_final_pool_info_oid(pool);
   bufferlist bl;
   int r = store->read(meta_ch, oid, 0, 0, bl);
-  ceph_assert(r >= 0);
-  auto blp = bl.begin();
-  pg_pool_t pi;
-  ::decode(pi, blp);
-  deleted_pool_pg_nums[pool] = pi.get_pg_num();
-  dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
-  return pi.get_pg_num();
+  if (r >= 0) {
+    auto blp = bl.begin();
+    pg_pool_t pi;
+    ::decode(pi, blp);
+    deleted_pool_pg_nums[pool] = pi.get_pg_num();
+    dout(20) << __func__ << " " << pool << " got " << pi.get_pg_num() << dendl;
+  } else {
+    deleted_pool_pg_nums[pool] = 0;
+  }
+  return deleted_pool_pg_nums[pool];
 }

 OSDMapRef OSDService::_add_map(OSDMap *o)
@@ -2512,7 +2515,7 @@ int OSD::init()
            << pgid.pool() << " for pg " << pgid
            << "; please downgrade to luminous and allow " 
            << "pg deletion to complete before upgrading" << dendl;
-      ceph_abort();
+//      ceph_abort();
     }
       }
     }
-- 
1.8.3.1

But of course I would be very glad if you also fixed it in Mimic itself :)
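
For readers who want to see the idea behind the first hunk of the patch in isolation, here is a minimal, self-contained sketch. It is not Ceph code: `MockStore`, `DeletedPoolCache`, and `read_final_pg_num` are hypothetical stand-ins invented for illustration, and the caching behaviour is an assumption. The sketch only shows the pattern the patch uses, which is to fall back to a pg_num of 0 when the final pool info record cannot be read, instead of asserting.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Hypothetical stand-in for the OSD object store: maps a pool id to the
// pg_num saved in that pool's "final pool info" record, if one exists.
struct MockStore {
  std::map<int64_t, unsigned> final_pool_info;

  // Returns the saved pg_num, or std::nullopt when the record is missing.
  // A missing record models the read that returned r < 0 in the report.
  std::optional<unsigned> read_final_pg_num(int64_t pool) const {
    auto it = final_pool_info.find(pool);
    if (it == final_pool_info.end()) {
      return std::nullopt;
    }
    return it->second;
  }
};

// Hypothetical simplified lookup that mirrors the workaround: remember the
// result per pool and use 0 when the record is missing, rather than aborting.
class DeletedPoolCache {
 public:
  explicit DeletedPoolCache(const MockStore& store) : store_(store) {}

  unsigned get_deleted_pool_pg_num(int64_t pool) {
    auto cached = cache_.find(pool);
    if (cached != cache_.end()) {
      return cached->second;
    }
    // value_or(0) is the tolerant branch; the unpatched code asserted here.
    unsigned pg_num = store_.read_final_pg_num(pool).value_or(0);
    cache_[pool] = pg_num;
    return pg_num;
  }

 private:
  const MockStore& store_;
  std::map<int64_t, unsigned> cache_;
};
```

With a store that holds a record only for pool 7 (pg_num 64), looking up pool 7 yields 64 and looking up pool 16 yields 0 instead of terminating the process. Note that this only hides the missing record so that the OSD can start; it does not explain why the record was never written, which is the actual bug reported here.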

#3 Updated by Igor Fedotov over 5 years ago

  • Project changed from bluestore to RADOS
