Bug #50230
mon: spawn loop after mon reinstalled
Status: Closed
Description
This is related to #44076. (cluster is running 14.2.19 which has that fix.)
Scenario:
- mon is reinstalled (OS upgraded from el7 to el8).
- mkfs writes the monmap to the store at ("mkfs", "monmap") [1]
- During initial boot, the mon connects to the cluster, gets the latest monmap and sees the addr is different, so it stashes the new map at ("mon_sync", "temp_newer_monmap") and respawns. [2]
- During the next boot, the code in `obtain_monmap` checks for `temp_newer_monmap` only if `store.exists("monmap", "last_committed")` is true, and that key is still empty at this point.
- So, finding only the mkfs monmap, the mon goes into a respawn loop.
I posted a log at ceph-post-file: 2ab3ff0f-87e9-47ea-8f31-a9d4ebc3e60c
debug_mon = 20 starts at 2021-04-08 11:06:52.928.
Here's a possible fix (not tested): we should check for `temp_newer_monmap` before falling back to the mkfs monmap:

```
diff --git a/src/ceph_mon.cc b/src/ceph_mon.cc
index 306d663d33a..f9712ef96e2 100644
--- a/src/ceph_mon.cc
+++ b/src/ceph_mon.cc
@@ -128,6 +128,24 @@ int obtain_monmap(MonitorDBStore &store, bufferlist &bl)
     int err = store.get("mkfs", "monmap", bl);
     ceph_assert(err == 0);
     ceph_assert(bl.length() > 0);
+
+    // see if there is a stashed newer map (see bootstrap())
+    if (store.exists("mon_sync", "temp_newer_monmap")) {
+      bufferlist bl2;
+      int err = store.get("mon_sync", "temp_newer_monmap", bl2);
+      ceph_assert(err == 0);
+      ceph_assert(bl2.length() > 0);
+      MonMap b;
+      b.decode(bl2);
+      if (b.get_epoch() > latest_ver) {
+        dout(10) << __func__ << " using stashed monmap " << b.get_epoch()
+                 << " instead" << dendl;
+        bl = std::move(bl2);
+      } else {
+        dout(10) << __func__ << " ignoring stashed monmap " << b.get_epoch()
+                 << dendl;
+      }
+    }
     return 0;
   }
```
[1] dump-keys after mkfs:
mkfs / keyring
mkfs / monmap
monitor / cluster_uuid
monitor / feature_set
monitor / magic
[2] dump-keys after respawn loop:
mkfs / keyring
mkfs / monmap
mon_sync / temp_newer_monmap
monitor / cluster_uuid
monitor / feature_set
monitor / magic
P.S. The only way we managed to bootstrap this mon was by adding --monmap <the latest monmap> at mkfs time.
Updated by Dan van der Ster about 3 years ago
Doh, ignore that fix, this is better:
```
diff --git a/src/ceph_mon.cc b/src/ceph_mon.cc
index 306d663d33a..f2d417ac465 100644
--- a/src/ceph_mon.cc
+++ b/src/ceph_mon.cc
@@ -123,6 +123,14 @@ int obtain_monmap(MonitorDBStore &store, bufferlist &bl)
     }
   }
 
+  if (store.exists("mon_sync", "temp_newer_monmap")) {
+    dout(10) << __func__ << " found temp_newer_monmap" << dendl;
+    int err = store.get("mon_sync", "temp_newer_monmap", bl);
+    ceph_assert(err == 0);
+    ceph_assert(bl.length() > 0);
+    return 0;
+  }
+
   if (store.exists("mkfs", "monmap")) {
     dout(10) << __func__ << " found mkfs monmap" << dendl;
     int err = store.get("mkfs", "monmap", bl);
```
Updated by Dan van der Ster about 3 years ago
- Status changed from New to Fix Under Review
- Assignee set to Dan van der Ster
We have tested the fix in PR 40660 and it solves our bootstrapping problem.
Updated by Kefu Chai about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50795: nautilus: mon: spawn loop after mon reinstalled added
Updated by Backport Bot about 3 years ago
- Copied to Backport #50796: octopus: mon: spawn loop after mon reinstalled added
Updated by Backport Bot about 3 years ago
- Copied to Backport #50797: pacific: mon: spawn loop after mon reinstalled added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".