upgrade: 1 nautilus mon + 1 luminous mon can't automatically form quorum
Seen while upgrading Luminous (12.2.10) to Nautilus (14.2.0). Three mon hosts, four osd hosts. The process was:
- Shut down mon1 (quorum is now mon2+mon3, both Luminous)
- Upgrade mon1 to Nautilus
- Start mon1 again. mon1 joins cluster, `ceph health` reports all three mons OK
- Shut down mon2 (leaving mon1 = Nautilus and mon3 = Luminous)
- `ceph health` is now broken (eventually times out)
- mon1 logs repeat:
2019-03-22 09:50:21.634 7f31906d1700 1 mon.mon1@0(electing) e1 peer v1:172.16.1.13:6789/0 release < min_mon_release, or missing features 0
- mon3 logs repeat:
2019-03-22 09:51:59.644672 7fa9f589a700 -1 mon.mon3@2(probing) e1 handle_probe missing features, have 4611087853745930235, required 0, missing 0
This means that the cluster is effectively down until you're able to complete the upgrade of mon2.
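The mon3 log line above reports `have`, `required` and `missing` feature masks. As a rough sanity check, here is a minimal sketch that recomputes `missing` from the other two values. It assumes `missing` is simply the required bits that are absent from `have`, which is consistent with the numbers printed but is not confirmed anywhere in this report:

```shell
# Values copied verbatim from the mon3 log line.
have=4611087853745930235
required=0

# Assumption: "missing" = required bits that are not set in "have".
missing=$(( required & ~have ))

echo "missing=$missing"
```

With `required` at 0 the result is necessarily 0, so under that assumption the Luminous mon is rejecting the probe without any feature bit actually being absent.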
Curiously, on mon1 (Nautilus):
# ceph daemon mon.$(hostname) mon_status|grep min_mon_release
    "min_mon_release": 12,
    "min_mon_release_name": "luminous",
So why is it complaining about release < min_mon_release?
Even more interesting, I can run this on the Luminous mon:
# ceph daemon mon.$(hostname) quorum enter
started responding to quorum, initiated new election
...and bam a few seconds later, we're in business again:
# ceph status
  cluster:
    id:     44e4a575-5c31-3c61-88c5-001ea49e8aaa
    health: HEALTH_WARN
            1/3 mons down, quorum mon1,mon3

  services:
    mon: 3 daemons, quorum mon1,mon3, out of quorum: mon2
    mgr: mon3(active), standbys: mon1
    osd: 30 osds: 30 up, 30 in

  data:
    pools:   1 pools, 512 pgs
    objects: 0 objects, 0B
    usage:   30.3GiB used, 567GiB / 597GiB avail
    pgs:     512 active+clean

Note that "quorum enter" doesn't help if run from the Nautilus mon; it only works when run from the Luminous mon.
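For anyone who needs to apply this workaround repeatedly during a slow upgrade, here is a rough sketch of wrapping it in a helper. The function names `mon_state` and `rejoin_if_probing` are made up for illustration. The sketch assumes that `mon_status` prints a `"state"` field matching the state shown in the log prefix (e.g. `probing`). It is meant to be run on the Luminous mon only, since that is the only place the workaround was seen to work:

```shell
# Print the value of the first "state" field found in mon_status JSON on stdin.
# Assumption: the output contains a line such as:  "state": "probing",
mon_state() {
    sed -n 's/.*"state": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: if the local mon is stuck probing, ask it to
# re-enter quorum using the same admin socket command shown above.
rejoin_if_probing() {
    state=$(ceph daemon "mon.$(hostname)" mon_status | mon_state)
    if [ "$state" = "probing" ]; then
        ceph daemon "mon.$(hostname)" quorum enter
    else
        echo "mon.$(hostname) is in state '$state'; nothing to do"
    fi
}
```

Calling `rejoin_if_probing` on mon3 after mon2 goes down would then trigger the new election only when it is needed. This does not fix the underlying problem; it only automates the manual step.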
#2 Updated by Tim Serong 3 months ago
Just to clarify slightly: I know the upgrade instructions in the Nautilus release announcement say to "upgrade monitors by installing the new packages and restarting the monitor daemons", but this quick way of upgrading is not always possible. Depending on what distro you're using, you may have to upgrade the base OS before you can install the new Nautilus packages, which means that each node is going to be down for quite some time (at least several minutes, maybe many tens of minutes or longer).
#6 Updated by Joao Eduardo Luis about 1 month ago
I have been working on it and am able to reproduce it, but have not yet been able to pin down the cause.
Reproducing basically takes the following steps:
1. 3 monitors on luminous (a, b, c)
2. shutdown mon.a, let quorum form (2 luminous monitors, b and c)
3. upgrade mon.a to nautilus; quorum forms with a, b, and c.
4. shutdown mon.b; quorum is unable to form between mon.a and mon.c
I'm trying to figure out which paths are involved here, and why that's happening. Evidence points to a feature mismatch, but I'm unable to pinpoint why just yet.