Backport #38850
Status: Closed
upgrade: 1 nautilus mon + 1 luminous mon can't automatically form quorum
Description
Seen while upgrading Luminous (12.2.10) to Nautilus (14.2.0). Three mon hosts, four osd hosts. The process was:
- Shut down mon1 (quorum is now mon2+mon3, both Luminous)
- Upgrade mon1 to Nautilus
- Start mon1 again. mon1 joins the cluster, and `ceph health` reports all three mons OK
- Shut down mon2 (leaving mon1 = Nautilus and mon3 = Luminous)
- `ceph health` is now broken (eventually times out)
- mon1 logs repeat:
2019-03-22 09:50:21.634 7f31906d1700 1 mon.mon1@0(electing) e1 peer v1:172.16.1.13:6789/0 release < min_mon_release, or missing features 0
- mon3 logs repeat:
2019-03-22 09:51:59.644672 7fa9f589a700 -1 mon.mon3@2(probing) e1 handle_probe missing features, have 4611087853745930235, required 0, missing 0
This means that the cluster is effectively down until you're able to complete the upgrade of mon2.
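For anyone who hits this mid-upgrade, the mixed-release state can be confirmed before taking the second mon down. A small sketch using standard commands (`ceph versions` requires quorum and has existed since Luminous; the `min_mon_release` field only appears in Nautilus-era `mon_status` output, so the grep matches nothing on a Luminous mon):

# ceph versions
# ceph daemon mon.$(hostname) mon_status | grep min_mon_release

The first command summarizes how many daemons are running each release; the second shows the minimum mon release the local (Nautilus) mon claims to accept.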
Curiously, on mon1 (Nautilus):
# ceph daemon mon.$(hostname) mon_status | grep min_mon_release
    "min_mon_release": 12,
    "min_mon_release_name": "luminous",
So why is it complaining about release < min_mon_release?
Even more interesting, I can run this on the Luminous mon:
# ceph daemon mon.$(hostname) quorum enter
started responding to quorum, initiated new election
...and bam, a few seconds later, we're in business again:
# ceph status
  cluster:
    id:     44e4a575-5c31-3c61-88c5-001ea49e8aaa
    health: HEALTH_WARN
            1/3 mons down, quorum mon1,mon3

  services:
    mon: 3 daemons, quorum mon1,mon3, out of quorum: mon2
    mgr: mon3(active), standbys: mon1
    osd: 30 osds: 30 up, 30 in

  data:
    pools:   1 pools, 512 pgs
    objects: 0 objects, 0B
    usage:   30.3GiB used, 567GiB / 597GiB avail
    pgs:     512 active+clean
Not that "quorum enter" doesn't help if run from the Nautilus mon, it only works when run from the Luminous mon.