Bug #45093

cephadm: mgrs transiently getting co-located (one node gets two when only one was asked for)

Added by Nathan Cutler 6 months ago. Updated 24 days ago.

Status:
Pending Backport
Priority:
Urgent
Category:
cephadm/scheduler
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

After "ceph orch apply mgr node1,node2,node3", cluster has four MGRs

This started happening in master very recently, but I have not pinpointed which PR/commit did it:

After bootstrapping a cluster, I ask for three MGRs (total), as usual:

    master: ++ ceph orch apply mgr node1,node2,node3
    master: Scheduled mgr update...

Over the past few months there has been some back-and-forth over the semantics: does "ceph orch apply mgr node1,node2,node3" mean adding three more MGRs to those already running, or does it mean I want three MGRs in total? I thought the latter had finally won out, but apparently not?

master:~ # ceph -s
...
  services:
    mon: 3 daemons, quorum node1,node2,node3 (age 4h)
    mgr: node1.bwokiv(active, since 4h), standbys: node1.amxnnt, node2.hghrym, node3.vyanht
...

The Ceph version is:

master:~ # ceph --version
ceph version 16.0.0-704-g38ae96e1c9 (38ae96e1c9a4f8ad3095626c71951a122bdc8fe7) pacific (dev)
master:~ # ceph versions
{
    "mon": {
        "ceph version 16.0.0-704-g38ae96e1c9 (38ae96e1c9a4f8ad3095626c71951a122bdc8fe7) pacific (dev)": 3
    },
    "mgr": {
        "ceph version 16.0.0-704-g38ae96e1c9 (38ae96e1c9a4f8ad3095626c71951a122bdc8fe7) pacific (dev)": 4
    },
    "osd": {
        "ceph version 16.0.0-704-g38ae96e1c9 (38ae96e1c9a4f8ad3095626c71951a122bdc8fe7) pacific (dev)": 6
    },
    "mds": {},
    "overall": {
        "ceph version 16.0.0-704-g38ae96e1c9 (38ae96e1c9a4f8ad3095626c71951a122bdc8fe7) pacific (dev)": 13
    }
}

Related issues

Related to Orchestrator - Bug #45876: cephadm: handle port conflicts gracefully New

History

#1 Updated by Sebastian Wagner 6 months ago

  • Subject changed from After "ceph orch apply mgr node1,node2,node3", cluster has four MGRs to cephadm: mgrs are getting co-located
  • Description updated (diff)
  • Category set to cephadm
  • Priority changed from Normal to High

Could you attach the output of the following commands?

ceph orch ls --format json
ceph orch ps --format json

#2 Updated by Sebastian Wagner 6 months ago

  • Status changed from New to Need More Info

#3 Updated by Nathan Cutler 6 months ago

  • Status changed from Need More Info to Can't reproduce

It is not 100% reproducible.

#4 Updated by Nathan Cutler 6 months ago

  • Status changed from Can't reproduce to New

It happened again. Here is the output of the commands you asked for:

https://paste2.org/CGsgjJWy

NOTE: this time it's an octopus cluster:

    "overall": {
        "ceph version 15.2.1-144-g71165f2e04 (71165f2e040917b0e4ef86257174c290ee0a6007) octopus (stable)": 15
    }

Note on reproducibility: this is far from happening every time. As a rough estimate, I'd say it happens approximately 5% of the time.

#5 Updated by Nathan Cutler 6 months ago

  • Subject changed from cephadm: mgrs are getting co-located to cephadm: mgrs transiently getting co-located (one node gets two when only one was asked for)

#6 Updated by Nathan Cutler 6 months ago

Note: this is not a matter of "I asked for 4 MGRs and got 4, only two were unexpectedly colocated."

What is happening is: "I ask for 3 MGRs, but I get 4 (about 1 time in 20)."
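The condition can be checked mechanically by comparing the `running` and `size` values that `ceph orch ls` reports for each service. The following is only a minimal sketch: it assumes the JSON output is a list of service objects, each with a `service_name` and a `status` mapping holding `running` and `size`, matching the fields visible in the output attached to this ticket. The function name `overprovisioned` is hypothetical.

```python
import json


def overprovisioned(orch_ls_json):
    """Return {service_name: (running, size)} for services with running > size.

    Assumption: the input is the text printed by `ceph orch ls --format json`,
    i.e. a JSON list of services, each carrying a `status` mapping.
    """
    result = {}
    for svc in json.loads(orch_ls_json):
        status = svc.get("status", {})
        running = status.get("running", 0)
        size = status.get("size", 0)
        if running > size:
            result[svc["service_name"]] = (running, size)
    return result


# Sample reduced to the values reported in this ticket.
sample = json.dumps([
    {"service_name": "mgr", "status": {"running": 4, "size": 3}},
    {"service_name": "mon", "status": {"running": 3, "size": 3}},
    {"service_name": "osd.testing_dg_node1", "status": {"running": 0, "size": 1}},
])
print(overprovisioned(sample))  # {'mgr': (4, 3)}
```

Only the mgr service is flagged here, which matches "asked for 3, got 4".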

#7 Updated by Sebastian Wagner 6 months ago

---
placement:
  hosts:
  - hostname: node2
  - hostname: node3
service_id: myfs
service_name: mds.myfs
service_type: mds
status:
  container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
  container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
  created: '2020-04-28T17:14:08.426006'
  last_refresh: '2020-04-28T20:07:42.477694'
  running: 2
  size: 2
---
placement:
  hosts:
  - hostname: node1
  - hostname: node2
  - hostname: node3
service_name: mgr
service_type: mgr
status:
  container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
  container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
  created: '2020-04-28T17:13:31.469050'
  last_refresh: '2020-04-28T20:07:42.477749'
  running: 4
  size: 3
---
placement:
  hosts:
  - hostname: node1
  - hostname: node2
  - hostname: node3
service_name: mon
service_type: mon
status:
  container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
  container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
  created: '2020-04-28T17:13:30.991819'
  last_refresh: '2020-04-28T20:07:42.477801'
  running: 3
  size: 3
---
data_devices:
  all: true
placement:
  host_pattern: node1*
service_id: testing_dg_node1
service_name: osd.testing_dg_node1
service_type: osd
status:
  running: 0
  size: 1
---
data_devices:
  all: true
placement:
  host_pattern: node2*
service_id: testing_dg_node2
service_name: osd.testing_dg_node2
service_type: osd
status:
  running: 0
  size: 1
---
data_devices:
  all: true
placement:
  host_pattern: node3*
service_id: testing_dg_node3
service_name: osd.testing_dg_node3
service_type: osd
status:
  running: 0
  size: 1

container_id: f3f939954580
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:15:02.020465'
daemon_id: myfs.node2.ostfwq
daemon_type: mds
hostname: node2
last_refresh: '2020-04-28T20:07:43.835147'
started: '2020-04-28T17:15:02.020339'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 94d8443eab4e
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:15:03.676511'
daemon_id: myfs.node3.jtbnlp
daemon_type: mds
hostname: node3
last_refresh: '2020-04-28T20:07:42.477694'
started: '2020-04-28T17:15:03.752398'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 5f62f0fbcd37
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:13:44.958197'
daemon_id: node1.ktsgzd
daemon_type: mgr
hostname: node1
last_refresh: '2020-04-28T20:07:45.245798'
started: '2020-04-28T17:13:45.034558'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 371661c2a597
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:12:28.251674'
daemon_id: node1.ymmyol
daemon_type: mgr
hostname: node1
last_refresh: '2020-04-28T20:07:45.245543'
started: '2020-04-28T17:12:28.369834'
status: 1
status_desc: running
version: 15.2.1
---
container_id: dd83ae4553e8
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:13:50.353163'
daemon_id: node2.pdorgs
daemon_type: mgr
hostname: node2
last_refresh: '2020-04-28T20:07:43.835313'
started: '2020-04-28T17:13:50.449624'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 9a5eeec94954
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:13:51.868881'
daemon_id: node3.nrkhfn
daemon_type: mgr
hostname: node3
last_refresh: '2020-04-28T20:07:42.477749'
started: '2020-04-28T17:13:51.969129'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 845adccfc358
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:12:23.507611'
daemon_id: node1
daemon_type: mon
hostname: node1
last_refresh: '2020-04-28T20:07:45.245853'
started: '2020-04-28T17:13:34.494148'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 645f552ea8da
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:13:38.165240'
daemon_id: node2
daemon_type: mon
hostname: node2
last_refresh: '2020-04-28T20:07:43.835206'
started: '2020-04-28T17:13:38.253470'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 168d38fa1e0e
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:13:41.300931'
daemon_id: node3
daemon_type: mon
hostname: node3
last_refresh: '2020-04-28T20:07:42.477801'
started: '2020-04-28T17:13:41.396329'
status: 1
status_desc: running
version: 15.2.1
---
container_id: ed3792f8040d
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:14:15.778109'
daemon_id: '0'
daemon_type: osd
hostname: node1
last_refresh: '2020-04-28T20:07:45.245687'
started: '2020-04-28T17:14:16.729388'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 30bcc5f16847
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:14:17.598099'
daemon_id: '1'
daemon_type: osd
hostname: node1
last_refresh: '2020-04-28T20:07:45.245745'
started: '2020-04-28T17:14:18.363908'
status: 1
status_desc: running
version: 15.2.1
---
container_id: fa082b37d0c6
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:14:42.256867'
daemon_id: '2'
daemon_type: osd
hostname: node2
last_refresh: '2020-04-28T20:07:43.835259'
started: '2020-04-28T17:14:43.199156'
status: 1
status_desc: running
version: 15.2.1
---
container_id: ad1ab7b4c7f6
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:14:44.148857'
daemon_id: '3'
daemon_type: osd
hostname: node2
last_refresh: '2020-04-28T20:07:43.835017'
started: '2020-04-28T17:14:44.968734'
status: 1
status_desc: running
version: 15.2.1
---
container_id: 3ad3f75f8a60
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:14:58.416553'
daemon_id: '4'
daemon_type: osd
hostname: node3
last_refresh: '2020-04-28T20:07:42.477479'
started: '2020-04-28T17:14:59.355699'
status: 1
status_desc: running
version: 15.2.1
---
container_id: b4b4af939a17
container_image_id: d4bfb5d9547d9f33926b5ca58608f79616ad54c20cb26d51faeebda40af93c67
container_image_name: registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph:latest
created: '2020-04-28T17:15:00.196544'
daemon_id: '5'
daemon_type: osd
hostname: node3
last_refresh: '2020-04-28T20:07:42.477634'
started: '2020-04-28T17:15:01.049624'
status: 1
status_desc: running
version: 15.2.1

#8 Updated by Sebastian Wagner 6 months ago

  • Priority changed from High to Urgent

#9 Updated by Sebastian Wagner 6 months ago

I really hope that https://github.com/ceph/ceph/pull/34633 will fix this.

#10 Updated by Sebastian Wagner 5 months ago

  • Category changed from cephadm to cephadm/scheduler

#11 Updated by Sebastian Wagner 4 months ago

I am starting to suspect that this comes from a race between the host refresh and the scheduler, which starts creating new daemons before the host has been refreshed.
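To make the suspected race concrete, here is a purely illustrative toy model. It is not cephadm's scheduler code: the `schedule` function and the cache dictionaries are hypothetical stand-ins, and the only assumption is that placement decisions are made from a per-host daemon inventory that may not have been refreshed yet.

```python
def schedule(wanted_hosts, cached_daemons):
    """Return the hosts on which a new daemon would be created.

    `cached_daemons` maps host -> set of daemon names the scheduler
    believes are running there (a stand-in for the refreshed inventory).
    """
    return [h for h in wanted_hosts if not cached_daemons.get(h)]


wanted = ["node1", "node2", "node3"]
# Hypothetical ground truth: one mgr already runs on node1 from bootstrap.
actual = {"node1": {"mgr.node1.ymmyol"}}
already_running = sum(len(v) for v in actual.values())

# Case 1: the scheduler runs before node1 was refreshed, so its cache is
# empty and it decides to create a mgr on every requested host.
stale_cache = {}
created_stale = schedule(wanted, stale_cache)

# Case 2: the cache reflects reality, so node1 is skipped.
fresh_cache = dict(actual)
created_fresh = schedule(wanted, fresh_cache)

print(created_stale, already_running + len(created_stale))
# ['node1', 'node2', 'node3'] 4
print(created_fresh, already_running + len(created_fresh))
# ['node2', 'node3'] 3
```

In this model the stale case ends with four MGRs, two of them on node1, which is the symptom reported here. Whether the real code path behaves this way is exactly what remains to be confirmed.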

#12 Updated by Sebastian Wagner 4 months ago

only happens with MGRs

#13 Updated by Sebastian Wagner 4 months ago

  • Related to Bug #45876: cephadm: handle port conflicts gracefully added

#14 Updated by Sebastian Wagner 3 months ago

  • Assignee set to Matthew Oliver

#15 Updated by Nathan Cutler about 2 months ago

> only happens with MGRs

It has recently been reported to happen at bootstrap, when supplying a spec that asks for 1 MON and 1 MGR (and nothing else). Sometimes it gives 2 MGRs instead of 1, and other times it gives 2 MONs instead of 1.

So (apparently) it's no longer limited to MGRs.

#16 Updated by Sebastian Wagner about 1 month ago

https://github.com/ceph/ceph/pull/36766 will probably help in tracking this down.

#17 Updated by Sebastian Wagner about 1 month ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Matthew Oliver to Sebastian Wagner
  • Pull request ID set to 37135

#18 Updated by Nathan Cutler 24 days ago

  • Status changed from Fix Under Review to Pending Backport
