Support #47233

cephadm: orch apply mon "label:osd" crashes cluster

Added by Gunther Heinrich over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: cephadm
Target version:
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

I have a virtual Ceph cluster with 6 VMs/Hosts running on Ubuntu Server 20.04. The cluster is running on Podman.
Three hosts are Mons and three hosts are OSDs with a corresponding label each:

HOST    ADDR    LABEL
mon-01  mon-01  mon
mon-02  mon-02  mon
mon-03  mon-03  mon
osd-01  osd-01  osd
osd-02  osd-02  osd
osd-03  osd-03  osd

The virtual cluster is running and everything is OK/Healthy.

Now I enter an orchestrator command that mistakenly uses the wrong label (osd instead of mon):

sudo ceph orch apply mon "label:osd" 

A few seconds later, the cluster is basically offline. Administrative commands and all reporting stop working.
A check of the running Podman containers on each VM shows that the Mon containers are gone from all Mon hosts and are now present on all OSD hosts. Other containers such as MDS, Grafana and MGR are still running on the Mon hosts.
A reboot didn't resolve the problem. I also tried installing cephadm and ceph-common via cephadm on all OSD hosts (including copying all necessary files such as the Ceph admin keyring) in the hope of resolving the issue that way, but that didn't help either.

History

#2 Updated by Sebastian Wagner over 2 years ago

  • Subject changed from Orchestrator command crashes cluster to cephadm: orch apply mon "label:osd" crashes cluster
  • Category changed from orchestrator to cephadm
  • Priority changed from Normal to Urgent

#3 Updated by Sebastian Wagner over 2 years ago

Ha! You should now have three MONs on osd-01, osd-02 and osd-03.

Unfortunately, your /etc/ceph/ceph.conf is now outdated, as the MONs are all located on different hosts.

Now, please:

1. Make sure your /etc/ceph/ceph.conf points to the correct MONs.
2. Run `cephadm ls` on your OSD hosts.
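
Step 1 could be sketched roughly as follows in Python, assuming the client's ceph.conf carries the monitor addresses in a single `mon_host` line. The helper name `set_mon_host` and all addresses and the fsid below are hypothetical placeholders, not values from this cluster; treat it as an illustration of the edit rather than a supported tool.

```python
import re
from typing import Sequence


def set_mon_host(conf_text: str, mon_addrs: Sequence[str]) -> str:
    """Return conf_text with the value of its mon_host option replaced.

    Hypothetical helper: it only rewrites an existing mon_host line and
    leaves every other line of the file untouched.
    """
    new_value = ",".join(mon_addrs)
    # Ceph accepts both "mon_host" and "mon host" as the option name.
    pattern = re.compile(r"^([ \t]*mon[_ ]host[ \t]*=[ \t]*).*$", re.MULTILINE)
    updated, count = pattern.subn(lambda m: m.group(1) + new_value, conf_text)
    if count == 0:
        raise ValueError("no mon_host entry found in the given ceph.conf text")
    return updated


# Placeholder config text: the old addresses stand for the former Mon hosts.
old_conf = (
    "[global]\n"
    "\tfsid = 00000000-0000-0000-0000-000000000000\n"
    "\tmon_host = 192.168.0.11,192.168.0.12,192.168.0.13\n"
)

# Placeholder addresses standing for osd-01, osd-02 and osd-03.
new_conf = set_mon_host(
    old_conf, ["192.168.0.21", "192.168.0.22", "192.168.0.23"]
)
print(new_conf)
```

In practice the text would be read from and written back to /etc/ceph/ceph.conf, ideally after keeping a backup copy of the file.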

#4 Updated by Sebastian Wagner over 2 years ago

  • Tracker changed from Bug to Support

#5 Updated by Gunther Heinrich over 2 years ago

Thanks for your feedback. Your solution works perfectly: I edited the ceph.conf and afterwards the cluster status came back. I didn't have to run 'cephadm ls' on all OSD hosts, but I did run it on osd-01.
The question remains how cephadm could prevent such a situation in the first place.

#6 Updated by Sebastian Wagner over 2 years ago

G. Heinrich wrote:

how cephadm could prevent such a situation in the first place.

We need to encourage users to use the YAML-based way of deploying services, as outlined in the grey box here: https://docs.ceph.com/docs/master/cephadm/install/#deploy-additional-monitors-optional

Using the CLI is just too dangerous.

#7 Updated by Gunther Heinrich over 2 years ago

I support this notion, as it will massively reduce the risk of making somewhat obvious errors. The YAML approach doesn't fully eliminate the risk, though, since I was able to crash the cluster the same way by using the following YAML:

service_type: mon
placement:
  label: "osd"

I'm not sure what could be done to completely remove the risk of such errors by an admin. Although this one is an obvious error, there might be more obscure ones. I have some ideas for this, but I'm not sure whether I should add them here or open a separate suggestion ticket.
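
As one illustration of the kind of safeguard in question, here is a minimal Python sketch of a pre-flight check that an admin could run before applying a placement. It is not part of cephadm; the function names `hosts_with_label` and `check_placement` and the warning texts are made up for this example. The check only compares the hosts a label would select with the hosts where the service currently runs, and warns when the two sets do not overlap at all.

```python
from typing import Dict, Iterable, List, Set


def hosts_with_label(host_labels: Dict[str, Set[str]], label: str) -> List[str]:
    """Return the sorted names of all hosts that carry the given label."""
    return sorted(h for h, labels in host_labels.items() if label in labels)


def check_placement(
    host_labels: Dict[str, Set[str]],
    current_hosts: Iterable[str],
    label: str,
) -> List[str]:
    """Return warnings for a label-based placement (hypothetical check).

    An empty list means the check found nothing suspicious.
    """
    target = set(hosts_with_label(host_labels, label))
    if not target:
        return [f"label '{label}' matches no host"]
    current = set(current_hosts)
    if current and not (current & target):
        # Every running daemon would be removed from its current host.
        return [
            f"label '{label}' selects {sorted(target)}, which shares no host "
            f"with the current placement {sorted(current)}"
        ]
    return []


# Host table taken from the description of this ticket.
HOSTS = {
    "mon-01": {"mon"}, "mon-02": {"mon"}, "mon-03": {"mon"},
    "osd-01": {"osd"}, "osd-02": {"osd"}, "osd-03": {"osd"},
}
CURRENT_MONS = ["mon-01", "mon-02", "mon-03"]

for candidate in ("mon", "osd"):
    problems = check_placement(HOSTS, CURRENT_MONS, candidate)
    print(candidate, "->", problems or "ok")
```

With the host table above, the label mon passes and the label osd produces a warning, which matches the mistake described in this ticket. A real safeguard would also need to cover placements by host name or count, which this sketch ignores.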

#8 Updated by Sebastian Wagner over 2 years ago

In my experience using yaml is typically enough to prevent those errors.

#9 Updated by Sebastian Wagner about 2 years ago

  • Status changed from New to Resolved
