Support #47233: cephadm: orch apply mon "label:osd" crashes cluster - Orchestrator - Ceph

Support #47233

closed

cephadm: orch apply mon "label:osd" crashes cluster

Added by Gunther Heinrich over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Category: cephadm
Target version: Ceph - v15.2.5

Description

I have a virtual Ceph cluster with 6 VMs/Hosts running on Ubuntu Server 20.04. The cluster is running on Podman.
Three hosts are Mons and three hosts are OSDs with a corresponding label each:

HOST    ADDR    LABEL
mon-01  mon-01  mon
mon-02  mon-02  mon
mon-03  mon-03  mon
osd-01  osd-01  osd
osd-02  osd-02  osd
osd-03  osd-03  osd

The virtual cluster is running and everything is OK/Healthy.

I then ran an orchestrator command that mistakenly used the wrong label:

sudo ceph orch apply mon "label:osd"

A few seconds later, the cluster was effectively offline. Neither administrative commands nor any reporting worked.
A check of the running Podman containers on each VM shows that the Mon containers are gone from all Mon hosts, while they are now present on all OSD hosts. The other containers on the Mon hosts, such as MDS, Grafana and MGR, are still running.
A reboot didn't resolve the problem. I also tried installing cephadm and ceph-common via cephadm on all OSD hosts (including copying all necessary files, such as the ceph admin keyring) in the hope of resolving the issue this way, but that didn't help either.


Updated by Sebastian Wagner over 3 years ago

Do you have the MGR logs? https://docs.ceph.com/docs/master/cephadm/troubleshooting/


Updated by Sebastian Wagner over 3 years ago

Subject changed from Orchestrator command crashes cluster to cephadm: orch apply mon "label:osd" crashes cluster
Category changed from orchestrator to cephadm
Priority changed from Normal to Urgent


Updated by Sebastian Wagner over 3 years ago

Ha! You should now have three MONs on osd-01, osd-02 and osd-03.

Unfortunately, your /etc/ceph/ceph.conf is now outdated, as the MONs are all located on different hosts.

Now, please:

1. make sure your /etc/ceph/ceph.conf points to the correct MONs
2. run `cephadm ls` on your OSD hosts.
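
Step 1 can be sketched roughly as follows. This is only an illustration, not part of cephadm: the helper name `set_mon_host`, the placeholder fsid and the sample file contents are assumptions, and the sketch assumes the config lists its monitors on a single `mon_host` line. Real monitor addresses may need to be IPs or full v1/v2 address vectors rather than bare hostnames.

```python
import re


def set_mon_host(conf_text, mon_addrs):
    """Return conf_text with its mon_host line pointing at mon_addrs.

    Hypothetical helper for illustration; it only rewrites the first
    'mon_host = ...' (or 'mon host = ...') line and keeps its indentation.
    """
    new_line = "mon_host = " + ",".join(mon_addrs)
    pattern = re.compile(r"^([ \t]*)mon[_ ]host[ \t]*=.*$", re.MULTILINE)
    if not pattern.search(conf_text):
        raise ValueError("no mon_host line found in config")
    return pattern.sub(lambda m: m.group(1) + new_line, conf_text, count=1)


# Sample contents (assumed); the fsid is a placeholder, not a real cluster id.
old_conf = (
    "[global]\n"
    "\tfsid = 00000000-0000-0000-0000-000000000000\n"
    "\tmon_host = mon-01,mon-02,mon-03\n"
)

# After the mistaken apply, the MONs live on the OSD hosts.
new_conf = set_mon_host(old_conf, ["osd-01", "osd-02", "osd-03"])
print(new_conf)
```

The rewritten text would then replace /etc/ceph/ceph.conf on the host where the ceph CLI is run; keeping a backup of the old file first is a sensible precaution.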


Updated by Sebastian Wagner over 3 years ago

Tracker changed from Bug to Support


Updated by Gunther Heinrich over 3 years ago

Thanks for your feedback. Your solution works perfectly: I edited the ceph.conf and the cluster status came back afterwards. I didn't need to run 'cephadm ls' on every OSD host, but I did run it on osd-01.
The question remains how cephadm could prevent such a situation in the first place.


Updated by Sebastian Wagner over 3 years ago

G. Heinrich wrote:

how cephadm could prevent such situation in the first place.

We need to encourage users to use the YAML-based way of deploying services, as outlined in the grey box here: https://docs.ceph.com/docs/master/cephadm/install/#deploy-additional-monitors-optional

Using the CLI is just too dangerous.


Updated by Gunther Heinrich over 3 years ago

I support this notion, as it will massively reduce the risk of making fairly obvious errors. The YAML approach doesn't fully eliminate the risk, though, since I was able to crash the cluster the same way by using the following YAML:

service_type: mon
placement:
        label: "osd"

I'm not sure what could be done to completely remove the risk of such errors by an admin. Although this one is an obvious error, there might be more obscure ones. I have some ideas for this, but I'm not sure whether I should add them here or post a separate suggestion ticket.
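
One possible kind of safeguard is a pre-flight check on the admin's side. It could be sketched like this. It is purely hypothetical and not something cephadm provides: the function names, the inventory dictionary (copied from the host table in the description) and the warning texts are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def hosts_with_label(inventory: Dict[str, Set[str]], label: str) -> Set[str]:
    """Return the hosts whose label set contains the given label."""
    return {host for host, labels in inventory.items() if label in labels}


def check_mon_placement(
    inventory: Dict[str, Set[str]], current_mons: Set[str], label: str
) -> List[str]:
    """Return warnings for a 'label:<label>' mon placement (empty list = OK).

    Hypothetical check: it flags placements that match no hosts, or that
    would move every monitor onto hosts where none is running today.
    """
    target = hosts_with_label(inventory, label)
    warnings = []
    if not target:
        warnings.append("label %r matches no hosts" % label)
    elif not (target & current_mons):
        warnings.append(
            "label %r keeps none of the current mon hosts (%s)"
            % (label, ", ".join(sorted(current_mons)))
        )
    return warnings


# Inventory taken from the host table in the description.
inventory = {
    "mon-01": {"mon"}, "mon-02": {"mon"}, "mon-03": {"mon"},
    "osd-01": {"osd"}, "osd-02": {"osd"}, "osd-03": {"osd"},
}
current_mons = {"mon-01", "mon-02", "mon-03"}

print(check_mon_placement(inventory, current_mons, "osd"))  # one warning
print(check_mon_placement(inventory, current_mons, "mon"))  # no warnings
```

A check like this only catches the case where all monitors would be relocated at once; it would not detect subtler placement mistakes.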
