Documentation #45936

cephadm: document restarting the whole cluster

Added by Sebastian Wagner 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

[15:23:28] <dcapone2004> I have a ceph dev cluster of 3 nodes deployed using cephadm with Octopus on CentOS 8
[15:24:26] <dcapone2004> this is going to be a hyperconverged OpenStack cluster that I am essentially testing... a key element is that the location where we deploy this cluster will change in about 18 months, so I have been trying to write up procedures to safely shut down the cluster and power it back up
[15:24:35] <dcapone2004> this is the latest place where I have run into an issue
[15:25:39] <dcapone2004> I stopped all disk activity on the cluster, set osd noout, then shut down the nodes of the cluster 1 by 1, with the active Manager being LAST
[15:26:24] <dcapone2004> a few hours later, I tried to power the cluster back up, starting with the last active manager and going in the reverse order that I shut them down
[15:26:44] <dcapone2004> and now I lost 2 OSD containers and all my manager containers
[15:27:10] <dcapone2004> ceph orch daemon redeploy does nothing, nor does restart
[15:27:43] <dcapone2004> and when simply trying podman start, podman claims to not know about those containers, but the ceph dashboard shows the OSDs are in but down
[15:28:38] <SebastianW> dcapone2004: what does "I lost my manager containers" mean?
[15:28:57] <dcapone2004> meaning ceph -s shows no active manager containers
[15:29:31] <dcapone2004> podman start mgr.dev-lx-ceph11 (my hostname) says the container doesn't exist
[15:31:22] <dcapone2004> originally, when I first started it up, I only lost 1 of 2 containers; then, after trying to use redeploy, the second disappeared... I am unsure if this is/was related to my attempt to upgrade to 15.2.3, which failed (and which I filed a bug report for), and whether the inconsistent version numbers between containers caused this issue
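
A rough sketch of the shutdown/startup sequence such documentation could cover, based on the conversation above. The cluster FSID is a placeholder, and the exact set of OSD flags is an assumption rather than a confirmed procedure:

    # Before shutting down: stop client I/O, then set flags so OSDs are not
    # marked out and no recovery/rebalance starts while nodes are down.
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill

    # Shut the nodes down one by one, the host running the active mgr last.
    # On each host, stopping the per-cluster systemd target stops all
    # cephadm-managed daemons on that host (<fsid> is the cluster FSID):
    systemctl stop ceph-<fsid>.target

    # Power the nodes back on in reverse order (active mgr host first),
    # wait until ceph -s shows all daemons up, then clear the flags:
    ceph -s
    ceph osd unset noout
    ceph osd unset norecover
    ceph osd unset norebalance
    ceph osd unset nobackfill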
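
On the "podman start says the container doesn't exist" symptom: cephadm starts each daemon from a per-daemon systemd unit and removes the container when the daemon stops, so there is usually no stopped container for podman start to find; restarting the systemd unit, or redeploying through the orchestrator once a mgr is active, is the expected path. A troubleshooting sketch, where <fsid> is a placeholder, mgr.dev-lx-ceph11 is the daemon name from the log, and osd.3 is a hypothetical example:

    # List the daemons cephadm knows about on this host:
    cephadm ls

    # Inspect and restart a missing daemon via its systemd unit,
    # e.g. the mgr on dev-lx-ceph11:
    systemctl status ceph-<fsid>@mgr.dev-lx-ceph11.service
    journalctl -u ceph-<fsid>@mgr.dev-lx-ceph11.service
    systemctl restart ceph-<fsid>@mgr.dev-lx-ceph11.service

    # Once a mgr is active again, check daemon state and redeploy/restart
    # through the orchestrator:
    ceph orch ps
    ceph orch daemon redeploy mgr.dev-lx-ceph11
    ceph orch daemon restart osd.3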
