Feature #47038
cephadm: Automatically deploy failed daemons on other hosts
Description
Currently, cephadm doesn't automatically redistribute containers to other hosts when a daemon fails; right now this is a manual step.
Lots of open questions here (see the sketch after the list):
- when exactly has a daemon failed?
- do we need a timeout?
- what about stopping a daemon on purpose?
- This has the potential to badly break a cluster if newly created MONs don't properly form a quorum.
- What if newly added daemons fail as well?
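As a rough illustration of what "failed" and "timeout" could mean in practice, here is a minimal, hypothetical polling loop that shells out to `ceph orch ps --format json` and flags daemons that have stayed in an error state longer than a grace period. The grace period, the set of error states, and the idea of acting on them automatically are assumptions, not existing cephadm behaviour; today an operator would remediate manually (e.g. by editing the service's placement spec or draining the affected host).

```python
# Hypothetical sketch only: cephadm does NOT do this today.
# It polls `ceph orch ps` and reports daemons stuck in an error state
# longer than a grace period, one possible answer to the
# "when has a daemon failed?" / "do we need a timeout?" questions above.
import json
import subprocess
import time

GRACE_PERIOD = 600  # seconds a daemon may stay failed before we act (assumed value)
ERROR_STATES = {"error", "unknown"}  # states treated as "failed" (assumption)

first_seen_failed = {}  # daemon name -> timestamp when first seen failing


def list_daemons():
    """Return the orchestrator's view of all daemons as a list of dicts."""
    out = subprocess.check_output(
        ["ceph", "orch", "ps", "--format", "json"], text=True
    )
    return json.loads(out)


def find_expired_failures():
    """Yield (daemon, host) pairs that have been failing longer than GRACE_PERIOD."""
    now = time.time()
    for d in list_daemons():
        name = d.get("daemon_name") or f"{d['daemon_type']}.{d['daemon_id']}"
        if d.get("status_desc") in ERROR_STATES:
            first = first_seen_failed.setdefault(name, now)
            if now - first >= GRACE_PERIOD:
                yield name, d.get("hostname")
        else:
            # Daemon is healthy again (or was stopped on purpose and restarted),
            # so forget any previous failure timestamp.
            first_seen_failed.pop(name, None)


if __name__ == "__main__":
    for name, host in find_expired_failures():
        # Placeholder: the actual "redeploy it elsewhere" step is exactly what
        # this feature request is about; today an operator would e.g. change
        # the service's placement spec or drain the offline host.
        print(f"{name} on {host} has been failing for more than {GRACE_PERIOD}s")
```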
Related issues
History
#1 Updated by Nathan Cutler over 3 years ago
Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:
OSD daemon: the target host might have a different set of disks containing completely different data?
NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?
IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?
RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?
MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
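One way the concerns above could be encoded, purely as a sketch with made-up policy names and no relation to any existing cephadm option, is a per-daemon-type redeploy policy: only daemon types that are stateless and not tied to a client-visible address would default to automatic redeployment.

```python
# Purely illustrative: the policy names and the mapping below are assumptions,
# not existing cephadm behaviour or configuration.
AUTO = "auto"      # safe to redeploy elsewhere automatically
MANUAL = "manual"  # needs operator action (local state, client expectations)
NEVER = "never"    # has its own failover mechanism; don't interfere

REDEPLOY_POLICY = {
    "mgr": AUTO,      # stateless
    "crash": AUTO,    # stateless
    "mon": MANUAL,    # quorum risk if new mons don't join cleanly
    "osd": NEVER,     # data lives on the failed host's disks
    "nfs": MANUAL,    # clients may expect the original server address
    "iscsi": MANUAL,  # same concern for iSCSI gateway clients
    "rgw": MANUAL,    # same concern unless fronted by a load balancer
    "mds": NEVER,     # MDS already has standby/failover handling
}


def policy_for(daemon_type: str) -> str:
    """Return the (hypothetical) redeploy policy for a daemon type."""
    return REDEPLOY_POLICY.get(daemon_type, MANUAL)
```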
#2 Updated by Sebastian Wagner over 3 years ago
Nathan Cutler wrote:
Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:
OSD daemon: the target host might have a different set of disks containing completely different data?
NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?
IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?
RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?
MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
Those are actually really good reasons to make HAproxy part of cephadm!
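For the client-facing daemon types (NFS, iSCSI, RGW), a load balancer is what removes the "clients expect the old host" problem: clients talk to a virtual IP, and the backend daemons can move between hosts. As a hedged illustration, cephadm's ingress service (haproxy + keepalived) can be applied roughly as below; the service names, IP, and ports are placeholders, not values taken from this issue.

```python
# Illustrative only: applies a cephadm "ingress" (haproxy + keepalived) spec
# in front of an RGW service, so clients use a stable virtual IP instead of
# whichever host currently runs the gateway. Names, IP and ports are
# placeholders.
import subprocess
import tempfile

INGRESS_SPEC = """\
service_type: ingress
service_id: rgw.myrealm          # placeholder service id
placement:
  count: 2
spec:
  backend_service: rgw.myrealm   # the RGW service being fronted
  virtual_ip: 192.0.2.10/24      # placeholder VIP
  frontend_port: 443
  monitor_port: 1967
"""

if __name__ == "__main__":
    # Write the spec to a temporary file and apply it,
    # equivalent to: ceph orch apply -i <spec.yaml>
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(INGRESS_SPEC)
        path = f.name
    subprocess.run(["ceph", "orch", "apply", "-i", path], check=True)
```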
#3 Updated by Sebastian Wagner over 3 years ago
- Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
#4 Updated by Sebastian Wagner over 3 years ago
- Related to Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts added
#5 Updated by Sebastian Wagner over 3 years ago
- Tracker changed from Bug to Feature
#6 Updated by Sebastian Wagner about 3 years ago
- Related to Feature #48624: ceph orch drain <host> added
#7 Updated by Sebastian Wagner over 2 years ago
- Related to Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
#8 Updated by Sebastian Wagner over 2 years ago
- Blocks deleted (Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts))
#9 Updated by Sebastian Wagner over 2 years ago
- Duplicated by Feature #53378: cephadm: redeploy nfs-ganesha service that was running in a host that went offline added
#10 Updated by Sebastian Wagner over 2 years ago
- Priority changed from Normal to High