Feature #47038
cephadm: Automatically deploy failed daemons on other hosts
Description
Currently, cephadm doesn't automatically redistribute containers to new hosts; this is a manual step.
Lots of open questions here:
- When exactly has a daemon failed?
- Do we need a timeout?
- What about stopping a daemon on purpose?
- This has the potential to badly break a cluster if newly created MONs fail to form a quorum.
- What if newly added daemons fail as well?
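For context, cephadm places daemons according to a service specification, so any automatic redeployment would have to respect the spec's placement rules. A minimal example of such a spec (hostnames are placeholders):

```yaml
# Minimal cephadm service spec pinning MON daemons to explicit hosts.
# With an explicit host list like this, cephadm has nowhere else to
# schedule a replacement if one of these hosts fails.
service_type: mon
placement:
  hosts:
    - host1
    - host2
    - host3
```

A `count: 3` placement over a larger set of candidate hosts would, by contrast, give the scheduler room to pick a replacement host.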
Updated by Nathan Cutler over 3 years ago
Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:
OSD daemon: the target host might have a different set of disks containing completely different data?
NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?
IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?
RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?
MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
Updated by Sebastian Wagner over 3 years ago
Nathan Cutler wrote:
Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:
OSD daemon: the target host might have a different set of disks containing completely different data?
NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?
IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?
RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?
MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
Those are actually really good reasons to make HAProxy part of cephadm!
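(Cephadm later gained an `ingress` service type that deploys HAProxy and keepalived in front of a backend service, which addresses the client-facing concerns above for RGW and NFS: clients connect to a stable virtual IP rather than to whichever host the daemon happens to run on. A sketch of such a spec, with all values illustrative:)

```yaml
# Illustrative cephadm ingress spec: HAProxy + keepalived in front of an
# RGW service. The virtual IP stays stable even if backend daemons move.
service_type: ingress
service_id: rgw.foo          # hypothetical backend service id
placement:
  count: 2
spec:
  backend_service: rgw.foo   # the service being load-balanced
  virtual_ip: 10.0.0.100/24  # example VIP; clients connect here
  frontend_port: 8080
  monitor_port: 1967
```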
Updated by Sebastian Wagner over 3 years ago
- Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
Updated by Sebastian Wagner over 3 years ago
- Related to Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts added
Updated by Sebastian Wagner over 3 years ago
- Tracker changed from Bug to Feature
Updated by Sebastian Wagner over 3 years ago
- Related to Feature #48624: ceph orch drain <host> added
Updated by Sebastian Wagner over 2 years ago
- Related to Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
Updated by Sebastian Wagner over 2 years ago
- Blocks deleted (Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts))
Updated by Sebastian Wagner over 2 years ago
- Has duplicate Feature #53378: cephadm: redeploy nfs-ganesha service that was running in a host that went offline added
Updated by Sebastian Wagner over 2 years ago
- Priority changed from Normal to High