Bug #47038

cephadm: Automatically deploy failed daemons on other hosts

Added by Sebastian Wagner 2 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: cephadm/scheduler
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Currently, cephadm doesn't automatically redistribute containers to new hosts when daemons fail; right now, this is a manual step.

There are lots of open questions here (one possible approach is sketched after the list):

  • When exactly has a daemon failed?
  • Do we need a timeout?
  • What about stopping a daemon on purpose?
  • This has the potential to badly break a cluster if newly created MONs don't properly form a quorum.
  • What if the newly added daemons fail as well?
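
For illustration only, here is a minimal, self-contained sketch (not the actual cephadm implementation) of one possible policy: treat a daemon as failed once its host has been unreachable for longer than a grace period, never touch daemons that were stopped on purpose, and skip daemon types that cannot safely move. All names here (DaemonDescription, host_last_seen, select_replacement_host, FAIL_GRACE_PERIOD) are hypothetical.

# Hypothetical sketch only -- not the actual cephadm code.
import time
from dataclasses import dataclass

FAIL_GRACE_PERIOD = 600  # seconds; hypothetical timeout before acting

@dataclass
class DaemonDescription:          # hypothetical, simplified
    daemon_type: str              # e.g. 'mon', 'mgr', 'osd'
    daemon_id: str
    hostname: str
    intentionally_stopped: bool   # set when an operator stopped it

def host_last_seen(hostname):
    """Hypothetical helper: timestamp of the last successful host check."""
    raise NotImplementedError

def select_replacement_host(daemon):
    """Hypothetical helper: pick a reachable host matching the placement spec."""
    raise NotImplementedError

def daemons_to_redeploy(daemons):
    """Return (failed daemon, replacement host) pairs that look safe to act on."""
    actions = []
    now = time.time()
    for d in daemons:
        if d.intentionally_stopped:
            continue    # stopped on purpose, not failed
        if now - host_last_seen(d.hostname) < FAIL_GRACE_PERIOD:
            continue    # not down long enough yet
        if d.daemon_type == 'osd':
            continue    # OSD data lives on the original host's disks
        target = select_replacement_host(d)
        if target is not None:
            actions.append((d, target))
    return actions

Even with something like this, the MON concern above remains: automatically adding a new mon can make things worse if it never joins quorum.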

Related issues

Blocks Orchestrator - Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) New

History

#1 Updated by Nathan Cutler 2 months ago

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
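
Purely as an illustration of the per-daemon-type concerns above (nothing cephadm actually implements): they could be captured as a simple allowlist that defaults to "do not move automatically". All entries and defaults below are assumptions.

# Hypothetical sketch: per-daemon-type policy for automatic redeployment.
SAFE_TO_REDEPLOY = {
    'mgr': True,
    'crash': True,
    'node-exporter': True,
    'osd': False,    # data lives on the original host's local disks
    'nfs': False,    # clients may expect the server on the original host
    'iscsi': False,  # same concern for iSCSI gateway (IGW) clients
    'rgw': False,    # unless clients reach RGW through a load balancer
    'mds': False,    # MDS has its own failover; avoid interfering with it
    'mon': False,    # new mons must join quorum; needs extra care
}

def may_auto_redeploy(daemon_type):
    """Default to 'no' for unknown daemon types."""
    return SAFE_TO_REDEPLOY.get(daemon_type, False)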

#2 Updated by Sebastian Wagner 2 months ago

Nathan Cutler wrote:

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?

Those are actually really good reasons to make HAproxy part of cephadm!

#3 Updated by Sebastian Wagner 2 months ago

  • Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
