Feature #47038

cephadm: Automatically deploy failed daemons on other hosts

Added by Sebastian Wagner 8 months ago. Updated 5 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
cephadm/scheduler
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently, cephadm does not automatically redistribute containers to new hosts. Right now, this is a manual step.

There are lots of open questions here:

  • When exactly has a daemon failed?
  • Do we need a timeout?
  • What about stopping a daemon on purpose?
  • This has the potential to break a cluster badly, if newly created MONs fail to form a quorum.
  • What if the newly added daemons fail as well?
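The open questions above are essentially the inputs to a redeploy decision. A minimal sketch of that decision, assuming hypothetical names (none of these types or functions exist in cephadm; the timeout value and the MON exclusion are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DaemonState:
    daemon_type: str          # e.g. "mon", "osd", "mds"
    last_seen: datetime       # time of the last successful health report
    stopped_on_purpose: bool  # operator intentionally stopped the daemon


def should_redeploy(d: DaemonState, now: datetime,
                    timeout: timedelta = timedelta(minutes=10)) -> bool:
    """Decide whether a daemon counts as 'failed' and may be moved.

    - the timeout answers "when exactly has a daemon failed?"
    - stopped_on_purpose answers "what about stopping a daemon on purpose?"
    - MONs are excluded, since newly created MONs might not form a quorum.
    """
    if d.stopped_on_purpose:
        return False
    if d.daemon_type == "mon":
        return False  # too risky: could break the cluster's quorum
    return now - d.last_seen > timeout
```

In this sketch, "what if newly added daemons fail as well?" would additionally require some retry limit or backoff around the caller, which is left out here.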

Related issues

Related to Orchestrator - Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts New
Related to Orchestrator - Feature #48624: ceph orch drain <host> New
Blocks Orchestrator - Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) New

History

#1 Updated by Nathan Cutler 8 months ago

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
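One way to summarize the per-daemon-type concerns above is a policy table. The entries below are illustrative assumptions drawn from this comment, not actual cephadm behaviour; the `mgr` and `crash` verdicts in particular are guesses added for contrast:

```python
# Hypothetical policy: which daemon types could be re-deployed on another
# host without further coordination, per the concerns listed above.
REDEPLOY_POLICY = {
    "osd": False,    # target host has different disks with different data
    "nfs": False,    # clients may expect the server on the original host
    "iscsi": False,  # same client-pinning problem as NFS (IGW)
    "rgw": False,    # same, unless fronted by a load balancer like HAproxy
    "mds": False,    # MDS failover may get confused by a resurrected daemon
    "mgr": True,     # assumption: stateless enough to move freely
    "crash": True,   # assumption: per-host helpers can simply be re-created
}


def safe_to_redeploy(daemon_type: str) -> bool:
    # Default to the conservative choice for unknown daemon types.
    return REDEPLOY_POLICY.get(daemon_type, False)
```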

#2 Updated by Sebastian Wagner 8 months ago

Nathan Cutler wrote:

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?

Those are actually really good reasons to make HAproxy part of cephadm!

#3 Updated by Sebastian Wagner 8 months ago

  • Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added

#4 Updated by Sebastian Wagner 5 months ago

  • Related to Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts added

#5 Updated by Sebastian Wagner 5 months ago

  • Tracker changed from Bug to Feature

#6 Updated by Sebastian Wagner 3 months ago
