Feature #47038

cephadm: Automatically deploy failed daemons on other hosts

Added by Sebastian Wagner over 3 years ago. Updated over 2 years ago.

Status: New
Priority: High
Assignee: -
Category: cephadm/scheduler
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently, cephadm doesn't automatically redistribute containers (daemons) from failed or offline hosts to other hosts; right now, this is a manual step.

There are lots of open questions here (a rough sketch of one possible failure rule follows the list):

  • When exactly has a daemon failed?
  • Do we need a timeout?
  • What about stopping a daemon on purpose?
  • This has the potential to really badly break a cluster if newly created MONs won't properly form a quorum.
  • What if newly added daemons fail as well?
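As an illustration only (this is not cephadm code, and the names OFFLINE_GRACE, DaemonState, and needs_redeploy are hypothetical), one possible rule is: a daemon counts as failed only after its host has been unreachable for a grace period, and daemons that were stopped on purpose are never moved.

    # Hypothetical sketch, not cephadm code.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Optional

    OFFLINE_GRACE = timedelta(minutes=10)   # assumed timeout, not an actual cephadm default


    @dataclass
    class DaemonState:
        daemon_type: str                        # e.g. 'mon', 'mgr', 'nfs'
        host: str
        stopped_by_user: bool                   # e.g. via 'ceph orch daemon stop <name>'
        host_offline_since: Optional[datetime]  # None while the host is reachable


    def needs_redeploy(d: DaemonState, now: datetime) -> bool:
        """Should the scheduler place this daemon on another host?"""
        if d.stopped_by_user:
            return False                        # stopping on purpose is not a failure
        if d.host_offline_since is None:
            return False                        # host still reachable, nothing to do
        return now - d.host_offline_since > OFFLINE_GRACE


    # Example: an NFS daemon whose host has been offline for 15 minutes.
    daemon = DaemonState('nfs', 'host3', False, datetime.now() - timedelta(minutes=15))
    print(needs_redeploy(daemon, datetime.now()))   # True -> candidate for redeployment

Under a rule like this, a short host reboot inside the grace period would not trigger any redeployment, which would address the timeout question above.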

Related issues

Related to Orchestrator - Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts (Duplicate)
Related to Orchestrator - Feature #48624: ceph orch drain <host> (Resolved)
Related to Orchestrator - Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) (Can't reproduce)
Duplicated by Orchestrator - Feature #53378: cephadm: redeploy nfs-ganesha service that was running in a host that went offline (Duplicate)

History

#1 Updated by Nathan Cutler over 3 years ago

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
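One way to act on these per-daemon-type caveats would be an explicit policy map that the scheduler consults before moving anything; only types marked safe would ever be redeployed automatically. The sketch below is purely hypothetical (REDEPLOY_POLICY and may_redeploy_automatically are not cephadm APIs).

    # Hypothetical illustration only, not cephadm code: encode the per-type caveats
    # as an explicit policy the scheduler checks before redeploying a failed daemon.
    REDEPLOY_POLICY = {
        'mgr':   'safe',                     # stateless, any reachable host will do
        'mon':   'needs-quorum-check',       # must never risk breaking quorum
        'osd':   'never',                    # data lives on the original host's disks
        'nfs':   'needs-client-redirect',    # clients may pin the old server address
        'iscsi': 'needs-client-redirect',
        'rgw':   'needs-client-redirect',
        'mds':   'leave-to-mds-failover',    # MDS already has its own failover logic
    }


    def may_redeploy_automatically(daemon_type: str) -> bool:
        """Only daemon types explicitly marked 'safe' are moved without operator input."""
        return REDEPLOY_POLICY.get(daemon_type, 'never') == 'safe'


    print(may_redeploy_automatically('mgr'))   # True
    print(may_redeploy_automatically('osd'))   # False

Everything not marked 'safe' would fall back to today's manual workflow.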

#2 Updated by Sebastian Wagner over 3 years ago

Nathan Cutler wrote:

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?

Those are actually really good reasons to make HAProxy part of cephadm!

#3 Updated by Sebastian Wagner over 3 years ago

  • Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added

#4 Updated by Sebastian Wagner over 3 years ago

  • Related to Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts added

#5 Updated by Sebastian Wagner over 3 years ago

  • Tracker changed from Bug to Feature

#6 Updated by Sebastian Wagner about 3 years ago

#7 Updated by Sebastian Wagner over 2 years ago

  • Related to Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added

#8 Updated by Sebastian Wagner over 2 years ago

  • Blocks deleted (Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts))

#9 Updated by Sebastian Wagner over 2 years ago

  • Duplicated by Feature #53378: cephadm: redeploy nfs-ganesha service that was running in a host that went offline added

#10 Updated by Sebastian Wagner over 2 years ago

  • Priority changed from Normal to High
