Bug #47038

cephadm: Automatically deploy failed daemons on other hosts

Added by Sebastian Wagner 2 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: cephadm/scheduler
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Currently, cephadm doesn't automatically redistribute containers to new hosts when daemons fail; right now, this is a manual step.

There are lots of open questions here (one possible approach is sketched after the list):

  • When exactly has a daemon failed?
  • Do we need a timeout?
  • What about stopping a daemon on purpose?
  • This has the potential to badly break a cluster if newly created MONs don't properly form a quorum.
  • What if the newly added daemons fail as well?
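
For illustration only, here is a minimal, self-contained sketch (not the actual cephadm implementation) of one possible policy: treat a daemon as failed once its host has been unreachable for longer than a grace period, never touch daemons that were stopped on purpose, and skip daemon types that cannot safely move. All names here (DaemonDescription, host_last_seen, select_replacement_host, FAIL_GRACE_PERIOD) are hypothetical.

# Hypothetical sketch only -- not the actual cephadm code.
import time
from dataclasses import dataclass

FAIL_GRACE_PERIOD = 600  # seconds; hypothetical timeout before acting

@dataclass
class DaemonDescription:          # hypothetical, simplified
    daemon_type: str              # e.g. 'mon', 'mgr', 'osd'
    daemon_id: str
    hostname: str
    intentionally_stopped: bool   # set when an operator stopped it

def host_last_seen(hostname):
    """Hypothetical helper: timestamp of the last successful host check."""
    raise NotImplementedError

def select_replacement_host(daemon):
    """Hypothetical helper: pick a reachable host matching the placement spec."""
    raise NotImplementedError

def daemons_to_redeploy(daemons):
    """Return (failed daemon, replacement host) pairs that look safe to act on."""
    actions = []
    now = time.time()
    for d in daemons:
        if d.intentionally_stopped:
            continue    # stopped on purpose, not failed
        if now - host_last_seen(d.hostname) < FAIL_GRACE_PERIOD:
            continue    # not down long enough yet
        if d.daemon_type == 'osd':
            continue    # OSD data lives on the original host's disks
        target = select_replacement_host(d)
        if target is not None:
            actions.append((d, target))
    return actions

Even with something like this, the MON concern above remains: automatically adding a new mon can make things worse if it never joins quorum.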

Related issues

Blocks Orchestrator - Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) New

History

#1 Updated by Nathan Cutler 2 months ago

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?
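
Purely as an illustration of the per-daemon-type concerns above (nothing cephadm actually implements): they could be captured as a simple allowlist that defaults to "do not move automatically". All entries and defaults below are assumptions.

# Hypothetical sketch: per-daemon-type policy for automatic redeployment.
SAFE_TO_REDEPLOY = {
    'mgr': True,
    'crash': True,
    'node-exporter': True,
    'osd': False,    # data lives on the original host's local disks
    'nfs': False,    # clients may expect the server on the original host
    'iscsi': False,  # same concern for iSCSI gateway (IGW) clients
    'rgw': False,    # unless clients reach RGW through a load balancer
    'mds': False,    # MDS has its own failover; avoid interfering with it
    'mon': False,    # new mons must join quorum; needs extra care
}

def may_auto_redeploy(daemon_type):
    """Default to 'no' for unknown daemon types."""
    return SAFE_TO_REDEPLOY.get(daemon_type, False)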

#2 Updated by Sebastian Wagner 2 months ago

Nathan Cutler wrote:

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?

Those are actually really good reasons to make HAproxy part of cephadm!

#3 Updated by Sebastian Wagner 2 months ago

  • Blocks Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) added
