Feature #47038: cephadm: Automatically deploy failed daemons on other hosts - Orchestrator - Ceph

Added by Sebastian Wagner over 3 years ago. Updated over 2 years ago.

Status: New
Priority: High
Assignee:
Category: cephadm/scheduler
Target version:
% Done:
Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently, cephadm doesn't automatically redistribute containers to new hosts when a daemon fails. Right now, this is a manual step.

There are a lot of open questions here:

When exactly has a daemon failed?
Do we need a timeout?
What about stopping a daemon on purpose?
This could badly break a cluster if newly created MONs fail to form a quorum.
What if newly added daemons fail as well?
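
To make the open questions above concrete, here is a minimal sketch of what a redeploy decision could look like. It is not cephadm code: `DaemonState`, `should_redeploy`, the 300-second timeout, the retry limit, and the choice to exclude MONs are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DaemonState:
    """Hypothetical view of one daemon; not a real cephadm type."""
    name: str
    daemon_type: str
    running: bool
    stopped_on_purpose: bool       # e.g. an admin stopped it deliberately
    down_since: Optional[float]    # timestamp in seconds, None if never seen down
    failed_redeploys: int = 0      # earlier automatic redeploys that also failed


def should_redeploy(
    d: DaemonState,
    now: float,
    timeout: float = 300.0,                   # assumed grace period
    max_redeploys: int = 2,                   # assumed retry limit
    excluded_types: Tuple[str, ...] = ("mon",),  # assumed: never move MONs automatically
) -> bool:
    """Return True if the daemon may be redeployed on another host."""
    if d.running or d.down_since is None:
        return False   # not failed at all
    if d.stopped_on_purpose:
        return False   # intentional stop is not a failure
    if d.daemon_type in excluded_types:
        return False   # too risky, e.g. quorum problems
    if d.failed_redeploys >= max_redeploys:
        return False   # replacements keep failing, stop retrying
    return now - d.down_since >= timeout   # only act after the timeout
```

Each early return corresponds to one question from the list: the timeout, intentional stops, the MON quorum risk, and replacements that fail again. Sensible values for the thresholds, and whether this is the right policy at all, are exactly what remains undecided.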

Related issues: 4 (0 open, 4 closed)

Related to Orchestrator - Feature #47782: ceph orch host rm <host> is not stopping the services deployed in the respective removed hosts (Status: Duplicate)
Related to Orchestrator - Feature #48624: ceph orch drain <host> (Status: Resolved; Daniel Pivonka)
Related to Orchestrator - Bug #43838: cephadm: Forcefully Remove Services (unresponsive hosts) (Status: Can't reproduce)
Has duplicate Orchestrator - Feature #53378: cephadm: redeploy nfs-ganesha service that was running in a host that went offline (Status: Duplicate)

History

Updated by Nathan Cutler over 3 years ago

Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:

OSD daemon: the target host might have a different set of disks containing completely different data?

NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?

IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?

RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?

MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?


Updated by Sebastian Wagner over 3 years ago

Nathan Cutler wrote:

> Which daemon types would this apply to? I can think of more potential problems, depending on daemon type:
>
> OSD daemon: the target host might have a different set of disks containing completely different data?
>
> NFS daemon: NFS clients might be expecting the NFS server to be on the original (pre-move) host?
>
> IGW daemon: IGW clients might be expecting the IGW server to be on the original (pre-move) host?
>
> RGW daemon: RGW clients might be expecting the RGW server to be on the original (pre-move) host?
>
> MDS daemon: MDS already has a failover system, which might get confused if a failed MDS suddenly re-appears on a different host?

Those are actually really good reasons to make HAProxy part of cephadm!
