Bug #46606

cephadm: post-bootstrap monitoring deployment only works if the command "ceph mgr module enable prometheus" has already been issued

Added by Nathan Cutler over 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: cephadm/monitoring
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Post-bootstrap monitoring deployment only works if the command "ceph mgr module enable prometheus" has already been issued

Reported by Dmitri Savineau here: https://tracker.ceph.com/issues/46561#note-5

Deploying monitoring after bootstrap requires running an extra ceph command to enable the prometheus mgr module (bootstrap does this automatically) [1]

[1] https://github.com/ceph/ceph/blob/master/src/cephadm/cephadm#L2877-L2879
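The workaround described above can be sketched as a small script: check whether the prometheus mgr module is enabled, enable it if not, and only then apply the prometheus service. This is a minimal sketch, not what cephadm does internally. The helper names `run_ceph` and `deploy_monitoring` are hypothetical, and the `enabled_modules` key in the JSON output of "ceph mgr module ls" is an assumption that may vary between Ceph releases.

```python
import json
import subprocess
from typing import Callable, List, Sequence

# A runner takes ceph CLI arguments and returns the command's stdout.
Runner = Callable[[Sequence[str]], str]


def run_ceph(args: Sequence[str]) -> str:
    """Run a ceph CLI command (needs a reachable cluster and admin keyring)."""
    result = subprocess.run(
        ["ceph", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def deploy_monitoring(run: Runner = run_ceph) -> List[List[str]]:
    """Enable the prometheus mgr module if needed, then apply prometheus.

    Returns the list of ceph argument lists that were issued, in order.
    """
    issued: List[List[str]] = []

    def call(args: Sequence[str]) -> str:
        issued.append(list(args))
        return run(args)

    # Assumption: the JSON output contains an "enabled_modules" list.
    modules = json.loads(call(["mgr", "module", "ls", "--format", "json"]))
    if "prometheus" not in modules.get("enabled_modules", []):
        # The extra step that bootstrap normally performs automatically.
        call(["mgr", "module", "enable", "prometheus"])

    call(["orch", "apply", "prometheus"])
    return issued
```

Calling `deploy_monitoring()` on an admin node runs the real commands. Passing a stub runner instead lets the command sequence be checked without a cluster.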

History

#1 Updated by Nathan Cutler over 3 years ago

  • Related to Bug #46561: cephadm: monitoring services adoption doesn't honor the container image added

#2 Updated by Nathan Cutler over 3 years ago

  • Subject changed from Post-bootstrap monitoring deployment only works if the command "ceph mgr module enable prometheus" has already been issued to cephadm: post-bootstrap monitoring deployment only works if the command "ceph mgr module enable prometheus" has already been issued

#3 Updated by Nathan Cutler over 3 years ago

  • Description updated (diff)

#4 Updated by Nathan Cutler over 3 years ago

  • Description updated (diff)

#5 Updated by Sebastian Wagner over 3 years ago

  • Category set to cephadm/monitoring

#6 Updated by Sebastian Wagner about 3 years ago

  • Priority changed from Normal to High

#7 Updated by Juan Miguel Olmo Martínez about 3 years ago

  • Assignee set to Sebastian Wagner

#8 Updated by Sebastian Wagner about 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 39520

#9 Updated by Sebastian Wagner almost 3 years ago

  • Status changed from Fix Under Review to New

#10 Updated by Sebastian Wagner almost 3 years ago

  • Related to deleted (Bug #46561: cephadm: monitoring services adoption doesn't honor the container image)

#11 Updated by Sage Weil almost 3 years ago

A couple of options:

- make the 'orch apply prometheus' fail if the mgr prometheus module isn't enabled. (maybe include a --force in case the user really wants to proceed?)
- make cephadm raise a health warning if there is a prometheus deployed but the prometheus module isn't enabled
- make 'orch apply prometheus' silently enable the prometheus module
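To compare the three options side by side, here is a purely illustrative sketch of how each would treat an 'orch apply prometheus' request when the module is not enabled. None of this is cephadm code: the `Policy` enum, the `plan_prometheus_apply` function, and the action strings are hypothetical names invented for the comparison.

```python
from enum import Enum
from typing import List, Set, Tuple


class Policy(Enum):
    FAIL = "fail"                # option 1: refuse unless forced
    HEALTH_WARNING = "warn"      # option 2: deploy, but raise a health warning
    AUTO_ENABLE = "auto-enable"  # option 3: silently enable the module


def plan_prometheus_apply(
    enabled_modules: Set[str], policy: Policy, force: bool = False
) -> Tuple[bool, List[str]]:
    """Return (proceed, extra_actions) for an 'orch apply prometheus' request."""
    if "prometheus" in enabled_modules:
        # Nothing special to do: the module is already enabled.
        return True, []

    if policy is Policy.FAIL:
        if force:
            # The user explicitly asked to proceed anyway.
            return True, []
        return False, ["error: mgr prometheus module is not enabled"]

    if policy is Policy.HEALTH_WARNING:
        return True, ["raise health warning: prometheus module not enabled"]

    # Policy.AUTO_ENABLE
    return True, ["enable mgr module prometheus"]
```

For example, `plan_prometheus_apply({"dashboard"}, Policy.AUTO_ENABLE)` proceeds with one extra action, while the same call with `Policy.FAIL` does not proceed unless `force=True` is given.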

#12 Updated by Sebastian Wagner almost 3 years ago

I'd definitely go for making 'orch apply prometheus' silently enable the prometheus module.

#13 Updated by Nathan Cutler almost 3 years ago

- make the 'orch apply prometheus' fail if the mgr prometheus module isn't enabled. (maybe include a --force in case the user really wants to proceed?)

This one is slightly problematic because there is not just "orch apply prometheus" with a prometheus-specific yaml blob, but also "orch apply" with a BIG yaml blob (with sections for various kinds of services/daemons).

Arguably, the "orch apply" command (with BIG yaml blob) should fail if any part of the yaml cannot be fulfilled. But that's not how the orchestrator works: "orch apply" is fulfilled as a background task, and when something goes wrong it's not always obvious to the user what happened and why, since finding out typically involves a post-mortem examination of the mgr logs.

To say it another way: "orch apply" is like a "moon shot". Everything has to be prepared in advance. Once the rocket is on its way up, there isn't any good way of aborting the mission.

(Caveat: this is just my impression as a casual user of "orch apply", not based on any deep knowledge of the code or even the design)

#14 Updated by Sebastian Wagner almost 3 years ago

  • Priority changed from High to Normal

prio=normal, as this is not trivial to implement

#15 Updated by Sebastian Wagner over 2 years ago

  • Assignee deleted (Sebastian Wagner)

#16 Updated by Sebastian Wagner over 2 years ago

  • Status changed from New to Resolved
  • Pull request ID changed from 39520 to 42682

PR 42682