Bug #24395

closed

Ceph MGR Prometheus module errors during failover

Added by Alexander Trost almost 6 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
prometheus module
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

To reproduce, you need to run a Rook build with this PR reverted [Revert "Run two MGRs to have one in standby mode" #1334](https://github.com/rook/rook/pull/1334).
1. Start Rook Ceph Cluster.
2. Verify two Ceph MGR Deployments/Pods (rook-ceph-mgr{0..1}) have been spawned.
3. Scale down the currently active MGR.
4. Wait for the mon mgr beacon grace (30 seconds) to pass and for the remaining MGR to become active.
5. The second MGR gets stuck with send_beacon active (starting) and eventually errors with:

rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822031 I | ceph-mgr: 2018-06-02 11:16:06.821545 7fe495e16700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'prometheus' while running on mgr.rook-ceph-mgr1: IOError("Port 9283 not bound on '::'",)
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822070 I | ceph-mgr: 2018-06-02 11:16:06.821588 7fe495e16700 -1 prometheus.serve:
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822074 I | ceph-mgr: 2018-06-02 11:16:06.821599 7fe495e16700 -1 Traceback (most recent call last):
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822076 I | ceph-mgr:   File "/usr/lib64/ceph/mgr/prometheus/module.py", line 673, in serve
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822079 I | ceph-mgr:     cherrypy.engine.start()
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822081 I | ceph-mgr:   File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 250, in start
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822083 I | ceph-mgr:     raise e_info
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822086 I | ceph-mgr: ChannelFailures: IOError("Port 9283 not bound on '::'",)
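
For anyone trying to narrow this down: as far as I understand, CherryPy raises "Port ... not bound" when it polls the port after starting the server and finds nothing listening there. A minimal sketch of a similar check that can be run from inside the MGR pod is below. The helper name `port_is_listening` and the probe address `::1` are my own assumptions and not part of Ceph or CherryPy; `'::'` is the wildcard bind address, so a concrete address has to be probed instead.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for debugging; it only tells us whether
    something accepts connections on the port, not which process.
    """
    try:
        # create_connection resolves the host and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    # 9283 is the port from the traceback above; "::1" is an assumed
    # probe address (IPv6 loopback) for a server bound to '::'.
    print(port_is_listening("::1", 9283))
```

If this reports False while the MGR is stuck in active (starting), the prometheus HTTP server never came up; if it reports True, something is listening on the port, which may be worth checking against the previously active MGR.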

Full logs can be found here: https://gist.github.com/galexrt/d32cab3c4029617c9e8180f66d286129

The only way to recover from this is to stop all MGRs, wait for the cluster to go into HEALTH_WARN (no mgrs running), and then start the MGRs back up.
Another possible way to recover is to disable the Prometheus module, but I haven't tested that.
With the Prometheus module disabled beforehand, the MGRs can "recover"/switch active during a failover without hanging in active (starting) forever.
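
To check for the stuck state without reading the logs, one option is to look at the JSON printed by `ceph mgr dump`. The sketch below assumes that the dump contains the fields `active_name` and `available`, and that `available` being false corresponds to the active (starting) state shown in the status output. The function name `mgr_state` and the sample JSON are hypothetical and trimmed down for illustration.

```python
import json


def mgr_state(mgr_dump_json: str) -> str:
    """Summarize the active MGR state from `ceph mgr dump` JSON output.

    Assumes the fields `active_name` and `available` are present;
    missing fields are treated as "no active mgr" / "not available".
    """
    mgr_map = json.loads(mgr_dump_json)
    active = mgr_map.get("active_name") or ""
    if not active:
        return "no active mgr"
    if mgr_map.get("available"):
        return f"{active}: active"
    # An active MGR that is not yet available is still starting up.
    return f"{active}: active (starting)"


# Hypothetical sample, not captured from the affected cluster.
sample = '{"active_name": "rook-ceph-mgr1", "available": false, "standbys": []}'
print(mgr_state(sample))
```

Running this repeatedly against fresh `ceph mgr dump` output after the failover would show whether the remaining MGR ever leaves active (starting).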

This is with Ceph version:

ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)

Actions #1

Updated by Jan Fajerski over 5 years ago

Is this still an issue? The cherrypy start and stop code has changed since then.

Actions #2

Updated by Jan Fajerski over 5 years ago

  • Assignee set to Jan Fajerski

Actions #3

Updated by Jan Fajerski over 5 years ago

  • Status changed from New to Closed

Closing due to age. Feel free to re-open if necessary.
