Bug #24395
Ceph MGR Prometheus module errors during failover
Status: Closed
Description
To reproduce, you need to run a Rook build with this PR reverted: [Revert "Run two MGRs to have one in standby mode" #1334](https://github.com/rook/rook/pull/1334).
1. Start Rook Ceph Cluster.
2. Verify that two Ceph MGR Deployments/Pods (rook-ceph-mgr{0..1}) have been spawned.
3. Scale down the current active MGR (see the sketch below the log excerpt).
4. Wait for the mon mgr beacon grace (30 seconds) to pass and for the remaining MGR to become active.
5. The second MGR gets stuck in send_beacon with state active (starting) and eventually errors with:
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822031 I | ceph-mgr: 2018-06-02 11:16:06.821545 7fe495e16700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'prometheus' while running on mgr.rook-ceph-mgr1: IOError("Port 9283 not bound on '::'",)
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822070 I | ceph-mgr: 2018-06-02 11:16:06.821588 7fe495e16700 -1 prometheus.serve:
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822074 I | ceph-mgr: 2018-06-02 11:16:06.821599 7fe495e16700 -1 Traceback (most recent call last):
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822076 I | ceph-mgr: File "/usr/lib64/ceph/mgr/prometheus/module.py", line 673, in serve
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822079 I | ceph-mgr: cherrypy.engine.start()
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822081 I | ceph-mgr: File "/usr/lib/python2.7/site-packages/cherrypy/process/wspbus.py", line 250, in start
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822083 I | ceph-mgr: raise e_info
rook-ceph-mgr1-5574d59cb9-227jp rook-ceph-mgr1 2018-06-02 11:16:06.822086 I | ceph-mgr: ChannelFailures: IOError("Port 9283 not bound on '::'",)
Full logs can be found here: https://gist.github.com/galexrt/d32cab3c4029617c9e8180f66d286129
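Step 3 above can be scripted roughly as follows. This is only a sketch: the namespace, the assumption that the active MGR's Deployment shares its mgr name, and driving the CLIs via subprocess are all assumptions about a default Rook install of this era.

```python
# Rough sketch of step 3: find which mgr is active, then scale its
# Deployment to zero to force a failover. Namespace and the
# mgr-name == Deployment-name mapping are assumptions.
import json
import subprocess

NAMESPACE = 'rook'  # assumption; adjust to where the cluster runs

# `ceph mgr dump` reports the active mgr name, e.g. "rook-ceph-mgr0".
dump = json.loads(subprocess.check_output(['ceph', 'mgr', 'dump']))
active = dump['active_name']

# Scale the active mgr's Deployment down to trigger the failover.
subprocess.check_call(['kubectl', '-n', NAMESPACE, 'scale',
                       'deployment', active, '--replicas=0'])
```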
The only way to recover from this is to stop all MGRs, wait for the cluster to go into HEALTH_WARN (no mgrs running), and then start the MGRs back up.
Another possible way to recover might be to disable the Prometheus module, but I haven't tested that.
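For reference, that untested workaround would look something like this; it assumes the `ceph` CLI can reach the cluster (e.g. from the Rook toolbox pod):

```python
# Sketch of the untested workaround: disable the prometheus module so the
# failover can complete, then re-enable it afterwards.
import subprocess

subprocess.check_call(['ceph', 'mgr', 'module', 'disable', 'prometheus'])
# ... wait for `ceph status` to show the remaining mgr as active ...
subprocess.check_call(['ceph', 'mgr', 'module', 'enable', 'prometheus'])
```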
Disabling the Prometheus module beforehand and then going through the failover, the MGRs can "recover"/switch the active role without hanging in active (starting) forever.
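For context, the traceback above comes from the module's serve() starting the process-wide cherrypy engine. A minimal sketch of the start/stop cycle a failover exercises is below; everything beyond the serve()/cherrypy.engine.start() visible in the traceback is an assumption about the Luminous-era module.

```python
# Minimal sketch of the cherrypy engine start/stop cycle the prometheus
# module goes through when a mgr flips between standby and active within
# one process. The handler and config mirror the bind shown in the error;
# the cycle itself is an assumption about the module internals.
import cherrypy

class Metrics(object):
    @cherrypy.expose
    def index(self):
        return 'metrics would be rendered here'

cherrypy.config.update({
    'server.socket_host': '::',   # same wildcard bind as in the error
    'server.socket_port': 9283,
})
cherrypy.tree.mount(Metrics(), '/')

cherrypy.engine.start()   # first activation binds [::]:9283
cherrypy.engine.stop()    # teardown during failover

# A second start in the same process is where the cherrypy bundled with
# Luminous could fail its port check and surface as
# ChannelFailures(IOError("Port 9283 not bound on '::'",)).
cherrypy.engine.start()
cherrypy.engine.stop()
cherrypy.engine.exit()
```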
This is with Ceph version:
ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable)
Updated by Jan Fajerski over 5 years ago
Is this still an issue? The cherrypy start and stop code has changed since then.
Updated by Jan Fajerski over 5 years ago
- Status changed from New to Closed
Closing due to age. Feel free to re-open if necessary.