Bug #52929
mgr/prometheus: mgr triggers ERROR when promoted from standby to active: "Port 9283 not bound on ..."
Status: open
Description
Description of problem
When in standby mode, ceph-mgr binds to TCP port 9283 on all IP addresses ("TCP *:9283"). On promotion to active, ceph-mgr tries to bind the same TCP port on a specific IP address on the host, and fails. This puts the cluster into ERROR state. Stopping and restarting the mgr (quickly, so that failover does not occur and the mgr instead starts in active mode) clears the error.
Environment
- ceph version string: 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
- Platform (OS/distro/release): Linux/Debian/11.1 (bullseye)
- Cluster details (nodes, monitors, OSDs): 3, 3, 3
How reproducible
Always reproducible
Actual results
2021-10-14T10:22:21.777354+0100 mgr.medlar [ERR] Unhandled exception from module 'prometheus' while running on mgr.medlar: OSError("Port 9283 not bound on 'fec0:bbbb::5'")
Expected results
mgr standby -> active promotion should not raise MGR_MODULE_ERROR
Additional info
Looking at the code (pacific branch), I think this is caused by two different sets of logic being used to derive the address to listen on for the prometheus exporter.
In Module->serve(), the code uses:
    server_addr = cast(str, self.get_localized_module_option(
        'server_addr', get_default_addr()))
... but then later conditionally changes this:
    if server_addr in ['::', '0.0.0.0']:
        server_addr = self.get_mgr_ip()
    self.set_uri(build_url(scheme='http', host=server_addr, port=server_port, path='/'))
whereas StandbyModule->serve() only does the following, without the later test:
    server_addr = self.get_localized_module_option(
        'server_addr', get_default_addr())
In my config, "ceph config-key get config/mgr/mgr/prometheus/server_addr" returns:
    0.0.0.0
and "ceph config show mgr.nectarine" shows:
    public_addr v2:[fec0:bbbb::5]:0/0
So I think that switching from standby mode to active causes the mgr to try and listen on the specific mgr IP address, but since it is already bound to the "any" address, this fails.
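The suspected conflict can be reproduced outside Ceph with plain sockets. The sketch below is only an illustration of the hypothesis above: it uses the Python standard library rather than the mgr's CherryPy server, uses IPv4 loopback instead of the IPv6 address from the report, and the names try_bind and demo_conflict are made up for this example. A listener on the wildcard address plays the role of the standby, and a second bind on a specific address plays the role of the promoted mgr.

```python
import errno
import socket


def try_bind(host: str, port: int) -> socket.socket:
    """Bind and listen on (host, port); the caller closes the returned socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def demo_conflict() -> int:
    """Return the errno raised by the second bind, or 0 if it succeeded."""
    # "Standby": listen on the wildcard address. Port 0 asks the OS for a free port.
    wildcard = try_bind("0.0.0.0", 0)
    port = wildcard.getsockname()[1]
    try:
        # "Active": try to bind the same port on one specific address.
        specific = try_bind("127.0.0.1", port)
    except OSError as exc:
        return exc.errno
    else:
        specific.close()
        return 0
    finally:
        wildcard.close()


if __name__ == "__main__":
    code = demo_conflict()
    print(errno.errorcode.get(code, "no error"))
```

On Linux the second bind is expected to fail with EADDRINUSE while the wildcard listener is still open, which would be consistent with the mgr being unable to bring up its listener after promotion. Whether this is exactly what happens inside the mgr process is not confirmed by this sketch.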
Probably the prometheus exporter should bind to the same IP address regardless of its active/standby state, and the conditional test which is applied in the active state should also occur for the standby state.
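One way to picture this suggestion is a single helper that both the active and the standby code paths call, so they can never disagree about the address. This is a minimal sketch, not a patch against the prometheus module: resolve_listen_addr and WILDCARD_ADDRS are hypothetical names, and it assumes the standby side has some way to learn the mgr IP, which this report does not confirm.

```python
WILDCARD_ADDRS = ("::", "0.0.0.0")


def resolve_listen_addr(configured: str, mgr_ip: str) -> str:
    """Hypothetical helper: map a wildcard server_addr to the mgr's own IP.

    Any other configured address is returned unchanged.
    """
    if configured in WILDCARD_ADDRS:
        return mgr_ip
    return configured


# Values taken from the report: server_addr is 0.0.0.0, public_addr is fec0:bbbb::5.
configured_addr = "0.0.0.0"
mgr_ip = "fec0:bbbb::5"

# Both roles derive the address through the same function,
# so a standby -> active promotion does not change it.
standby_addr = resolve_listen_addr(configured_addr, mgr_ip)
active_addr = resolve_listen_addr(configured_addr, mgr_ip)

print(standby_addr, active_addr)
```

The opposite choice, having both states keep the configured wildcard address for binding and use the mgr IP only for the published URI, would satisfy the same requirement. The sketch only shows that the derivation should be shared.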
Updated by Neha Ojha over 2 years ago
- Category changed from ceph-mgr to prometheus module