Bug #52929

open

mgr/prometheus: mgr triggers ERROR when promoted from standby to active "Port 9283 not bound on ..."

Added by Tim Small over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
prometheus module
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description of problem

While in standby mode, ceph-mgr binds to TCP port 9283 on all IP addresses ("TCP *:9283"). On promotion to active, ceph-mgr tries to bind the same TCP port on a specific IP address of the host, and fails. This puts the cluster into ERROR state. Stopping and restarting the mgr (quickly, so that failover does not occur and the mgr starts directly in active mode) clears the error.

Environment

  • ceph version string: 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
  • Platform (OS/distro/release): Linux/Debian/11.1 (bullseye)
  • Cluster details (nodes, monitors, OSDs): 3, 3, 3

How reproducible

Always reproducible

Actual results

2021-10-14T10:22:21.777354+0100 mgr.medlar [ERR] Unhandled exception from module 'prometheus' while running on mgr.medlar: OSError("Port 9283 not bound on 'fec0:bbbb::5'")

Expected results

mgr standby -> active promotion should not raise MGR_MODULE_ERROR

Additional info

Looking at the code (pacific branch), I think this is caused by two different sets of logic being used to derive the address to listen on for the prometheus exporter.

In the prometheus module, Module->serve() uses:

server_addr = cast(str, self.get_localized_module_option(
    'server_addr', get_default_addr()))

... but then later conditionally changes this:

if server_addr in ['::', '0.0.0.0']:
    server_addr = self.get_mgr_ip()
self.set_uri(build_url(scheme='http', host=server_addr, port=server_port, path='/'))

whereas StandbyModule->serve() does only the following, without the later test:

server_addr = self.get_localized_module_option(
    'server_addr', get_default_addr())

In my config I have:
ceph config-key get config/mgr/mgr/prometheus/server_addr : 0.0.0.0

and ceph config show mgr.nectarine : public_addr v2:[fec0:bbbb::5]:0/0

So I think that switching from standby mode to active causes the mgr to try to listen on the specific mgr IP address, but since the port is already bound on the "any" address, this fails.
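The suspected conflict can be illustrated outside Ceph with plain sockets. This is only a sketch under stated assumptions: it uses the standard library rather than the mgr's actual HTTP server, an ephemeral port rather than 9283, and the IPv4 loopback address as a stand-in for the mgr IP. The function name `specific_bind_conflicts` is hypothetical.

```python
import socket


def specific_bind_conflicts(specific_addr: str = "127.0.0.1") -> bool:
    """Return True if binding specific_addr fails while the wildcard holds the port.

    Hypothetical helper for illustration; not part of the prometheus module.
    """
    wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Stand-in for the standby listener: bind the "any" address.
        # Port 0 lets the kernel pick a free port (assumption: the real
        # case uses 9283, which may be busy on the test machine).
        wildcard.bind(("0.0.0.0", 0))
        wildcard.listen(1)
        port = wildcard.getsockname()[1]

        # Stand-in for the active listener: same port, specific address.
        specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            specific.bind((specific_addr, port))
        except OSError:
            # Typically EADDRINUSE: the wildcard socket already owns the port.
            return True
        finally:
            specific.close()
        return False
    finally:
        wildcard.close()


if __name__ == "__main__":
    print(specific_bind_conflicts())
```

On Linux the second bind is expected to fail for as long as the wildcard listener stays open, which matches the behaviour seen on promotion if the standby listener has not been released first.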

Probably the prometheus exporter should bind to the same IP address regardless of its active/standby state, and the conditional test which is applied in the active state should also occur for the standby state.
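A minimal sketch of that suggestion, assuming both serve() methods can call a shared helper. The names `resolve_server_addr` and `WILDCARD_ADDRS` are hypothetical and do not exist in the module; `get_mgr_ip` is passed in as a callable so the sketch runs without a mgr.

```python
from typing import Callable

# The two wildcard values tested by the active module's serve().
WILDCARD_ADDRS = ('::', '0.0.0.0')


def resolve_server_addr(configured_addr: str,
                        get_mgr_ip: Callable[[], str]) -> str:
    """Derive the listen address the same way for active and standby.

    Hypothetical helper: a wildcard setting is replaced by the mgr IP,
    any other configured address is used unchanged.
    """
    if configured_addr in WILDCARD_ADDRS:
        return get_mgr_ip()
    return configured_addr


if __name__ == "__main__":
    # Values taken from the configuration quoted in this report.
    print(resolve_server_addr('0.0.0.0', lambda: 'fec0:bbbb::5'))
```

With a helper like this, Module->serve() and StandbyModule->serve() would both pass the value of 'server_addr' through the same function, so the address bound in standby would be the one requested again after promotion.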

Actions #1

Updated by Neha Ojha over 2 years ago

  • Category changed from ceph-mgr to prometheus module