Bug #54028

open

alertmanager clustering is not configured consistently

Added by Paul Cuzner about 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: cephadm/monitoring
Target version:
% Done: 0%
Source:
Tags:
Backport: quincy pacific
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After increasing the count for the alertmanager service, the number of alertmanager instances increases and Prometheus is updated to reflect this.

However, the alertmanager daemons are not correctly peered. Each daemon should have one --cluster.peer <addr> flag for each other peer in the cluster, but this is not the case.

After increasing the count from 1 to 3, the alertmanager command lines look like this:
host1 : /bin/alertmanager --cluster.listen-address=:9094 --web.listen-address=:9093 --config.file=/etc/alertmanager/alertmanager.yml
host2 : /bin/alertmanager --cluster.listen-address=:9094 --web.listen-address=:9093 --cluster.peer=172.16.37.35:9094 --config.file=/etc/alertmanager/alertmanager.yml
host3 : /bin/alertmanager --cluster.listen-address=:9094 --web.listen-address=:9093 --cluster.peer=172.16.37.35:9094 --cluster.peer=10.1.36.191:9094 --config.file=/etc/alertmanager/alertmanager.yml

Each instance should point to all of the other peers, but as this output shows:
  • host1 doesn't reference any peers at all
  • host2 only references one peer (in this case, host1's IP)
  • host3 gets it right!
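As a rough illustration of the expected configuration, the sketch below builds one `--cluster.peer` flag per *other* host. The function name `expected_peer_args` is hypothetical and is not cephadm's actual code. host1's address comes from the report and host2's is presumably the second peer on host3's line; host3's address is not shown anywhere, so `192.0.2.10` is a made-up documentation address. Port 9094 matches the `--cluster.listen-address` used above.

```python
from __future__ import annotations


def expected_peer_args(hosts: dict[str, str], port: int = 9094) -> dict[str, list[str]]:
    """Hypothetical helper: for each host, emit one --cluster.peer flag per other host."""
    return {
        host: [
            f"--cluster.peer={addr}:{port}"
            for other, addr in hosts.items()
            if other != host  # a daemon should not list itself as a peer
        ]
        for host in hosts
    }


# host3's address is a placeholder; the report does not show it.
hosts = {"host1": "172.16.37.35", "host2": "10.1.36.191", "host3": "192.0.2.10"}
for host, flags in expected_peer_args(hosts).items():
    print(host, " ".join(flags))
```

With this rule every daemon gets two peer flags, rather than zero, one and two as in the output above.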

Actions #1

Updated by Paul Cuzner about 2 years ago

  • Subject changed from "alertmanager clustering is not configured correctly" to "alertmanager clustering is not configured consistently"

Note that the cluster still forms, because of the peer settings on the other daemons.
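One way to see why the cluster still forms: if we assume that a join in either direction is enough for two members to learn about each other (Alertmanager's clustering is gossip-based, so membership then propagates), it suffices that the peer flags connect all hosts when treated as undirected edges. The sketch below checks this with a small union-find; `gossip_components` is a hypothetical helper, and the `observed` mapping is transcribed from the command lines in the description.

```python
from __future__ import annotations


def gossip_components(peers: dict[str, list[str]]) -> list[set[str]]:
    """Group hosts that are connected through --cluster.peer flags (edges treated as undirected)."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        parent[find(a)] = find(b)

    for node, targets in peers.items():
        find(node)  # register hosts that have no peers at all
        for target in targets:
            union(node, target)

    groups: dict[str, set[str]] = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Peers as reported: host1 has none, host2 -> host1, host3 -> host1 and host2.
observed = {"host1": [], "host2": ["host1"], "host3": ["host1", "host2"]}
print(gossip_components(observed))  # a single group containing all three hosts
```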

Actions #2

Updated by Redouane Kachach Elhichou almost 2 years ago

This happens because, when the daemons are scaled, we query which alertmanager daemons are already running and use them to populate the cluster peer info. As a result:
1- When the first daemon is deployed, no other daemon exists yet, so it doesn't get any peer info.
2- When the second daemon is deployed, it gets the address of #1 as its peer.
3- When the third daemon is deployed, it gets #1 and #2 as peers (this one gets it right because all of the other daemons are already running).
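The sequence above can be reproduced with a toy model in which each new daemon only receives the daemons that are already running as its peers. `deploy_incrementally` is a hypothetical name and this is a simplified sketch of the behaviour described, not the cephadm implementation.

```python
from __future__ import annotations


def deploy_incrementally(hosts: list[str]) -> dict[str, list[str]]:
    """Toy model: each daemon's peer list is a snapshot of the daemons running before it."""
    running: list[str] = []
    peers: dict[str, list[str]] = {}
    for host in hosts:
        peers[host] = list(running)  # copy, so later deployments don't change this daemon's flags
        running.append(host)
    return peers


print(deploy_incrementally(["host1", "host2", "host3"]))
# {'host1': [], 'host2': ['host1'], 'host3': ['host1', 'host2']}
```

The result matches the command lines in the description: zero, one and two peers respectively.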

I can't see an easy solution for this problem, as you can't "predict" on which host the daemon will end up running. Fortunately, there's a workaround for this issue, which basically consists of redeploying daemons #1 and #2.
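The workaround can be modelled the same way: once all daemons are running, a redeployed daemon computes its peers from every other running daemon. In a real cluster this would be done through the orchestrator (for example `ceph orch daemon redeploy <daemon-name>`); the `redeploy` function below is a hypothetical sketch of the resulting peer lists only, and the starting state is the one from the description.

```python
from __future__ import annotations


def redeploy(peers: dict[str, list[str]], to_redeploy: list[str]) -> dict[str, list[str]]:
    """Toy model: a redeployed daemon peers with every *other* daemon that is running now."""
    running = list(peers)  # all daemons are up by the time we redeploy
    fixed = {host: list(p) for host, p in peers.items()}
    for host in to_redeploy:
        fixed[host] = [h for h in running if h != host]
    return fixed


# State after scaling from 1 to 3, as reported in the description.
before = {"host1": [], "host2": ["host1"], "host3": ["host1", "host2"]}
after = redeploy(before, ["host1", "host2"])
print(after)
# {'host1': ['host2', 'host3'], 'host2': ['host1', 'host3'], 'host3': ['host1', 'host2']}
```

Only #1 and #2 need redeploying because #3 already saw the full set of daemons when it was deployed.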

Actions #3

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Assignee set to Redouane Kachach Elhichou

Actions #4

Updated by Redouane Kachach Elhichou over 1 year ago

  • Assignee deleted (Redouane Kachach Elhichou)