Bug #56508

haproxy check fails for ceph-grafana service

Added by Francesco Pantano 5 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Category: cephadm
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If OSP is deployed with ceph-dashboard, multiple ceph-dashboard services are deployed and placed behind haproxy; one of these services is grafana.

The following haproxy configuration is generated for grafana on OSP:

listen ceph_grafana
bind 192.168.24.71:3100 transparent ssl crt /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem
mode http
balance source
http-request set-header X-Forwarded-Proto https if { ssl_fc }
http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
http-request set-header X-Forwarded-Port %[dst_port]
option httpchk HEAD /
option httplog
option forwardfor
server central-controller-0.storage.redhat.local 172.23.1.55:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-0.storage.redhat.local
server central-controller-1.storage.redhat.local 172.23.1.124:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-1.storage.redhat.local
server central-controller-2.storage.redhat.local 172.23.1.243:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-2.storage.redhat.local

The haproxy configuration for the grafana service seems to be correct, and haproxy checks the service backends regularly.

The problem seems to be that the check fails: the grafana service logs the following errors every 2 seconds:
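For reference, the check parameters in the server lines above (`check fall 5 inter 2000 rise 2`) can be turned into rough timings. The following is a minimal sketch, not part of the original report; the helper names are hypothetical, and it assumes `inter` is given as a plain number of milliseconds (haproxy's default unit) with no `fastinter`/`downinter` overrides.

```python
import re


def parse_check_params(server_line: str) -> dict:
    """Extract the integer inter/fall/rise values from an haproxy server line."""
    params = {}
    for name in ("inter", "fall", "rise"):
        match = re.search(rf"\b{name}\s+(\d+)\b", server_line)
        if match:
            params[name] = int(match.group(1))
    return params


def check_timings(params: dict) -> dict:
    """Approximate how long state changes take, ignoring check timeouts."""
    inter_s = params["inter"] / 1000.0  # assumption: value is in milliseconds
    return {
        "interval_s": inter_s,
        # 'fall' consecutive failed checks mark the server down
        "approx_seconds_to_down": params["fall"] * inter_s,
        # 'rise' consecutive successful checks mark it up again
        "approx_seconds_to_up": params["rise"] * inter_s,
    }


# Server line copied from the configuration in this report.
SERVER_LINE = (
    "server central-controller-0.storage.redhat.local 172.23.1.55:3100 "
    "ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 "
    "ssl verify required verifyhost central-controller-0.storage.redhat.local"
)

if __name__ == "__main__":
    print(check_timings(parse_check_params(SERVER_LINE)))
```

With these values each haproxy instance probes every backend once every 2 seconds, which is consistent with the cadence of the errors reported by grafana.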

2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.243:56364: remote error: tls: internal error
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.55:52296: remote error: tls: internal error
2022/06/21 12:36:01 http: TLS handshake error from 172.23.1.124:52898: remote error: tls: internal error

The reason is that the grafana server containers on all the controller nodes (in my case grafana is deployed on the controllers) have the same SSL certificate and key deployed in /etc/grafana/certs/cert_file|key.

The haproxy check succeeds against grafana on controller-0 but fails against the other grafana backends, because every grafana container has the certificate generated for controller-0 deployed in /etc/grafana/certs/cert_file|key.
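The mismatch can be illustrated with a small sketch of the name comparison that `verifyhost` implies. This is a simplified illustration, not haproxy's actual implementation: the function names are hypothetical, the matcher only handles exact names and a `*` as the whole leftmost label, and it assumes the shared certificate carries only controller-0's FQDN.

```python
def hostname_matches(pattern: str, hostname: str) -> bool:
    """Simplified DNS-name match: case-insensitive, '*' only as the leftmost label."""
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")
    if len(pattern_labels) != len(host_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard covers exactly one non-empty label.
        return host_labels[0] != "" and pattern_labels[1:] == host_labels[1:]
    return pattern_labels == host_labels


def backends_failing_verifyhost(cert_dns_names, backends):
    """Return the backends whose expected hostname is not covered by the certificate."""
    return [
        backend
        for backend in backends
        if not any(hostname_matches(name, backend) for name in cert_dns_names)
    ]


# verifyhost values taken from the haproxy configuration in this report.
BACKENDS = [
    "central-controller-0.storage.redhat.local",
    "central-controller-1.storage.redhat.local",
    "central-controller-2.storage.redhat.local",
]

# Assumption: the certificate shared by all containers names only controller-0.
SHARED_CERT_NAMES = ["central-controller-0.storage.redhat.local"]

if __name__ == "__main__":
    for backend in backends_failing_verifyhost(SHARED_CERT_NAMES, BACKENDS):
        print("verifyhost would fail for", backend)
```

Under that assumption only controller-0 passes the check, which matches the observed behaviour. A certificate name such as `*.storage.redhat.local` (hypothetical) would match all three backends in this sketch.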

The container's file /etc/grafana/certs/cert_file is bind-mounted from /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_key on the hosts, and this file is identical on all the hosts, whereas the certificates in /etc/pki/tls/certs/ceph_grafana.crt are different and correctly generated for each host.

If I copy /etc/pki/tls/certs/ceph_grafana.crt to /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_file and restart the grafana containers on all hosts, the haproxy check starts to succeed.

This seems to be a side effect of the transition from ceph-ansible to cephadm: ceph-ansible used to configure the grafana containers via [1], and the template [2] references the certificate generated for that node; the certificate was also copied through [3], and /etc/grafana is mounted (-v /etc/grafana) when the container starts.
This ensures that the right certificate is always present on the node where grafana is started.
However, cephadm is spec driven, and there is no logic to reference a different certificate per instance, because the certificate is a config-key within the cluster [4] and is global for all the grafana instances.
This should be addressed by cephadm, since it allows deploying multiple grafana instances on multiple nodes, but I'm not sure that is currently supported.

[1] https://github.com/ceph/ceph-ansible/tree/main/roles/ceph-grafana
[2] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-grafana/templates/grafana.ini.j2#L19-L20
[3] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-grafana/tasks/configure_grafana.yml#L73-L95
[4] https://docs.ceph.com/en/latest/cephadm/services/monitoring/#configuring-ssl-tls-for-grafana


Related issues

Related to Orchestrator - Documentation #47637: mgr/cephadm: document how to configure custom TLS certificate for Grafana (Resolved)
Copied to Orchestrator - Backport #57383: quincy: haproxy check fails for ceph-grafana service (Resolved)
Copied to Orchestrator - Backport #57384: pacific: haproxy check fails for ceph-grafana service (Resolved)

History

#1 Updated by Ernesto Puerta 5 months ago

Couldn't this be solved by generating wildcard certificates (*.grafana.<domain>) and some kind of hostname resolution (1.grafana.<domain>, 2.grafana.<domain>, ...)?

Maybe it's time to reconsider/reevaluate some kind of internal FQDN/hostname addressing? Maybe the podman /etc/hosts management is now more consistent with Docker's than a year ago, or both have converged?

#2 Updated by Redouane Kachach Elhichou 5 months ago

I changed the cephadm code in the following PR:

https://github.com/ceph/ceph/pull/47098

to store the grafana cert/key per node. Now, instead of using the same path mgr/cephadm/grafana_crt for all the nodes, we store a different cert/key for each node using the path mgr/cephadm/<hostname>/grafana_crt
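The difference between the two key layouts described above can be sketched as follows. This is only an illustration of the path scheme quoted in this comment, not the code from the PR; the function name and the short hostnames are hypothetical, and whether `<hostname>` is the short name or the FQDN is not stated here.

```python
from typing import Optional

CONFIG_KEY_PREFIX = "mgr/cephadm"


def grafana_crt_config_key(hostname: Optional[str] = None) -> str:
    """Build the config-key path for the grafana certificate.

    Without a hostname this returns the single global path used before the
    change; with a hostname it returns the per-node path.
    """
    if hostname:
        return f"{CONFIG_KEY_PREFIX}/{hostname}/grafana_crt"
    return f"{CONFIG_KEY_PREFIX}/grafana_crt"


if __name__ == "__main__":
    # Hypothetical hostnames, used only to show the resulting paths.
    hosts = ["central-controller-0", "central-controller-1", "central-controller-2"]
    print("global  :", grafana_crt_config_key())
    for host in hosts:
        print("per node:", grafana_crt_config_key(host))
```

With one key per host, each grafana instance can be given a certificate whose name matches the `verifyhost` value haproxy expects for that backend.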

Since I can't reproduce the issue in my environment, it would be great if you could take the code of this PR and test it in yours. I'll follow up with any support you need to test it or fix issues.

#3 Updated by Redouane Kachach Elhichou 5 months ago

  • Related to Documentation #47637: mgr/cephadm: document how to configure custom TLS certificate for Grafana added

#4 Updated by Redouane Kachach Elhichou 4 months ago

  • Status changed from New to Need More Info
  • Assignee set to Redouane Kachach Elhichou

#5 Updated by Ilya Dryomov 4 months ago

  • Target version changed from v16.2.10 to v16.2.11

#6 Updated by Francesco Pantano 4 months ago

Hi Redouane,
thanks for this change, just left a comment on the associated PR.
We don't have many cycles at the moment to help test this particular change; however, if you can point me to a -pending/experimental ceph container that contains this fix, I'll find some time to fix [1] accordingly and test your change with a TripleO-deployed Ceph cluster.

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/monitoring.yaml#L55

#7 Updated by Adam King 3 months ago

  • Backport set to quincy, pacific

#8 Updated by Adam King 3 months ago

  • Status changed from Need More Info to Pending Backport
  • Pull request ID set to 47098

#9 Updated by Backport Bot 3 months ago

  • Copied to Backport #57383: quincy: haproxy check fails for ceph-grafana service added

#10 Updated by Backport Bot 3 months ago

  • Copied to Backport #57384: pacific: haproxy check fails for ceph-grafana service added

#11 Updated by Backport Bot 3 months ago

  • Tags set to backport_processed

#12 Updated by Adam King 2 months ago

  • Status changed from Pending Backport to Resolved
