Bug #56508: haproxy check fails for ceph-grafana service - Orchestrator - Ceph

Bug #56508

closed

haproxy check fails for ceph-grafana service

Added by Francesco Pantano almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: Redouane Kachach Elhichou
Category: cephadm
Target version: Ceph - v16.2.11
% Done:
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: Yes
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 47098
Crash signature (v1):
Crash signature (v2):

Description

When OSP is deployed with ceph-dashboard, multiple ceph-dashboard services are deployed and placed behind haproxy; one of these services is grafana.

The following haproxy configuration is generated for grafana on OSP:

listen ceph_grafana
    bind 192.168.24.71:3100 transparent ssl crt /etc/pki/tls/certs/haproxy/overcloud-haproxy-storage.pem
    mode http
    balance source
    http-request set-header X-Forwarded-Proto https if { ssl_fc }
    http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
    http-request set-header X-Forwarded-Port %[dst_port]
    option httpchk HEAD /
    option httplog
    option forwardfor
    server central-controller-0.storage.redhat.local 172.23.1.55:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-0.storage.redhat.local
    server central-controller-1.storage.redhat.local 172.23.1.124:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-1.storage.redhat.local
    server central-controller-2.storage.redhat.local 172.23.1.243:3100 ca-file /etc/ipa/ca.crt check fall 5 inter 2000 rise 2 ssl verify required verifyhost central-controller-2.storage.redhat.local

The haproxy configuration for the grafana service seems to be correct, and haproxy checks the service backends regularly.

The problem seems to be that the check fails; the grafana service logs the following errors every 2 seconds:

2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.243:56364: remote error: tls: internal error
2022/06/21 12:36:00 http: TLS handshake error from 172.23.1.55:52296: remote error: tls: internal error
2022/06/21 12:36:01 http: TLS handshake error from 172.23.1.124:52898: remote error: tls: internal error

The reason is that the grafana server containers on all the controller nodes (in my case grafana is deployed on the controllers) have the same SSL certificate and key deployed in /etc/grafana/certs/cert_file|key.

The haproxy check succeeds against grafana on controller-0 but fails against the other grafana backends, because every grafana container has the certificate that was generated for controller-0 deployed in /etc/grafana/certs/cert_file|key.
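To illustrate why only one backend passes, here is a minimal Python sketch of the name comparison that `verifyhost` implies. It is not haproxy's implementation; the helper `cert_matches_host` is hypothetical, and it assumes the shared certificate names only central-controller-0, as described above.

```python
def cert_matches_host(cert_dns_names, expected_host):
    """Return True if any DNS name in the certificate equals expected_host.

    Hypothetical helper: exact, case-insensitive comparison only.
    """
    expected = expected_host.lower().rstrip(".")
    return any(name.lower().rstrip(".") == expected for name in cert_dns_names)


# Hostnames taken from the `server` lines in the haproxy configuration above.
backends = [
    "central-controller-0.storage.redhat.local",
    "central-controller-1.storage.redhat.local",
    "central-controller-2.storage.redhat.local",
]

# Assumption: the one certificate deployed everywhere names only controller-0.
shared_cert_names = ["central-controller-0.storage.redhat.local"]

# haproxy expects each backend to present a certificate for its own hostname.
check_results = {
    host: cert_matches_host(shared_cert_names, host) for host in backends
}

for host, ok in check_results.items():
    print(f"{host}: {'check passes' if ok else 'verifyhost mismatch'}")
```

With a per-host certificate on each backend, every entry in `check_results` would be true.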

The container's file /etc/grafana/certs/cert_file is bind-mounted from /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_key on the hosts, and that file is identical on all the hosts, whereas the certificates in /etc/pki/tls/certs/ceph_grafana.crt are different and correctly generated for each host.

If I copy /etc/pki/tls/certs/ceph_grafana.crt to /var/lib/ceph/d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba/grafana.central-controller-*/etc/grafana/certs/cert_file and restart the grafana containers on all hosts, the haproxy check starts to succeed.
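The manual workaround above could be scripted roughly as follows, run once per host. This is only a sketch: the function name `install_host_cert`, its parameters, and the `grafana.*` glob are assumptions made for illustration, and restarting the grafana containers is left out because the exact command depends on the deployment.

```python
import glob
import os
import shutil


def install_host_cert(fsid, host_cert, ceph_root="/var/lib/ceph",
                      daemon_glob="grafana.*"):
    """Copy host_cert over cert_file in every local grafana daemon directory.

    Returns the list of files that were overwritten.
    """
    pattern = os.path.join(
        ceph_root, fsid, daemon_glob, "etc", "grafana", "certs", "cert_file"
    )
    updated = []
    # Only daemon directories that already contain a cert_file are touched.
    for dest in sorted(glob.glob(pattern)):
        shutil.copyfile(host_cert, dest)
        updated.append(dest)
    return updated
```

On a controller it would be called as `install_host_cert("d5c621ae-ec54-5b9d-910d-b8dba8e6b5ba", "/etc/pki/tls/certs/ceph_grafana.crt")`. Only the certificate is copied here, since that is the only file the workaround mentions.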

This seems to be a side effect of the transition from ceph-ansible to cephadm: ceph-ansible used to configure the grafana containers via [1], and the template [2] references the certificate generated for that node; the certificate was also copied through [3], and /etc/grafana is mounted (-v /etc/grafana) when the container starts.
Together, this ensures that the right certificate is always present on the node where grafana is started.
However, cephadm is spec driven, and there is no logic to reference a different certificate per instance, because the certificate is a config-key within the cluster [4] and is global to all the grafana instances.
This should be addressed by cephadm, since it lets you deploy multiple grafana instances on multiple nodes, but I'm not sure this is currently supported.

[1] https://github.com/ceph/ceph-ansible/tree/main/roles/ceph-grafana
[2] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-grafana/templates/grafana.ini.j2#L19-L20
[3] https://github.com/ceph/ceph-ansible/blob/main/roles/ceph-grafana/tasks/configure_grafana.yml#L73-L95
[4] https://docs.ceph.com/en/latest/cephadm/services/monitoring/#configuring-ssl-tls-for-grafana

Related issues 3 (0 open — 3 closed)

Updated by Ernesto Puerta almost 2 years ago

Couldn't this be solved by generating wildcard certificates (*.grafana.<domain>) and some kind of hostname resolution (1.grafana.<domain>, 2.grafana.<domain>, ...)?

Maybe it's time to reconsider/reevaluate some kind of internal FQDN/hostname addressing? Maybe podman's /etc/hosts management is now more consistent with Docker's than it was a year ago, or the two have converged?
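As a rough illustration of the wildcard idea, the sketch below shows how a name such as `*.grafana.<domain>` would match per-instance hostnames. The function `wildcard_matches` is hypothetical, `example.com` stands in for `<domain>`, and the rule used is the common one where the wildcard covers exactly one leftmost label.

```python
def wildcard_matches(pattern, hostname):
    """Match hostname against pattern; a leading '*' label matches one label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    # A wildcard never spans several labels, so the label counts must agree.
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


# example.com is a placeholder for <domain>.
pattern = "*.grafana.example.com"
for name in ("1.grafana.example.com", "2.grafana.example.com", "grafana.example.com"):
    print(name, wildcard_matches(pattern, name))
```

A single certificate carrying that wildcard name could then be shared by all instances, provided each backend is addressed by a hostname of that form.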

Updated by Redouane Kachach Elhichou almost 2 years ago

I changed the cephadm code in the following PR:

https://github.com/ceph/ceph/pull/47098

to store the grafana cert/key per node. Now, instead of using the same path mgr/cephadm/grafana_crt for all the nodes, we store a separate cert/key for each node under the path mgr/cephadm/<hostname>/grafana_crt.
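The key layout can be pictured with a small Python sketch. It is not the code from the PR: the helpers `grafana_cert_key` and `lookup_grafana_cert` are hypothetical, a plain dict stands in for the cluster's config-key store, and the hostnames and certificate values are placeholders.

```python
def grafana_cert_key(hostname=None):
    """Build the config-key path for the grafana certificate.

    With no hostname this is the single global path; with a hostname it is
    the per-node path described above.
    """
    if hostname is None:
        return "mgr/cephadm/grafana_crt"
    return f"mgr/cephadm/{hostname}/grafana_crt"


def lookup_grafana_cert(store, hostname):
    """Return the certificate stored for hostname, or None if there is none."""
    return store.get(grafana_cert_key(hostname))


# A dict standing in for the config-key store (placeholder values).
store = {
    grafana_cert_key("central-controller-0"): "<cert for controller-0>",
    grafana_cert_key("central-controller-1"): "<cert for controller-1>",
}

print(grafana_cert_key())                        # mgr/cephadm/grafana_crt
print(grafana_cert_key("central-controller-1"))  # mgr/cephadm/central-controller-1/grafana_crt
print(lookup_grafana_cert(store, "central-controller-1"))
```

Because the hostname is part of the key, each grafana instance can be handed the certificate generated for its own node.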

Since I can't reproduce the issue in my environment, it would be great if you could take the code from this PR and test it in yours. I'll follow up with any support you need to test it or fix issues.

Updated by Redouane Kachach Elhichou almost 2 years ago

Related to Documentation #47637: mgr/cephadm: document how to configure custom TLS certificate for Grafana added

Updated by Redouane Kachach Elhichou almost 2 years ago

Status changed from New to Need More Info
Assignee set to Redouane Kachach Elhichou

Updated by Ilya Dryomov almost 2 years ago

Target version changed from v16.2.10 to v16.2.11

Updated by Francesco Pantano almost 2 years ago

Hi Redouane,
thanks for this change; I just left a comment on the associated PR.
We don't have many cycles at the moment to help test this particular change; however, if you can point me to a -pending/experimental ceph container that contains this fix, I'll find some time to fix [1] accordingly and test your change with a TripleO-deployed Ceph cluster.

[1] https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/roles/tripleo_cephadm/tasks/monitoring.yaml#L55
