Bug #57910
open
ingress: HAProxy fails to start because keepalived IP address not yet available on new cluster
Description
After deploying a new cluster, HAProxy sometimes fails to start on ingress nodes:
Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : Starting frontend stats: cannot bind socket (Cannot assign requested address) [192.0.2.1:1967]
Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : Starting frontend frontend: cannot bind socket (Cannot assign requested address) [192.0.2.1:443]
Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
Oct 21 11:30:17 ingress02 systemd[1]: Started Ceph haproxy.rgw.pool.ingress02.bqqauq for 626a6e5e-5121-11ed-aa72-000c296ddf79.
It can be seen from the log that the keepalived container starts only after the HAProxy container. So naturally the floating (keepalived) IP address, 192.0.2.1 here, is not yet available to HAProxy at this point.
Maybe the keepalived container should be started before HAProxy. This is most likely a side effect or regression of https://tracker.ceph.com/issues/53684: before that change, HAProxy bound to any address and did not require the keepalived IP address.
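The bind failure in the log above can be reproduced outside HAProxy. The following is a minimal sketch, not part of cephadm or HAProxy; the function name can_bind is hypothetical. It assumes a Linux host where net.ipv4.ip_nonlocal_bind is left at its default of 0, so that binding to an address not configured on any interface fails with EADDRNOTAVAIL ("Cannot assign requested address").

```python
import errno
import socket


def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to ip:port on this host.

    Hypothetical helper for illustration only. Port 0 asks the kernel for
    any free port, so the result depends only on whether the address is
    currently configured on a local interface.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The condition HAProxy reports as
                # "cannot bind socket (Cannot assign requested address)".
                return False
            raise  # Any other error is unrelated to the missing address.
        return True
```

Calling can_bind("192.0.2.1") on an ingress node before keepalived has added the virtual IP would be expected to return False, and True once the address is present.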
Updated by Voja Molani over 1 year ago
This also happens (sometimes?) after re-provisioning an ingress server. After the OS is installed and cephadm configures the server for the first time, it attempts to start HAProxy before keepalived, and HAProxy then fails with the error above.
Updated by Redouane Kachach Elhichou over 1 year ago
In the current design of the ingress service, keepalived starts after haproxy because the keepalived daemon depends on haproxy: it needs haproxy's health endpoint for its alive-check script (see below). Thus, we can't start keepalived before haproxy. If the error reported above is persistent, then this can be considered a bug; however, if the error clears (haproxy gets deployed correctly) once keepalived is up and running, then I think this shouldn't be an issue.
    # script to monitor health
    script = '/usr/bin/false'
    for d in daemons:
        if d.hostname == host:
            if d.daemon_type == 'haproxy':
                assert d.ports
                port = d.ports[1]  # monitoring port
                script = f'/usr/bin/curl {build_url(scheme="http", host=d.ip or "localhost", port=port)}/health'
    assert script
This generates the following keepalived.conf section:
    vrrp_script check_backend {
        script "/usr/bin/curl http://localhost:9049/health"
        weight -20
        interval 2
        rise 2
        fall 2
    }
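To make the dependency concrete, here is a self-contained sketch of how a health script like the one above could be selected and rendered into a vrrp_script block. It is an illustration, not cephadm's code: the Daemon dataclass, health_script and render_vrrp_script are hypothetical names, the URL is built with a plain f-string instead of build_url, and the port values are examples only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Daemon:
    """Hypothetical stand-in for the daemon objects in the snippet above."""
    hostname: str
    daemon_type: str
    ports: List[int]
    ip: Optional[str] = None


def health_script(daemons: List[Daemon], host: str) -> str:
    """Pick the keepalived check command for the haproxy daemon on `host`."""
    # With no haproxy on this host, the check always fails.
    script = "/usr/bin/false"
    for d in daemons:
        if d.hostname == host and d.daemon_type == "haproxy":
            assert d.ports
            port = d.ports[1]  # second port is the monitoring port
            script = f"/usr/bin/curl http://{d.ip or 'localhost'}:{port}/health"
    return script


def render_vrrp_script(script: str) -> str:
    """Render the check command into a keepalived vrrp_script block."""
    return (
        "vrrp_script check_backend {\n"
        f'    script "{script}"\n'
        "    weight -20\n"
        "    interval 2\n"
        "    rise 2\n"
        "    fall 2\n"
        "}\n"
    )
```

The point of the sketch is that the keepalived configuration cannot be written until the haproxy daemon's monitoring port is known, which is why keepalived is deployed second.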
Updated by Voja Molani over 1 year ago
The HAProxy service does not start until it is manually started or the server is restarted.
The root of the problem is the recent change to bind HAProxy to the IP address that keepalived creates rather than to "*". If the keepalived configuration is in fact derived from HAProxy, this has created a circular dependency.
I am pretty sure I saw systemd trying to restart the HAProxy container a few times, which of course failed immediately because of the missing IP address. Maybe the systemd unit's restart parameters could be tuned to give the keepalived container time to start in the meantime? Currently systemd tries to restart it a few times and then gives up because the unit exits immediately.
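The idea of giving keepalived time to bring up the address can be sketched as a polling loop with a delay between attempts. This is only an illustration of the retry-with-delay approach, not how systemd or cephadm behaves today; the function name wait_until_bindable and the default timeout and interval values are assumptions.

```python
import errno
import socket
import time


def wait_until_bindable(ip: str, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll until `ip` can be bound locally, or give up after `timeout` seconds.

    Hypothetical helper: binds to port 0 (any free port) so that only the
    presence of the address matters, not whether a service port is in use.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
                sock.bind((ip, 0))
            return True  # The address is configured on this host.
        except OSError as exc:
            if exc.errno != errno.EADDRNOTAVAIL:
                raise  # Unrelated failure; do not keep retrying.
        if time.monotonic() >= deadline:
            return False
        # Wait before the next attempt instead of failing immediately,
        # which is what exhausts the restart attempts described above.
        time.sleep(interval)
```

Something like wait_until_bindable("192.0.2.1") run before HAProxy starts would delay the start until the virtual IP exists. A longer delay between restarts in the unit itself would have a similar effect.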
Updated by Voja Molani over 1 year ago
I believe this is a duplicate of https://tracker.ceph.com/issues/57563