Bug #57910: ingress: HAProxy fails to start because keepalived IP address not yet available on new cluster - Orchestrator - Ceph

Added by Voja Molani over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: Adam King
Category: cephadm
Regression: Yes
Severity: 3 - minor
Affected Versions: Ceph - v17.2.5

Description

After deploying a new cluster, HAProxy sometimes fails to start on ingress nodes:

Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : Starting frontend stats: cannot bind socket (Cannot assign requested address) [192.0.2.1:1967]
Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : Starting frontend frontend: cannot bind socket (Cannot assign requested address) [192.0.2.1:443]
Oct 21 11:30:17 ingress02 ceph-626a6e5e-5121-11ed-aa72-000c296ddf79-haproxy-rgw-pool-ingress02-bqqauq[48021]: [ALERT] 293/093017 (2) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
Oct 21 11:30:17 ingress02 systemd[1]: Started Ceph haproxy.rgw.pool.ingress02.bqqauq for 626a6e5e-5121-11ed-aa72-000c296ddf79.

The log shows that the keepalived container starts only after the HAProxy container, so the floating (keepalived) IP address, 192.0.2.1 here, is not yet available when HAProxy tries to bind to it.
Maybe the keepalived container should be started before HAProxy. This is most likely a side effect/regression of https://tracker.ceph.com/issues/53684: before that change, HAProxy bound to any address and did not require the keepalived IP address.
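
The bind failure in the log can be reproduced outside HAProxy. The following is a minimal sketch, not taken from cephadm or HAProxy: the helper name `can_bind` is hypothetical, and it assumes a Linux host where 192.0.2.1 is not assigned to any interface and `net.ipv4.ip_nonlocal_bind` is left at its default of 0.

```python
import errno
import socket


def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to ip:port on this host.

    Hypothetical helper for illustration only. Port 0 lets the kernel
    pick a free port, so the check does not collide with real services.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # The condition HAProxy reports as
                # "Cannot assign requested address".
                return False
            raise
        return True


if __name__ == "__main__":
    # The wildcard address does not depend on any specific IP being present.
    print("wildcard bindable:", can_bind("0.0.0.0"))
    # The floating IP is bindable only once keepalived has added it.
    print("VIP bindable:", can_bind("192.0.2.1"))
```

Binding to the wildcard address succeeds regardless of which addresses are configured, which matches the earlier behaviour described above; binding to a specific address depends on keepalived having already assigned it.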

Updated by Adam King over 1 year ago

Assignee set to Adam King

Updated by Voja Molani over 1 year ago

This also happens (sometimes?) after re-provisioning an ingress server. After the OS is installed, when cephadm configures the server for the first time, it attempts to start HAProxy before keepalived, and HAProxy then fails.

Updated by Redouane Kachach Elhichou over 1 year ago

In the current design of the ingress service, keepalived starts after haproxy because the keepalived daemon depends on haproxy: it uses haproxy's health check for its alive check (see below). Thus, we can't start keepalived before haproxy. If the error reported above is persistent, then this can be considered a bug; however, if the error goes away (haproxy gets deployed correctly) once keepalived is up and running, then I think this shouldn't be an issue.


        # script to monitor health
        script = '/usr/bin/false'
        for d in daemons:
            if d.hostname == host:
                if d.daemon_type == 'haproxy':
                    assert d.ports
                    port = d.ports[1]   # monitoring port
                    script = f'/usr/bin/curl {build_url(scheme="http", host=d.ip or "localhost", port=port)}/health'
        assert script

This generates the following keepalived.conf section:

vrrp_script check_backend {
    script "/usr/bin/curl http://localhost:9049/health" 
    weight -20
    interval 2
    rise 2
    fall 2
}
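
To make the mapping from the Python snippet to the generated section explicit, here is a standalone sketch. It is not the cephadm template: `render_vrrp_script` is a hypothetical name, the URL is passed in ready-made rather than built with `build_url`, and the `weight`, `interval`, `rise` and `fall` values are simply copied from the section shown above.

```python
from typing import Optional


def render_vrrp_script(monitor_url: Optional[str] = None) -> str:
    """Render a keepalived vrrp_script block (illustrative only).

    monitor_url is the base URL of the local haproxy monitoring port,
    e.g. "http://localhost:9049". When no haproxy daemon is known on the
    host, the check falls back to /usr/bin/false, as in the snippet above.
    """
    if monitor_url:
        script = f"/usr/bin/curl {monitor_url}/health"
    else:
        script = "/usr/bin/false"
    return (
        "vrrp_script check_backend {\n"
        f'    script "{script}"\n'
        "    weight -20\n"
        "    interval 2\n"
        "    rise 2\n"
        "    fall 2\n"
        "}\n"
    )


if __name__ == "__main__":
    print(render_vrrp_script("http://localhost:9049"))
```

Because the check script is derived from the haproxy daemon's monitoring port, the keepalived configuration cannot be generated until that daemon is known, which is the ordering constraint described above.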

Updated by Voja Molani over 1 year ago

The HAProxy service does not start until it is started manually or the server is restarted.

The root of the problem is the recent change to bind HAProxy to the IP address that keepalived creates instead of to "*". This creates a circular dependency if the keepalived configuration really does depend on HAProxy.

I am pretty sure I saw systemd try to restart the HAProxy container a few times, which of course failed immediately because of the missing IP address. Maybe the systemd unit's restart parameters could be tuned to give the keepalived container time to start in the meantime? Currently systemd tries to restart it a few times but then gives up because the unit exits immediately.
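
The effect of the restart parameters can be illustrated with a rough model. This is a simplified sketch, not systemd's implementation: `simulate_restarts` is a hypothetical function, the start limit is modelled as a fixed window, the unit is assumed to exit instantly when the IP is missing, and the numbers are illustrative rather than the values in the cephadm-generated unit. The parameters correspond loosely to systemd's `RestartSec=`, `StartLimitBurst=` and `StartLimitIntervalSec=`.

```python
from typing import Optional


def simulate_restarts(
    vip_ready_at: float,
    restart_sec: float,
    burst: int,
    interval: float,
) -> Optional[float]:
    """Return the time of the first successful start, or None if the
    start limit is hit first (simplified model, see assumptions above).
    """
    if restart_sec < 0 or burst < 1 or interval <= 0:
        raise ValueError("restart_sec >= 0, burst >= 1, interval > 0 required")
    window_start = 0.0
    starts_in_window = 0
    now = 0.0
    while True:
        # Open a new rate-limit window once the old one has elapsed.
        if now - window_start >= interval:
            window_start = now
            starts_in_window = 0
        # Too many starts in this window: the unit is left failed.
        if starts_in_window >= burst:
            return None
        starts_in_window += 1
        # The start succeeds only if the floating IP already exists.
        if now >= vip_ready_at:
            return now
        # The unit exits immediately; the next attempt is restart_sec later.
        now += restart_sec


if __name__ == "__main__":
    # Short restart delay: all allowed attempts are used up before the IP appears.
    print(simulate_restarts(vip_ready_at=3.0, restart_sec=0.1, burst=5, interval=10.0))
    # Longer restart delay: a later attempt lands after the IP appears.
    print(simulate_restarts(vip_ready_at=3.0, restart_sec=2.0, burst=5, interval=10.0))
```

Under this model, a restart delay that is short relative to the time keepalived needs to bring up the address exhausts the allowed attempts, while a longer delay (or a larger burst) lets a later attempt succeed.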
