Retry binding on IPv6 address if not available
On systems with IPv6, the IPv6 address might not yet be available when a MON or OSD boots. This can have multiple causes:
- DAD still in progress (Duplicate Address Detection)
- SLAAC is still in progress (Stateless Address Autoconfiguration)
When an interface comes up, it can take a couple of seconds before IPv6 connectivity is available, or even before an address is assigned to the interface.
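To check whether an address is still in DAD, one option on Linux is to look at the per-address flags in /proc/net/if_inet6. The sketch below is only an illustration, not something Ceph does: the function names and the sample data are made up, and the flag values are taken from linux/if_addr.h.

```python
import ipaddress

# Flag values from linux/if_addr.h (Linux-specific).
IFA_F_DADFAILED = 0x08   # DAD found a duplicate, address is unusable
IFA_F_TENTATIVE = 0x40   # DAD still in progress, bind() will fail

def parse_if_inet6(text):
    """Parse the contents of /proc/net/if_inet6 into a list of dicts.

    Each line has six whitespace-separated fields:
    address (32 hex digits), ifindex, prefix length, scope, flags, ifname.
    All numeric fields are hexadecimal.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        addr_hex, _ifindex, prefix_len, _scope, flags, ifname = fields
        entries.append({
            "address": str(ipaddress.IPv6Address(bytes.fromhex(addr_hex))),
            "prefix_len": int(prefix_len, 16),
            "flags": int(flags, 16),
            "ifname": ifname,
        })
    return entries

def tentative_addresses(text):
    """Return (ifname, address) pairs that are still in DAD."""
    return [
        (e["ifname"], e["address"])
        for e in parse_if_inet6(text)
        if e["flags"] & IFA_F_TENTATIVE
    ]

# Hypothetical sample: one link-local address that is ready (flags 0x80,
# permanent) and one global address that is still tentative (flags 0x40).
SAMPLE = (
    "fe80000000000000021122fffe334455 02 40 20 80 eth0\n"
    "20010db8000000010000000000000010 02 40 00 40 eth0\n"
)
```

On a real system the input would come from reading /proc/net/if_inet6. An address that shows up in tentative_addresses() is assigned to the interface but cannot be bound yet.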
systemd/upstart/sysvinit will start the daemons as soon as they think the network is ready, but it might be that IPv6 is not configured yet.
Monitors and OSDs then fail to start: they can't bind to the IPv6 address, so they exit.
It would be useful if the daemons retried the bind after a couple of seconds:
1. Try to bind
2. If it fails, wait 5 seconds
3. Try to bind again
We might add a short loop here with a configurable delay and number of retries. That would make it flexible and useful for most situations.
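The loop described above could look roughly like this. It is a minimal sketch in Python rather than the actual change to the messenger code; the function name, the parameter names and the defaults (3 retries, 5 seconds) are assumptions for illustration. It only retries on EADDRNOTAVAIL, which on Linux is the "(99) Cannot assign requested address" error, and lets every other error through.

```python
import errno
import socket
import time

def bind_with_retry(sock, address, retries=3, delay=5.0, sleep=time.sleep):
    """Bind sock to address, retrying while the address is not available.

    retries: how many extra attempts to make after the first failure.
    delay:   seconds to wait between attempts.
    sleep:   injectable sleep function, so the loop can be tested quickly.

    Returns the number of retries that were needed. Re-raises the OSError
    if the address is still unavailable after the last attempt, or
    immediately for any other error (for example EADDRINUSE).
    """
    attempt = 0
    while True:
        try:
            sock.bind(address)
            return attempt
        except OSError as exc:
            # Only "address not available" is worth waiting for.
            if exc.errno != errno.EADDRNOTAVAIL or attempt >= retries:
                raise
            attempt += 1
            sleep(delay)
```

A caller would create the socket as usual, e.g. socket.socket(socket.AF_INET6, socket.SOCK_STREAM), and call bind_with_retry(sock, (addr, port)) instead of sock.bind(). Limiting the retry to a single errno keeps real configuration errors, such as a port that is already in use, failing fast.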
This only applies to IPv6 though, so only when 'ms_bind_ipv6' is set to true.
SimpleMessenger: Retry binding on addresses if binding fails
If binding on an IP address fails, delay and retry.
This happens mainly on IPv6 deployments. Due to DAD (Duplicate Address Detection)
or SLAAC it can be that IPv6 is not yet available when the daemons start.
Monitor daemons try to bind on a static IPv6 address. If that address is not
available yet, the monitor fails to start.
#1 Updated by Wido den Hollander over 4 years ago
I started playing with this a bit (no commits yet). I simply loop in SimpleMessenger's Accepter.cc and retry the bind a couple of times before giving up.
For IPv4 there is the net.ipv4.ip_nonlocal_bind sysctl, but that does not exist for IPv6.
A work-around would be to disable DAD on the interfaces, but that isn't the best way imho.
On the internet you find all kinds of posts where people run into this issue. It's not limited to Ceph, but the same goes for Nginx for example.
#2 Updated by Wido den Hollander about 4 years ago
Logs I'm seeing on a monitor when it boots:
2014-12-08 13:04:16.291838 7f1fd75ef7c0 0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-mon, pid 1897
2014-12-08 13:04:16.473408 7f1fd75ef7c0 0 starting mon.srv-51d5-11 rank 1 at [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789/0 mon_data /var/lib/ceph/mon/ceph-srv-51d5-11 fsid ada2c7ae-2483-4428-a159-1a20fe2a579d
2014-12-08 13:04:16.473445 7f1fd75ef7c0 -1 accepter.accepter.bind unable to bind to [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789: (99) Cannot assign requested address
2014-12-08 13:04:16.473457 7f1fd75ef7c0 -1 unable to bind monitor to [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789/0