Feature #10029

Retry binding on IPv6 address if not available

Added by Wido den Hollander over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: msgr
Target version: -
Start date: 11/07/2014
Due date:
% Done: 0%
Spent time:
Source: other
Tags: ipv6
Backport:
Reviewed:
Affected Versions:

Description

On systems with IPv6, the IPv6 address may not yet be available when a MON or OSD boots.

This can have multiple causes:
  • DAD still in progress (Duplicate Address Detection)
  • SLAAC is still in progress (Stateless Address Autoconfiguration)

When an interface comes up it can take up to a couple of seconds before IPv6 connectivity is available or even an address is assigned to the interface.
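On Linux, one way to observe this window is to look at the address flags in /proc/net/if_inet6: an address that is still undergoing DAD carries the tentative flag. The following is a minimal sketch, assuming the usual six-column layout of that file and the IFA_F_TENTATIVE value 0x40 from the kernel headers; the function name tentative_addresses is made up for illustration.

```python
IFA_F_TENTATIVE = 0x40  # flag value from Linux <linux/if_addr.h>


def tentative_addresses(if_inet6_text):
    """Return (device, address) pairs that are still undergoing DAD.

    if_inet6_text is the content of /proc/net/if_inet6, where each line
    holds: address, ifindex, prefix length, scope, flags (all hex) and
    the device name.
    """
    pending = []
    for line in if_inet6_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        addr_hex, _ifindex, _plen, _scope, flags_hex, device = fields
        if int(flags_hex, 16) & IFA_F_TENTATIVE:
            # Re-insert the colons that the proc file omits.
            address = ":".join(addr_hex[i:i + 4] for i in range(0, 32, 4))
            pending.append((device, address))
    return pending
```

Feeding it the text read from /proc/net/if_inet6 shortly after an interface comes up should list the addresses that cannot be bound yet; an empty list means no address is in the tentative state.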

systemd/upstart/sysvinit will start the daemons as soon as they think the network is ready, but IPv6 might not be configured yet at that point.

Monitors and OSDs will then fail to start: they can't bind to the IPv6 address, so they exit.

It would be useful if the daemons retried the bind after a couple of seconds:

1. Try to bind
2. If it fails, wait 5 seconds
3. Try to bind again

We might add a short loop here with a configurable delay and number of retries; that would make it flexible and useful for most situations.

This only applies to IPv6 though, so only when 'ms_bind_ipv6' is set to true.
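The proposed loop can be sketched as follows. This is an illustration in Python, not the Ceph implementation: the names bind_with_retry, retries and delay are hypothetical and do not correspond to Ceph options. It assumes that only EADDRNOTAVAIL ("Cannot assign requested address") is worth retrying, since that is the error that clears once DAD or SLAAC finishes.

```python
import errno
import time


def bind_with_retry(sock, address, retries=3, delay=5.0, sleep=time.sleep):
    """Bind sock to address, retrying while the address is not yet assigned.

    retries and delay are hypothetical knobs, not Ceph option names.
    Returns the number of attempts that were needed.
    """
    attempts = retries + 1  # the first try plus `retries` further tries
    for attempt in range(1, attempts + 1):
        try:
            sock.bind(address)
            return attempt
        except OSError as exc:
            # Any other error, or running out of attempts, is passed on
            # to the caller so the daemon can still fail loudly.
            if exc.errno != errno.EADDRNOTAVAIL or attempt == attempts:
                raise
            sleep(delay)
```

A caller would create the socket with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) and pass the monitor address and port as a tuple. Errors such as EADDRINUSE are deliberately not retried here, because waiting does not normally resolve them.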

Associated revisions

Revision 2d4dca75 (diff)
Added by Wido den Hollander over 3 years ago

SimpleMessenger: Retry binding on addresses if binding fails

If binding on an IP address fails, delay and retry again.

This happens mainly on IPv6 deployments. Due to DAD (Duplicate Address Detection)
or SLAAC it can be that IPv6 is not yet available when the daemons start.

Monitor daemons try to bind on a static IPv6 address and that might not be available
yet and that causes the monitor not to start.

Fixes: #10029

History

#1 Updated by Wido den Hollander over 3 years ago

I started playing with this a bit (no commits yet). I simply loop in SimpleMessenger's Accepter.cc and retry the bind a couple of times before giving up.

For IPv4 you have net.ipv4.ip_nonlocal_bind, but that does not exist for IPv6.

A work-around would be to disable DAD on the interfaces, but that isn't the best way imho.

On the internet you find all kinds of posts where people run into this issue. It's not limited to Ceph; the same goes for Nginx, for example.

#2 Updated by Wido den Hollander over 3 years ago

Logs I'm seeing on a monitor when it boots:

2014-12-08 13:04:16.291838 7f1fd75ef7c0  0 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3), process ceph-mon, pid 1897
2014-12-08 13:04:16.473408 7f1fd75ef7c0  0 starting mon.srv-51d5-11 rank 1 at [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789/0 mon_data /var/lib/ceph/mon/ceph-srv-51d5-11 fsid ada2c7ae-2483-4428-a159-1a20fe2a579d
2014-12-08 13:04:16.473445 7f1fd75ef7c0 -1 accepter.accepter.bind unable to bind to [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789: (99) Cannot assign requested address
2014-12-08 13:04:16.473457 7f1fd75ef7c0 -1 unable to bind monitor to [XXXX:XXXX:1:1:ec4:7aff:fe1e:390e]:6789/0

#3 Updated by Samuel Just over 3 years ago

  • Status changed from New to Resolved
