Bug #21813

OSD bind to IPv6 link-local address

Added by Wido den Hollander over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Category:
OSD
Target version:
-
Start date:
10/16/2017
Due date:
% Done:
0%

Source:
Tags:
messenger,luminous,osd
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Just observed this behavior on a cluster when upgrading to Luminous:

osd.2 up   in  weight 1 up_from 175547 up_thru 175711 down_at 175546 last_clean_interval [175531,175545) [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6808/2302 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6828/1002302 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6828/1002302 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6829/1002302 exists,up 7bdbcb99-fd7f-4880-859a-9e54d26c96da
osd.5 up   in  weight 1 up_from 175700 up_thru 175712 down_at 175699 last_clean_interval [175527,175698) [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6800/1658 [2a04:XXX:1:5:ec4:7aff:fe1e:44c8]:6809/1001658 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6809/1001658 [fe80::ec4:7aff:fe1e:44c8%bond0.204]:6810/1001658 exists,up c3e13f69-43b6-4441-922b-aef5d2bfe262
osd.30 up   in  weight 1 up_from 175677 up_thru 175845 down_at 175676 last_clean_interval [175665,175675) [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6800/1662 [2a04:XXX:1:5:ec4:7aff:fe1e:3f3c]:6808/1001662 [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6808/1001662 [fe80::ec4:7aff:fe1e:3f3c%bond0.204]:6809/1001662 exists,up 3c1aeb5b-0ace-49cf-84f8-c85dfedd7c2f

In this case OSDs 2, 5 and 30 bound to an IPv6 link-local (fe80::/10) address after they booted.
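To spot affected OSDs quickly, the dump output above can be scanned for link-local entries. This is a rough sketch, not part of Ceph: the function names are hypothetical, and it assumes the line layout shown above, where each address appears as `[addr]:port/nonce`. It checks the first hextet as text so that redacted addresses (2a04:XXX:...) do not break parsing.

```python
import re

# Matches "[addr]:port/nonce" entries. The last_clean_interval "[a,b)" is
# skipped because it contains a comma and does not end in "]:port/nonce".
ADDR_RE = re.compile(r"\[([0-9A-Za-z:.%]+)\]:(\d+)/(\d+)")


def is_link_local_text(addr):
    """Return True if the first hextet falls in fe80::/10 (fe80 through febf)."""
    first = addr.split("%", 1)[0].split(":", 1)[0]
    try:
        return (int(first, 16) & 0xFFC0) == 0xFE80
    except ValueError:
        # Empty first hextet (e.g. "::1") or non-hex text: not link-local.
        return False


def link_local_osds(dump_text):
    """Map OSD name -> list of link-local "[addr]:port" strings on its line."""
    result = {}
    for line in dump_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("osd."):
            continue
        hits = [
            f"[{addr}]:{port}"
            for addr, port, _nonce in ADDR_RE.findall(line)
            if is_link_local_text(addr)
        ]
        if hits:
            result[fields[0]] = hits
    return result
```

Feeding it the three lines above would list osd.2, osd.5 and osd.30 together with the fe80 entries each one registered.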

This is probably some form of race condition where the global unicast 2a04:X address isn't online yet when the OSDs boot.

These fe80 addresses should, however, not qualify as addresses to bind to: they can't be routed, so binding to them breaks traffic.


Related issues

Copied to Ceph - Backport #23501: luminous: OSD bind to IPv6 link-local address Resolved

History

#1 Updated by Sage Weil about 1 year ago

  • Status changed from New to Feedback

We do allow binding to 127.0.0.1 (and do that frequently for vstart.sh for devs). Is it okay to only allow loopback testing on ipv4 and not on ipv6?

#2 Updated by Wido den Hollander about 1 year ago

Sage Weil wrote:

We do allow binding to 127.0.0.1 (and do that frequently for vstart.sh for devs). Is it okay to only allow loopback testing on ipv4 and not on ipv6?

Yes, you can allow localhost for IPv6, that would be ::1

But link-local addresses are only valid on the local link (a single Layer 2 segment) and can't be routed. Those should not be selected. It could be made configurable, but it shouldn't be the default.

fe80::/10 is reserved for link-local with IPv6.
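For illustration, the distinction between loopback (::1, still allowed) and link-local (fe80::/10, to be skipped) could be expressed as below. This is only a sketch using Python's standard `ipaddress` module, not the Messenger code; `strip_zone` and `is_ipv6_link_local` are hypothetical helper names. The zone suffix (`%bond0.204`) is stripped by hand so the check does not depend on zone support in the parser.

```python
import ipaddress

# The prefix reserved for IPv6 link-local addresses.
LINK_LOCAL_V6 = ipaddress.ip_network("fe80::/10")


def strip_zone(addr):
    """Drop an interface zone suffix such as "%bond0.204", if present."""
    return addr.split("%", 1)[0]


def is_ipv6_link_local(addr):
    """Return True only for IPv6 addresses inside fe80::/10."""
    ip = ipaddress.ip_address(strip_zone(addr))
    return ip.version == 6 and ip in LINK_LOCAL_V6
```

With this check, ::1 and 127.0.0.1 are not flagged, so loopback testing keeps working for both address families.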

I've been looking to patch this, but I can't find the loop where the Messenger selects the available addresses.

#3 Updated by Wido den Hollander 12 months ago

I just noticed this again on a cluster which is running with IPv6 and Jewel:

Feb  2 13:42:19 ceph04 ceph-osd[3704]: 2018-02-02 13:42:19.287735 7f8ea9949700 -1 log_channel(cluster) log [ERR] : map e15776 had wrong cluster addr ([fe80::a236:9fff:fed8:54c0%enp5s0f1]:6801/3704 != my [2a05:XXXX:ff01:1:a236:9fff:fed8:54c0]:6801/3704)

The cluster in this case is running with Stateless Address Autoconfiguration (SLAAC), and the global IPs are not yet configured when the OSDs start after boot, so they only find their link-local address.

We should simply ignore fe80:: addresses when selecting an IPv6 address.
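Roughly, the selection I have in mind looks like the sketch below. It is not the actual Ceph patch: `pick_bind_address` is a hypothetical name, the candidate list stands in for whatever the Messenger enumerates from the interfaces, and the 2001:db8:: address in the usage note is a documentation prefix used in place of the redacted one above. Returning None when only link-local addresses exist is an assumption about the desired behavior; it lets the caller wait or retry instead of binding to fe80.

```python
import ipaddress


def pick_bind_address(candidates):
    """Return the first candidate that is not link-local, else None."""
    for addr in candidates:
        # Strip a zone suffix such as "%enp5s0f1" before parsing.
        ip = ipaddress.ip_address(addr.split("%", 1)[0])
        if ip.is_link_local:
            continue  # never bind to fe80::/10 (or IPv4 169.254.0.0/16)
        return addr
    return None
```

Called with `["fe80::a236:9fff:fed8:54c0%enp5s0f1", "2001:db8:ff01:1:a236:9fff:fed8:54c0"]` it returns the second entry. Called before SLAAC has finished, when only the fe80 entry is present, it returns None.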

#4 Updated by Kefu Chai 10 months ago

  • Status changed from Feedback to Need Review
  • Assignee set to Wido den Hollander
  • Backport set to luminous

#5 Updated by Kefu Chai 10 months ago

  • Status changed from Need Review to Pending Backport

#6 Updated by Nathan Cutler 10 months ago

  • Copied to Backport #23501: luminous: OSD bind to IPv6 link-local address added

#8 Updated by Nathan Cutler 10 months ago

  • Status changed from Pending Backport to Resolved
