Bug #49938

daemons bind to loopback iface

Added by Dan van der Ster 23 days ago. Updated 6 days ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific, octopus, nautilus
Regression: Yes
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There seems to be a regression in 14.2.18 whereby, in some environments, OSDs bind to 127.0.0.1.

E.g. https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3Z5J7MYZIPM3ZUTNU4LTWADXOSZVK27R/

This was probably introduced in https://github.com/ceph/ceph/commit/89321762ad4cfdd1a68cae467181bdd1a501f14d

I don't think ifa_name contains a colon. On my machine I tested the example code at https://man7.org/linux/man-pages/man3/getifaddrs.3.html and it prints just `lo`:

# ./a.out
lo       AF_PACKET (17)
                tx_packets = 1683333517; rx_packets = 1683333517
                tx_bytes   = 1685898949; rx_bytes   = 1685898949
eno1     AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
ens785f0 AF_PACKET (17)
                tx_packets = 3787675362; rx_packets = 4154015233
                tx_bytes   = 3146993958; rx_bytes   = 1004572644
ens785f1 AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
eno2     AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
lo       AF_INET (2)
                address: <127.0.0.1>
ens785f0 AF_INET (2)
                address: <10.116.6.8>
lo       AF_INET6 (10)
                address: <::1>
ens785f0 AF_INET6 (10)
                address: <fd01:1458:e00:1e::100:5>
ens785f0 AF_INET6 (10)
                address: <fe80::bdbd:76be:63fd:a4c2%ens785f0>

So we need to also explicitly skip when the iface name is exactly 'lo'.
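
A minimal sketch of such a check, assuming the existing code only matches names that begin with `lo:` (as the colon remark above suggests). The helper name `is_loopback_name` is hypothetical and this is not the actual patch:

```cpp
#include <string_view>

// Hypothetical helper, not the actual Ceph code: treat the bare name "lo"
// and alias-style names such as "lo:0" as loopback.
bool is_loopback_name(std::string_view ifa_name) {
  if (ifa_name == "lo") {
    return true;  // exact match, which a "lo:" prefix test alone would miss
  }
  // substr clamps the length, so names shorter than 3 chars are safe here.
  return ifa_name.substr(0, 3) == "lo:";
}
```

With this, `lo` and `lo:0` are skipped, while names that merely start with the same letters (e.g. `lowpan0`) are not.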

Marking this as critical because it can take down entire clusters if operators run yum update.


Related issues

Related to Ceph - Bug #50012: Ceph-osd refuses to bind on an IP on the local loopback lo (again) [Fix Under Review]
Related to Ceph - Bug #43417: Since the local loopback address is set to a virtual IP,OSD can't restart . [Resolved]
Related to Ceph - Bug #48893: Ceph-osd refuses to bind on an IP on the local loopback lo [Resolved]
Copied to Ceph - Backport #49995: octopus: daemons bind to loopback iface [Resolved]
Copied to Ceph - Backport #49996: nautilus: daemons bind to loopback iface [Resolved]
Copied to Ceph - Backport #49997: pacific: daemons bind to loopback iface [Resolved]

History

#1 Updated by Dan van der Ster 23 days ago

  • Status changed from New to Fix Under Review
  • Assignee set to Dan van der Ster
  • Pull request ID set to 40334

#2 Updated by Dan van der Ster 23 days ago

I suppose this will re-break the use-case described in #48893.

I would argue that, out of the box, Ceph should do the right thing on the most common deployments. But if we also want to support this BGP-to-the-host use case out of the box, the heuristic for picking addresses needs to be improved further.
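
One possible shape for such a heuristic, as a rough sketch only: prefer an address on a non-loopback interface, and fall back to a loopback one only when nothing else is available. The `Candidate` struct and the function names are hypothetical, and the candidates are assumed to be already filtered to the configured network:

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical record for one interface address; not a Ceph type.
struct Candidate {
  std::string ifa_name;
  std::string address;
};

// Assumption: "lo" and "lo:<alias>" are the loopback names to deprioritize.
bool on_loopback(std::string_view ifa_name) {
  return ifa_name == "lo" || ifa_name.substr(0, 3) == "lo:";
}

// Two passes: first any non-loopback candidate, then loopback as a last
// resort (the BGP-to-the-host case, where the service IP lives on lo).
std::optional<Candidate> pick_address(const std::vector<Candidate>& cands) {
  for (const auto& c : cands) {
    if (!on_loopback(c.ifa_name)) {
      return c;
    }
  }
  if (!cands.empty()) {
    return cands.front();
  }
  return std::nullopt;
}
```

On a common deployment this never selects `lo` while another matching interface exists, yet a host whose only matching address sits on `lo` still gets one.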

#4 Updated by Stefan Kooman 21 days ago

I agree with Dan that a 14.2.19 should be released ASAP to fix this issue. Otherwise I'm afraid this will impact many more clusters.

#5 Updated by Neha Ojha 21 days ago

  • Backport set to pacific, octopus, nautilus

#6 Updated by Kefu Chai 20 days ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Backport Bot 20 days ago

#8 Updated by Backport Bot 20 days ago

  • Copied to Backport #49996: nautilus: daemons bind to loopback iface added

#9 Updated by Backport Bot 20 days ago

#10 Updated by Kefu Chai 20 days ago

  • Related to Bug #50012: Ceph-osd refuses to bind on an IP on the local loopback lo (again) added

#11 Updated by Nathan Cutler 15 days ago

  • Related to Bug #43417: Since the local loopback address is set to a virtual IP,OSD can't restart . added

#12 Updated by Nathan Cutler 15 days ago

  • Related to Bug #48893: Ceph-osd refuses to bind on an IP on the local loopback lo added

#13 Updated by Loïc Dachary 6 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
