Bug #49938

daemons bind to loopback iface

Added by Dan van der Ster 4 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific, octopus, nautilus
Regression:
Yes
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

There seems to be a regression in 14.2.18 whereby, in some environments, OSDs bind to 127.0.0.1.

E.g. https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3Z5J7MYZIPM3ZUTNU4LTWADXOSZVK27R/

This was probably introduced in https://github.com/ceph/ceph/commit/89321762ad4cfdd1a68cae467181bdd1a501f14d

I don't think ifa_name contains a colon. On my machine I tested the example code at https://man7.org/linux/man-pages/man3/getifaddrs.3.html and it outputs just `lo`:

# ./a.out
lo       AF_PACKET (17)
                tx_packets = 1683333517; rx_packets = 1683333517
                tx_bytes   = 1685898949; rx_bytes   = 1685898949
eno1     AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
ens785f0 AF_PACKET (17)
                tx_packets = 3787675362; rx_packets = 4154015233
                tx_bytes   = 3146993958; rx_bytes   = 1004572644
ens785f1 AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
eno2     AF_PACKET (17)
                tx_packets =          0; rx_packets =          0
                tx_bytes   =          0; rx_bytes   =          0
lo       AF_INET (2)
                address: <127.0.0.1>
ens785f0 AF_INET (2)
                address: <10.116.6.8>
lo       AF_INET6 (10)
                address: <::1>
ens785f0 AF_INET6 (10)
                address: <fd01:1458:e00:1e::100:5>
ens785f0 AF_INET6 (10)
                address: <fe80::bdbd:76be:63fd:a4c2%ens785f0>

So we also need to explicitly skip the interface when its name is exactly 'lo'.
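
To illustrate the check being proposed, here is a minimal sketch. It is not the code from the linked commit or from the pull request: the names `is_loopback_name` and `pick_bind_addr` are hypothetical, and the (interface name, address) pairs stand in for what getifaddrs would return. It assumes C++17.

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical helper, not Ceph's actual function: returns true when the
// interface name is the loopback device itself ("lo") or an alias of it
// written with a colon (for example "lo:0").
bool is_loopback_name(std::string_view ifa_name) {
  return ifa_name == "lo" || ifa_name.substr(0, 3) == "lo:";
}

// Hypothetical selection loop: returns the first address whose interface
// is not loopback, or an empty string if every candidate is loopback.
std::string pick_bind_addr(
    const std::vector<std::pair<std::string, std::string>>& ifaces) {
  for (const auto& [name, addr] : ifaces) {
    if (is_loopback_name(name)) {
      continue;  // skip "lo" as well as "lo:*"
    }
    return addr;
  }
  return {};
}
```

With the interfaces from the output above, a check that only matched a "lo:" prefix would let `lo` / 127.0.0.1 through, whereas this sketch skips it and returns 10.116.6.8.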

Marking this as critical because it can take down entire clusters if operators run `yum update`.


Related issues

Related to RADOS - Bug #50012: Ceph-osd refuses to bind on an IP on the local loopback lo (again) Fix Under Review
Related to Ceph - Bug #43417: Since the local loopback address is set to a virtual IP,OSD can't restart . Resolved
Related to Ceph - Bug #48893: Ceph-osd refuses to bind on an IP on the local loopback lo Resolved
Copied to Ceph - Backport #49995: octopus: daemons bind to loopback iface Resolved
Copied to Ceph - Backport #49996: nautilus: daemons bind to loopback iface Resolved
Copied to Ceph - Backport #49997: pacific: daemons bind to loopback iface Resolved

History

#1 Updated by Dan van der Ster 4 months ago

  • Status changed from New to Fix Under Review
  • Assignee set to Dan van der Ster
  • Pull request ID set to 40334

#2 Updated by Dan van der Ster 4 months ago

I suppose this will re-break the use-case described in #48893.

I would argue that out of the box (OOTB), Ceph should do the right thing on the most common deployments. But if we also want to support this BGP-to-the-host use case OOTB, the heuristic for picking addresses needs to be improved further.

#4 Updated by Stefan Kooman 4 months ago

I agree with Dan that a 14.2.19 should be released ASAP to fix this issue. Otherwise I'm afraid this will impact many more clusters.

#5 Updated by Neha Ojha 4 months ago

  • Backport set to pacific, octopus, nautilus

#6 Updated by Kefu Chai 4 months ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Backport Bot 4 months ago

#8 Updated by Backport Bot 4 months ago

  • Copied to Backport #49996: nautilus: daemons bind to loopback iface added

#9 Updated by Backport Bot 4 months ago

#10 Updated by Kefu Chai 4 months ago

  • Related to Bug #50012: Ceph-osd refuses to bind on an IP on the local loopback lo (again) added

#11 Updated by Nathan Cutler 4 months ago

  • Related to Bug #43417: Since the local loopback address is set to a virtual IP,OSD can't restart . added

#12 Updated by Nathan Cutler 4 months ago

  • Related to Bug #48893: Ceph-osd refuses to bind on an IP on the local loopback lo added

#13 Updated by Loïc Dachary 4 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
