Bug #49938
daemons bind to loopback iface
Description
There seems to be a regression in 14.2.18 whereby, in some environments, OSDs will bind to 127.0.0.1.
E.g. https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3Z5J7MYZIPM3ZUTNU4LTWADXOSZVK27R/
This was probably introduced in https://github.com/ceph/ceph/commit/89321762ad4cfdd1a68cae467181bdd1a501f14d
I don't think ifa_name contains a colon. On my machine I tested the example code at https://man7.org/linux/man-pages/man3/getifaddrs.3.html and it outputs just `lo`:
```
# ./a.out
lo         AF_PACKET (17)
           tx_packets = 1683333517; rx_packets = 1683333517
           tx_bytes   = 1685898949; rx_bytes   = 1685898949
eno1       AF_PACKET (17)
           tx_packets = 0; rx_packets = 0
           tx_bytes   = 0; rx_bytes   = 0
ens785f0   AF_PACKET (17)
           tx_packets = 3787675362; rx_packets = 4154015233
           tx_bytes   = 3146993958; rx_bytes   = 1004572644
ens785f1   AF_PACKET (17)
           tx_packets = 0; rx_packets = 0
           tx_bytes   = 0; rx_bytes   = 0
eno2       AF_PACKET (17)
           tx_packets = 0; rx_packets = 0
           tx_bytes   = 0; rx_bytes   = 0
lo         AF_INET (2)
           address: <127.0.0.1>
ens785f0   AF_INET (2)
           address: <10.116.6.8>
lo         AF_INET6 (10)
           address: <::1>
ens785f0   AF_INET6 (10)
           address: <fd01:1458:e00:1e::100:5>
ens785f0   AF_INET6 (10)
           address: <fe80::bdbd:76be:63fd:a4c2%ens785f0>
```
So we also need to explicitly skip entries whose interface name is exactly `lo`.
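A minimal sketch of the logic in question, assuming the same getifaddrs(3) loop shape as the man-page example linked above (illustrative only, not the actual Ceph code): a colon check on ifa_name only matches alias interfaces such as `eth0:1`, so the primary loopback entry, named plainly `lo`, slips through unless it is skipped by exact name.

```c
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    struct ifaddrs *ifaddr, *ifa;

    if (getifaddrs(&ifaddr) == -1) {
        perror("getifaddrs");
        return 1;
    }

    for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;

        /* A colon only appears in alias names such as "eth0:1"; it never
         * matches the primary loopback entry, whose name is just "lo". */
        if (strchr(ifa->ifa_name, ':') != NULL)
            continue;

        /* The additional check proposed above: skip the loopback
         * interface itself by exact name match. */
        if (strcmp(ifa->ifa_name, "lo") == 0)
            continue;

        char buf[INET_ADDRSTRLEN];
        const struct sockaddr_in *sin =
            (const struct sockaddr_in *)ifa->ifa_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("bind candidate: %-10s %s\n", ifa->ifa_name, buf);
    }

    freeifaddrs(ifaddr);
    return 0;
}
```

On a host like the one in the output above, this would leave only the non-loopback AF_INET candidate (ens785f0, 10.116.6.8) eligible for binding.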
Marking this as critical because it can take down entire clusters when operators run `yum update`.
Related issues
History
#1 Updated by Dan van der Ster about 1 year ago
- Status changed from New to Fix Under Review
- Assignee set to Dan van der Ster
- Pull request ID set to 40334
#2 Updated by Dan van der Ster about 1 year ago
I suppose this will re-break the use case described in #48893.
I would argue that, out of the box, Ceph should do the right thing on the most common deployments. But if we also want to support this BGP-to-the-host use case out of the box, the heuristic for picking addresses needs to be improved further (see the sketch below).
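One possible direction for such a refinement, sketched here as an assumption rather than the merged fix: key the decision on the address rather than the interface name, so that a routable address assigned to `lo` (the BGP-to-the-host case) remains eligible while 127.0.0.0/8 and ::1 are still rejected.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not Ceph's actual code): report whether a
 * candidate address is itself a loopback address, independent of
 * which interface carries it. */
static int is_loopback_addr(const struct sockaddr *sa)
{
    if (sa->sa_family == AF_INET) {
        const struct sockaddr_in *sin = (const struct sockaddr_in *)sa;
        /* 127.0.0.0/8: the most significant byte is 127 */
        return (ntohl(sin->sin_addr.s_addr) >> 24) == 127;
    }
    if (sa->sa_family == AF_INET6) {
        const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)sa;
        return IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr);
    }
    return 0;
}
```

With a check like this, `lo` carrying only 127.0.0.1 and ::1 is still excluded, while a host advertising its service address on `lo` via BGP would keep that address as a bind candidate.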
#3 Updated by Dan van der Ster about 1 year ago
All daemons are impacted by this, not just OSDs: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IAGFUXMRZU77M4KYS5NW5MZ6YJ7YN4G/
#4 Updated by Stefan Kooman about 1 year ago
I agree with Dan that a 14.2.19 should be released ASAP to fix this issue. Otherwise, I'm afraid this will impact many more clusters.
#5 Updated by Neha Ojha about 1 year ago
- Backport set to pacific, octopus, nautilus
#6 Updated by Kefu Chai about 1 year ago
- Status changed from Fix Under Review to Pending Backport
#7 Updated by Backport Bot about 1 year ago
- Copied to Backport #49995: octopus: daemons bind to loopback iface added
#8 Updated by Backport Bot about 1 year ago
- Copied to Backport #49996: nautilus: daemons bind to loopback iface added
#9 Updated by Backport Bot about 1 year ago
- Copied to Backport #49997: pacific: daemons bind to loopback iface added
#10 Updated by Kefu Chai about 1 year ago
- Related to Bug #50012: Ceph-osd refuses to bind on an IP on the local loopback lo (again) added
#11 Updated by Nathan Cutler about 1 year ago
- Related to Bug #43417: Since the local loopback address is set to a virtual IP, OSD can't restart added
#12 Updated by Nathan Cutler about 1 year ago
- Related to Bug #48893: Ceph-osd refuses to bind on an IP on the local loopback lo added
#13 Updated by Loïc Dachary about 1 year ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".