Bug #51257
mgr/cephadm: Cannot add managed (ceph apply) mon daemons on different subnets
Description
In our network setup we have an IP (layer 3) fabric down to the servers, using /128
IPv6 addresses [3] and BGP to each server, so there is no notion of a layer 2 domain in our infrastructure.
After bootstrapping a cluster we tried to add mon daemons with $ ceph orch apply mon label:mon
only to get the following messages [4] in the mgr daemon logs:
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph101: could not verify host allowed virtual ips
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph102: could not verify host allowed virtual ips
We took a look at cephadm's code a little bit: the cephadm manager module performs a check [5], only when the deployed service is a MON daemon, to verify that the host's network matches the public_network.
This check calls the matches_network function [6], which is where things break for our setup.
Taking a closer look at the matches_network function [6] we can see that:
def matches_network(host):
    # type: (str) -> bool
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that
    # host
    return len(self.mgr.cache.networks[host].get(public_network, [])) > 0
1) It will always return False if the public_network is unset.
2) It searches a cache [7] inside the manager daemon for at least one IP address on the defined public_network, and fails if it doesn't find any.
However, even when we tried adding each node's separate /128 prefix to the public_network variable, we still couldn't get the mon daemons to spin up, and the same message appeared in the logs.
We took a deeper look at the code to find out why this still would not work even with a matching public_network:
the aforementioned cache [7] fetches the networks_and_interfaces key for the affected host from the KV store [8], and for our hosts the desired addresses are not correctly matched and stored there.
Let's take ceph101 for example:
root@ceph101:/# ip -6 a show dev ipfabric0
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fd42:abcd::cef:101/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::18fa:ceff:fef8:7502/64 scope link
       valid_lft forever preferred_lft forever
root@ceph101:/# ceph config-key get mgr/cephadm/host.ceph101 | jq '.networks_and_interfaces'
{
  "fe80::/64": {
    "ens1f0": [
      "fe80::ae1f:6bff:fef8:de4e"
    ],
    "ens1f1": [
      "fe80::ae1f:6bff:fef8:de4f"
    ],
    "ipfabric0": [
      "fe80::18fa:ceff:fef8:7502"
    ]
  }
}
As seen above, the ipfabric0 interface has a valid global IPv6 address, fd42:abcd::cef:101/128, which is never detected by the tool that fills the cache.
Instead, the only address stored for it is a link-local [9] one, which is irrelevant here.
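To make the failure concrete, here is a minimal sketch (standalone Python, not cephadm code) that reproduces the matches_network lookup against the cached data shown above; the cache dict simply mirrors the networks_and_interfaces blob for ceph101, and the public_network value is the per-host /128 prefix we tried:

# Minimal reproduction of the matches_network lookup (sketch, not cephadm code).
# The dict below mirrors the networks_and_interfaces blob cached for ceph101.
networks = {
    "fe80::/64": {
        "ens1f0": ["fe80::ae1f:6bff:fef8:de4e"],
        "ens1f1": ["fe80::ae1f:6bff:fef8:de4f"],
        "ipfabric0": ["fe80::18fa:ceff:fef8:7502"],
    }
}

public_network = "fd42:abcd::cef:101/128"  # the per-host prefix we tried

def matches_network(host_networks, public_network):
    if not public_network:
        return False
    # same lookup as cephadm: any entry under the public_network key
    return len(host_networks.get(public_network, [])) > 0

print(matches_network(networks, public_network))  # False -- only fe80::/64 is cached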
To find out why our /128 IPv6 addresses got rejected by this tool, we checked the specific function [10] responsible for filling this cache. Under the hood it calls _parse_ipv6_route, passing it all the routes and IP addresses it found on the system with ip -6 route ls and ip -6 addr ls respectively.
The routes are used to decide which networks this host is connected to, and the snippet below rejects every route without a subnet mask:
if '/' not in net:  # only consider networks with a mask
    continue
This leads the orchestrator to reject all of our routes:
root@ceph101:/# ip -6 ro ls | grep '::cef'
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512
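For illustration, applying that same mask-only filter to the routes above (a standalone sketch, not the actual _parse_ipv6_route code) shows that every one of our host routes is dropped, so no global network ever makes it into the cache:

# Sketch: apply the "only consider networks with a mask" filter to our routes.
routes = [
    "fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512",
]

kept = []
for line in routes:
    net = line.split()[0]
    if '/' not in net:  # only consider networks with a mask
        continue        # all of our /128 host routes are dropped here
    kept.append(net)

print(kept)  # [] -- nothing is left to populate the cache from the routing table

Treating a bare IPv6 host route as an implicit /128, or falling back to the addresses reported by ip -6 addr ls, would presumably let setups like ours through.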
To sum up:
cephadm assumes that all mon daemons are on the same layer 2 domain [1], namely the public_network [2].
This assumption makes it impossible for some network setups, like our own, to deploy mon daemons on different subnets.
Frankly, I can't see why cephadm should care about the underlying network topology when it comes to mon daemons.
The PR [1] that introduced this check makes it look like a safeguard against deploying mon daemons on unwanted hosts, but I believe that is the responsibility of the placement spec [11].
It would be nice to either change the logic so it doesn't rely on the system routes, or to add a flag like mgr/cephadm/skip-mon-network-checks to bypass this restrictive behavior.
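For completeness, here is a rough sketch of what such an opt-out could look like inside matches_network; the skip_mon_network_checks option is purely hypothetical and does not exist in cephadm today:

def matches_network(host):
    # type: (str) -> bool
    # Hypothetical opt-out: trust the placement spec and skip the route-derived
    # network check entirely (no such option exists in cephadm today).
    if getattr(self.mgr, 'skip_mon_network_checks', False):
        return True
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that host
    return len(self.mgr.cache.networks[host].get(public_network, [])) > 0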
[1]: https://github.com/ceph/ceph/pull/33952/
[2]: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#confval-public_network
[3]: https://datatracker.ietf.org/doc/html/rfc8273
[4]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/schedule.py#L307
[5]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L553
[6]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L539
[7]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L225
[8]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L312
[9]: https://en.wikipedia.org/wiki/Link-local_address
[10]: https://github.com/ceph/ceph/blob/v16.2.4/src/cephadm/cephadm#L4602
[11]: https://docs.ceph.com/en/latest/cephadm/service-management/#placement-specification