Bug #51257: mgr/cephadm: Cannot add managed (ceph apply) mon daemons on different subnets - Orchestrator - Ceph


Added by Aggelos Avgerinos almost 3 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: Redouane Kachach Elhichou
Category: orchestrator
Target version:
% Done:
Source:
Tags: cephadm
Backport: quincy,pacific
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 46202
Crash signature (v1):
Crash signature (v2):

Description

Our network is an IP (layer 3) fabric that extends all the way to the servers: each server uses a /128 IPv6 address³ and speaks BGP, so there is no notion of a layer 2 domain in our infrastructure.

After bootstrapping a cluster we tried to add mon daemons with $ ceph orch apply mon label:mon, only to get the following message⁴ in the mgr daemon logs:


Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700  0 log_channel(cephadm) log [INF] : Filtered out host ceph101: could not verify host allowed virtual ips
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700  0 log_channel(cephadm) log [INF] : Filtered out host ceph102: could not verify host allowed virtual ips

We dug into cephadm's code a bit:

The cephadm manager module performs a check⁵, only when the deployed service is a MON daemon, that the host's network matches the public_network.
This check calls the matches_network function⁶, which is where things break for our setup.

Taking a closer look at the matches_network function⁶ we can see that:

def matches_network(host):
   # type: (str) -> bool
   if not public_network:
       return False
   # make sure we have 1 or more IPs for that network on that
   # host
   return len(self.mgr.cache.networks[host].get(public_network, [])) > 0

1) It always returns False if public_network is unset.
2) It searches a cache⁷ inside the manager daemon for at least one IP address on the defined public_network and returns False if it finds none.
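
The two behaviours listed above can be reproduced with a small stand-alone sketch. This is not cephadm's code: the cache is modelled as a plain dict (host, then network, then interface, then addresses), and its contents are an assumption, chosen to represent a host for which only a link-local network was recorded.

```python
from typing import Dict, List

# Hypothetical stand-in for self.mgr.cache.networks:
# host -> network (CIDR string) -> interface -> list of addresses.
NetworkCache = Dict[str, Dict[str, Dict[str, List[str]]]]


def matches_network(cache: NetworkCache, public_network: str, host: str) -> bool:
    """Same logic as the quoted snippet, with the cache passed in explicitly."""
    if not public_network:
        return False
    # Exact string lookup: the key must equal public_network character for character.
    return len(cache[host].get(public_network, [])) > 0


# Assumed cache contents: only a link-local network is known for this host.
cache: NetworkCache = {
    "ceph101": {"fe80::/64": {"ipfabric0": ["fe80::18fa:ceff:fef8:7502"]}},
}

unset_result = matches_network(cache, "", "ceph101")
missing_result = matches_network(cache, "fd42:abcd::cef:101/128", "ceph101")
link_local_result = matches_network(cache, "fe80::/64", "ceph101")
print(unset_result, missing_result, link_local_result)
```

In this model the check can only succeed when the cache already holds a key identical to the configured public_network, so a network that never reaches the cache can never match, whatever public_network is set to.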

However, even after adding each node's /128 prefix to the public_network option, the mon daemons still would not start, and the same message appeared in the logs.

We looked deeper into the code to find out why this still failed even with a matching public_network:

The aforementioned cache⁷ fetches the networks_and_interfaces key for the affected host from the KV store⁸.

For our hosts we can see that the desired addresses are not correctly matched and stored in the KV store.
Let's take ceph101 for example:


root@ceph101:/# ip -6 a show dev ipfabric0
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fd42:abcd::cef:101/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::18fa:ceff:fef8:7502/64 scope link
       valid_lft forever preferred_lft forever

root@ceph101:/# ceph config-key get mgr/cephadm/host.ceph101 | jq '.networks_and_interfaces'
{
 "fe80::/64": {
    "ens1f0": [
      "fe80::ae1f:6bff:fef8:de4e" 
    ],
    "ens1f1": [
      "fe80::ae1f:6bff:fef8:de4f" 
    ],
    "ipfabric0": [
      "fe80::18fa:ceff:fef8:7502" 
    ]
  }
}

As seen above, the ipfabric0 interface has a valid global IPv6 address, fd42:abcd::cef:101/128, which is never detected by the tool that fills the cache.
Instead, the only address recorded for it is a link-local⁹ one, which is irrelevant.

To find out why our /128 IPv6 addresses were rejected by this tool, we checked the function¹⁰ responsible for filling the cache. Under the hood this function calls _parse_ipv6_route, passing it all the routes and IP addresses it found on the system with ip -6 route ls and ip -6 addr ls respectively.

The routes passed to it are used to decide which networks this host is connected to, and the snippet below rejects all routes without a subnet mask.

        if '/' not in net:  # only consider networks with a mask
            continue

This led the orchestrator to reject all of our routes:


root@ceph101:/# ip -6 ro ls | grep '::cef'
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512
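
For illustration only, here is a minimal sketch of how a route destination without a mask could be normalised to a host network instead of being skipped. The helper name route_to_network is hypothetical and this is not the fix that was merged; it relies on Python's standard ipaddress module, which treats a bare address as a /128 (IPv6) or /32 (IPv4) network. Whether routes towards other hosts should count as networks the local host belongs to is a separate question that this sketch does not answer.

```python
import ipaddress
from typing import Optional


def route_to_network(dest: str) -> Optional[str]:
    """Return the route destination as a CIDR string, or None if it is not a network.

    Hypothetical helper: a destination without '/' becomes a host network
    (/128 for IPv6, /32 for IPv4) rather than being dropped.
    """
    try:
        # strict=False tolerates destinations that have host bits set.
        return str(ipaddress.ip_network(dest, strict=False))
    except ValueError:
        # 'default', 'unreachable', and similar keywords are not networks.
        return None


host_route = route_to_network("fd42:abcd::cef:102")
masked_route = route_to_network("fe80::/64")
default_route = route_to_network("default")
print(host_route, masked_route, default_route)
```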

To sum up:

cephadm assumes that all mon daemons are on the same layer 2 domain¹, namely the public_network².
This assumption leaves some network setups, like our own, unable to deploy mon daemons on different subnets.
Frankly, I can't see why cephadm should care about the underlying network topology when it comes to mon daemons.
The PR¹ that introduced this check makes it seem like a safeguard against deploying mon daemons on unwanted hosts, but I believe that is the responsibility of the placement spec¹¹.

It would be nice to either change the logic so that it does not rely on the system routes, or add a flag such as mgr/cephadm/skip-mon-network-checks to bypass this restrictive behavior.

[1]: https://github.com/ceph/ceph/pull/33952/
[2]: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#confval-public_network
[3]: https://datatracker.ietf.org/doc/html/rfc8273
[4]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/schedule.py#L307
[5]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L553
[6]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L539
[7]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L225
[8]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L312
[9]: https://en.wikipedia.org/wiki/Link-local_address
[10]: https://github.com/ceph/ceph/blob/v16.2.4/src/cephadm/cephadm#L4602
[11]: https://docs.ceph.com/en/latest/cephadm/service-management/#placement-specification

Related issues 1 (0 open — 1 closed)

Updated by Sebastian Wagner almost 3 years ago

I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.

What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?

Updated by Aggelos Avgerinos almost 3 years ago

Thanks for the quick reply.
Sebastian Wagner wrote:

> I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.

Yes, I'll take a look as soon as possible.

> What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?

That's the plan: detect the `/32` and `/128` networks correctly and then use a comma-separated list for public_network.
My only concern is that this mechanism currently relies on the output of `ip route ls` rather than `ip addr`, even though `ip addr` would make more sense to me.
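
As a rough sketch of the address-based approach mentioned above, the networks a host belongs to can be derived directly from its interface addresses with Python's ipaddress module. The function name networks_from_addrs and the (interface, CIDR) input format are assumptions made for this example; real `ip addr` output would have to be parsed first.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, List, Tuple


def networks_from_addrs(addrs: List[Tuple[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group addresses by network and interface: network -> interface -> [ips].

    Hypothetical helper; `addrs` holds (interface, "address/prefixlen") pairs.
    """
    result: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for iface, cidr in addrs:
        ip = ipaddress.ip_interface(cidr)
        # ip.network is the enclosing network, e.g. fe80::/64 or a /128 host network.
        result[str(ip.network)][iface].append(str(ip.ip))
    return {net: dict(ifaces) for net, ifaces in result.items()}


# Addresses taken from the ipfabric0 interface shown in the description.
derived = networks_from_addrs([
    ("ipfabric0", "fd42:abcd::cef:101/128"),
    ("ipfabric0", "fe80::18fa:ceff:fef8:7502/64"),
])
print(derived)
```

With this approach the /128 address appears as its own network entry without consulting the routing table at all.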

Updated by Jarad Olson about 2 years ago

To everyone monitoring this issue:
I'd been banging my head on this problem for a couple of days now and stumbled on this issue. Wanted to see if there'd been a pull request to fix it yet. If not, I can at least document what I've done so far to identify the problem in the code and what I've done to work around this issue.

There are actually two problems here:
1. The regex used to parse the output of `ip route ls` doesn't account for routes with a `via` keyword
That's easy enough to fix:
Before:

route_p = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')

After:

route_p = re.compile(r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$')

Ok. Now it will at least read the routes. Now for the next problem:

2. How the function matches the routes with interface addresses
Like Aggelos said, the function assumes that all the mon daemons are on the same network. That's a naive assumption at best, and it isn't something Ceph really needs to care about. The assumption artificially limits Ceph's use cases. I don't know how to fix this, and there doesn't seem to be a flag that would let me bypass it either.
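
For the first problem, the effect of the regex change can be checked in isolation. The two patterns below are copied from the Before/After lines above; the sample route lines are made up for this test and are not taken from a real host.

```python
import re

# Patterns copied from the "Before" and "After" lines above.
old_route_p = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')
new_route_p = re.compile(
    r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$'
)

# Assumed sample lines in the style of `ip -6 route ls` output.
direct_route = "fe80::/64 dev ens1f0 proto kernel metric 256 pref medium"
via_route = "fd42:abcd::/64 via fe80::1 dev ens1f0 proto bird metric 512 pref medium"

for line in (direct_route, via_route):
    old_m = old_route_p.match(line)
    new_m = new_route_p.match(line)
    # Report which pattern recognises the line and which interface it extracts.
    print(
        line.split()[0],
        "old:", old_m.group(2) if old_m else None,
        "new:", new_m.group(2) if new_m else None,
    )
```

The old pattern only recognises the directly connected route, while the new one recognises both and still captures the destination in group 1 and the interface in group 2.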

Aggelos, did you ever make a pull request for this?
Thanks!

Updated by Redouane Kachach Elhichou about 2 years ago

Status changed from New to Need More Info

Updated by Redouane Kachach Elhichou about 2 years ago

Related to Bug #53496: cephadm: list-networks swallows /128 networks, breaking the orchestrator ("Filtered out host mon1: does not belong to mon public_network") added
