Bug #51257

closed

mgr/cephadm: Cannot add managed (ceph apply) mon daemons on different subnets

Added by Aggelos Avgerinos almost 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Category:
orchestrator
Target version:
-
% Done:

0%

Source:
Tags:
cephadm
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In our network setup we have an IP (layer 3) fabric to the server, using /128 IPv6 addresses [3] and BGP to the server, so there is no notion of a layer 2 domain in our infrastructure.

After bootstrapping a cluster we tried to add mon daemons with $ ceph orch apply mon label:mon, only to get the following message [4] in the mgr daemon logs:


Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700  0 log_channel(cephadm) log [INF] : Filtered out host ceph101: could not verify host allowed virtual ips
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700  0 log_channel(cephadm) log [INF] : Filtered out host ceph102: could not verify host allowed virtual ips

We took a brief look at cephadm's code:

The cephadm manager module performs a check [5], only when the deployed service is a MON daemon, to verify that the host's network matches the public_network.
This check calls the matches_network function [6], which is where things break for our setup.

Taking a closer look at the matches_network function [6], we can see that:

def matches_network(host):
   # type: (str) -> bool
   if not public_network:
       return False
   # make sure we have 1 or more IPs for that network on that
   # host
   return len(self.mgr.cache.networks[host].get(public_network, [])) > 0

1) It will always return False if the public_network is unset.
2) It searches a cache [7] inside the manager daemon for at least 1 IP address on the defined public_network and fails if it doesn't find one.
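
The two cases can be reproduced in isolation. This is only a minimal sketch: a plain dict stands in for `self.mgr.cache.networks`, and `public_network` is passed as an argument instead of being read from the config. Both are simplifications for illustration, not the cephadm code itself. The cache content mirrors the `networks_and_interfaces` dump for ceph101 in this report.

```python
from typing import Dict, List

# host -> network -> interface -> addresses (illustrative stand-in for the
# manager's network cache; the real object lives in self.mgr.cache.networks)
NetworkCache = Dict[str, Dict[str, Dict[str, List[str]]]]


def matches_network(cache: NetworkCache, host: str, public_network: str) -> bool:
    """Simplified restatement of the check quoted above."""
    if not public_network:
        return False
    # at least one entry must be cached under exactly this network string
    return len(cache[host].get(public_network, {})) > 0


cache: NetworkCache = {
    "ceph101": {
        "fe80::/64": {"ipfabric0": ["fe80::18fa:ceff:fef8:7502"]},
    }
}

# Case 1: public_network unset
print(matches_network(cache, "ceph101", ""))
# Case 2: public_network set, but the /128 was never cached
print(matches_network(cache, "ceph101", "fd42:abcd::cef:101/128"))
# Only the link-local network is present in the cache
print(matches_network(cache, "ceph101", "fe80::/64"))
```

Because the lookup is a plain key match on the network string, the host is filtered out whenever the cache holds no entry for that exact network, regardless of what addresses are configured on the interface.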

However, even when we added each node's /128 prefix to the public_network variable, we still couldn't get the mon daemons to spin up, and the same message appeared in the logs.

We took a deeper look at the code to find out why this still would not work even with a matching public_network:

The aforementioned cache [7] fetches the networks_and_interfaces key for the affected host from the KV store [8].

For our hosts we can see that the desired addresses are not correctly matched and stored in the KV store.
Let's take ceph101 for example:


root@ceph101:/# ip -6 a show dev ipfabric0
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fd42:abcd::cef:101/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::18fa:ceff:fef8:7502/64 scope link
       valid_lft forever preferred_lft forever

root@ceph101:/# ceph config-key get mgr/cephadm/host.ceph101 | jq '.networks_and_interfaces'
{
 "fe80::/64": {
    "ens1f0": [
      "fe80::ae1f:6bff:fef8:de4e" 
    ],
    "ens1f1": [
      "fe80::ae1f:6bff:fef8:de4f" 
    ],
    "ipfabric0": [
      "fe80::18fa:ceff:fef8:7502" 
    ]
  }
}

As seen above, the ipfabric0 interface has a valid global IPv6 address, fd42:abcd::cef:101/128, which is never detected by the tool that fills the cache.
Instead, the only matched address is a link-local [9] one, which is irrelevant.

To find out why our /128 IPv6 addresses got rejected by this tool, we checked the specific function [10] responsible for filling this cache. Under the hood, this function calls _parse_ipv6_route, passing all the routes and IP addresses it found on the system with ip -6 route ls and ip -6 addr ls respectively.

The routes passed to it are used to decide which networks this host is connected to, and the snippet below rejects all routes without a subnet mask.

        if '/' not in net:  # only consider networks with a mask
            continue
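
To illustrate the effect on host routes such as the ones BIRD installs in this setup, here is a minimal sketch. The first function mimics the quoted check. The second is a hypothetical alternative, not the actual fix: it relies on `ipaddress.ip_network`, which assigns /128 (or /32 for IPv4) to a bare address. The function names and the sample destinations are made up for illustration.

```python
import ipaddress
from typing import List


def networks_with_mask(destinations: List[str]) -> List[str]:
    """Mimic the quoted check: drop destinations that carry no '/'."""
    return [net for net in destinations if '/' in net]


def networks_including_host_routes(destinations: List[str]) -> List[str]:
    """Hypothetical alternative: keep maskless destinations as host routes."""
    result = []
    for net in destinations:
        try:
            # a bare IPv6 address becomes a /128 network, a bare IPv4 a /32
            result.append(str(ipaddress.ip_network(net)))
        except ValueError:
            continue  # keywords such as "default" are not networks
    return result


dests = ["fd42:abcd::cef:102", "fd42:abcd::cef:103", "fe80::/64", "default"]
print(networks_with_mask(dests))
print(networks_including_host_routes(dests))
```

With the mask-only filter, every /128 host route is dropped and only the link-local network survives, which matches the cache content shown earlier.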

This led the orchestrator to reject all of our routes:


root@ceph101:/# ip -6 ro ls | grep '::cef'
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512

To sum up:

cephadm assumes that all mon daemons are on the same layer 2 domain [1], namely the public_network [2].
This assumption leaves some network setups, like ours, unable to deploy mon daemons on different subnets.
Frankly, I can't see why cephadm should care about the underlying network topology when it comes to mon daemons.
The PR [1] that introduced this check makes it seem like some kind of safeguard against deploying mon daemons on unwanted hosts, but I believe this is the responsibility of the placement spec [11].

It would be nice to either change the logic so that it does not rely on the system routes, or add a flag like mgr/cephadm/skip-mon-network-checks to bypass this restrictive behavior.

[1]: https://github.com/ceph/ceph/pull/33952/
[2]: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#confval-public_network
[3]: https://datatracker.ietf.org/doc/html/rfc8273
[4]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/schedule.py#L307
[5]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L553
[6]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L539
[7]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L225
[8]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L312
[9]: https://en.wikipedia.org/wiki/Link-local_address
[10]: https://github.com/ceph/ceph/blob/v16.2.4/src/cephadm/cephadm#L4602
[11]: https://docs.ceph.com/en/latest/cephadm/service-management/#placement-specification


Related issues 1 (0 open, 1 closed)

Related to Orchestrator - Bug #53496: cephadm: list-networks swallows /128 networks, breaking the orchestrator ("Filtered out host mon1: does not belong to mon public_network") Resolved

Actions #1

Updated by Sebastian Wagner almost 3 years ago

I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.

What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?
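
As an illustration of what a comma-separated public_network means for host routes, here is a minimal sketch that checks an address against each listed network. The helper name is hypothetical and this is not how cephadm performs the check (the quoted matches_network looks up the network string in a cache); it only shows that a list of /128 prefixes covers exactly the listed hosts.

```python
import ipaddress


def ip_in_public_network(ip: str, public_network: str) -> bool:
    """Return True if ip falls inside any network of a comma-separated list."""
    addr = ipaddress.ip_address(ip)
    for item in public_network.split(','):
        item = item.strip()
        if not item:
            continue
        net = ipaddress.ip_network(item)
        # compare only within the same address family
        if addr.version == net.version and addr in net:
            return True
    return False


public_network = "fd42:abcd::cef:101/128, fd42:abcd::cef:102/128"
print(ip_in_public_network("fd42:abcd::cef:102", public_network))
print(ip_in_public_network("fd42:abcd::cef:103", public_network))
```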

Actions #2

Updated by Aggelos Avgerinos almost 3 years ago

Thanks for the quick reply.
Sebastian Wagner wrote:

I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.

Yes, I'll take a look as soon as possible.

What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?

That's the plan: detect the `/32` and `/128` networks correctly and then use a comma-separated list in public_network.
My only concern is that this mechanism currently relies on the output of `ip route ls`; relying on `ip addr` instead would make more sense to me.
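
A rough sketch of the `ip addr` idea, assuming the networks are derived directly from the `inet6` lines rather than from the routing table. The function name and the regex are hypothetical and not part of cephadm; the sample text is modeled on the ipfabric0 output in the description.

```python
import ipaddress
import re
from typing import Dict, List

# matches lines such as "    inet6 fd42:abcd::cef:101/128 scope global"
ADDR_RE = re.compile(r'^\s+inet6 (\S+) scope (\S+)')


def networks_from_addr_output(output: str) -> Dict[str, List[str]]:
    """Map network -> addresses for every non-link-local inet6 line."""
    nets: Dict[str, List[str]] = {}
    for line in output.splitlines():
        m = ADDR_RE.match(line)
        if not m:
            continue
        iface = ipaddress.ip_interface(m.group(1))
        if iface.ip.is_link_local:
            continue  # fe80::/10 addresses are irrelevant here
        nets.setdefault(str(iface.network), []).append(str(iface.ip))
    return nets


sample = """\
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fd42:abcd::cef:101/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::18fa:ceff:fef8:7502/64 scope link
       valid_lft forever preferred_lft forever
"""
print(networks_from_addr_output(sample))
```

Here the /128 comes straight from the configured address, so no route with a mask is needed for the network to be recorded.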

Actions #3

Updated by Jarad Olson about 2 years ago

To everyone monitoring this issue:
I'd been banging my head on this problem for a couple of days now and stumbled on this issue. Wanted to see if there'd been a pull request to fix it yet. If not, I can at least document what I've done so far to identify the problem in the code and what I've done to work around this issue.

There are actually two problems here:
1. The regex used to parse the output of `ip route ls` doesn't account for routes with a `via` keyword
That's easy enough to fix:
Before:

route_p = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')

After:

route_p = re.compile(r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$')

Ok. Now it will at least read the routes. Now for the next problem:

2. How the function matches the routes with interface addresses
Like Aggelos said, the function assumes that all the mon daemons are on the same network. That's a naive assumption at best, and it doesn't make much sense for Ceph to care about this at all. The assumption artificially limits Ceph's use cases. I don't know how to fix this, and there doesn't seem to be a flag that would allow me to bypass it either.
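
Going back to the first problem, the two patterns quoted in this comment can be compared side by side. The two route lines below are hypothetical samples written for illustration, not output from a real host.

```python
import re

# Patterns as quoted in this comment
before = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')
after = re.compile(
    r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$'
)

# Hypothetical sample lines: one directly connected route, one with "via"
direct = 'fe80::/64 dev ens1f0 proto kernel metric 256 pref medium'
via = 'fd42:abcd::cef:102 via fe80::1 dev ens1f0 proto bird metric 512 pref medium'

for name, pattern in (('before', before), ('after', after)):
    for line in (direct, via):
        m = pattern.match(line)
        print(name, '|', line, '->', m.groups() if m else None)
```

The original pattern only matches the directly connected route, while the modified one matches both and still captures the same five fields.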

Aggelos, did you ever make a pull request for this?
Thanks!

Actions #4

Updated by Redouane Kachach Elhichou about 2 years ago

  • Status changed from New to Need More Info
Actions #5

Updated by Redouane Kachach Elhichou about 2 years ago

  • Related to Bug #53496: cephadm: list-networks swallows /128 networks, breaking the orchestrator ("Filtered out host mon1: does not belong to mon public_network") added
Actions #6

Updated by Redouane Kachach Elhichou about 2 years ago

  • Assignee set to Redouane Kachach Elhichou
Actions #7

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Status changed from Need More Info to In Progress
Actions #8

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 46202
Actions #9

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #10

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Backport set to quincy,pacific
Actions #11

Updated by Redouane Kachach Elhichou almost 2 years ago

  • Status changed from Pending Backport to Resolved