Bug #51257
Status: Closed
mgr/cephadm: Cannot add managed (ceph apply) mon daemons on different subnets
Description
In our network setup we have an IP (layer 3) fabric using /128 IPv6 addresses [3] and BGP to the servers, in which case there is no notion of a layer 2 domain in our infrastructure.
After bootstrapping a cluster we tried to add mon daemons with $ ceph orch apply mon label:mon, only to get the following message [4] in the mgr daemon logs:
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph101: could not verify host allowed virtual ips
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph102: could not verify host allowed virtual ips
We took a look at cephadm's code a little bit: the cephadm manager module performs a check [5], only when the deployed service is a MON daemon, to verify that the host's network matches the public_network. This check calls the matches_network function [6], which is where things break for our setup. Taking a closer look at matches_network [6] we can see that:
def matches_network(host):
    # type: (str) -> bool
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that
    # host
    return len(self.mgr.cache.networks[host].get(public_network, [])) > 0
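To make the failure mode concrete, here is a minimal, self-contained sketch of that check (the `networks_cache` dict and the standalone function signature are illustrative stand-ins, not cephadm's actual API), run against a cache that, like ours, holds only the link-local network:

```python
# Illustrative stand-in for cephadm's per-host networks cache, holding
# what ceph101's cache actually contained: only the fe80::/64 network.
networks_cache = {
    'ceph101': {
        'fe80::/64': {'ipfabric0': ['fe80::18fa:ceff:fef8:7502']},
    },
}

def matches_network(host, public_network):
    # type: (str, str) -> bool
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that host
    return len(networks_cache[host].get(public_network, [])) > 0

# The host is filtered out whether public_network is unset...
print(matches_network('ceph101', ''))                        # False
# ...or set to the /128 network that never made it into the cache.
print(matches_network('ceph101', 'fd42:abcd::cef:101/128'))  # False
```

Either way the host fails the check and gets filtered out of the placement candidates.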
1) It will always return False if the public_network is unset.
2) It searches a cache [7] inside the manager daemon for at least one IP address on the defined public_network, and fails if it doesn't find any.
However, even when we tried adding each separate /128 prefix for each node to the public_network variable, we still couldn't get the mon daemons to spin up, with the same message in the logs. We took a deeper look in the code to find out why this still would not work even with a matching public_network:
The aforementioned cache [7] fetches the networks_and_interfaces key for the affected host from the KV store [8]. For our hosts we can see that the desired addresses are not correctly matched and stored in the KV store. Let's take ceph101 for example:
root@ceph101:/# ip -6 a show dev ipfabric0
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
inet6 fd42:abcd::cef:101/128 scope global
valid_lft forever preferred_lft forever
inet6 fe80::18fa:ceff:fef8:7502/64 scope link
valid_lft forever preferred_lft forever
root@ceph101:/# ceph config-key get mgr/cephadm/host.ceph101 | jq '.networks_and_interfaces'
{
  "fe80::/64": {
    "ens1f0": [
      "fe80::ae1f:6bff:fef8:de4e"
    ],
    "ens1f1": [
      "fe80::ae1f:6bff:fef8:de4f"
    ],
    "ipfabric0": [
      "fe80::18fa:ceff:fef8:7502"
    ]
  }
}
As seen above, the ipfabric0 interface has a valid global IPv6 address fd42:abcd::cef:101/128, which is never detected by the tool that fills the cache. Instead, the matching address is a link-local [9] one, which is irrelevant.
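As an aside, the distinction between the two addresses on ipfabric0 is easy to check with Python's stdlib ipaddress module; the fd42: address is a routable unique local address (ULA), while the fe80: one is link-local:

```python
import ipaddress

# The two addresses on ipfabric0, as shown by `ip -6 a` above.
global_addr = ipaddress.ip_address('fd42:abcd::cef:101')
ll_addr = ipaddress.ip_address('fe80::18fa:ceff:fef8:7502')

print(global_addr.is_link_local)  # False: a ULA, routable inside the fabric
print(ll_addr.is_link_local)      # True: only valid on the local link
```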
In order to find out why our /128 IPv6 addresses got rejected by this tool, we checked the specific function [10] responsible for filling this cache. Under the hood, this function calls _parse_ipv6_route, passing it all the routes and IP addresses it finds on the system via ip -6 route ls and ip -6 addr ls respectively.
The routes are used to decide which networks this host is connected to, and the snippet below rejects all routes without a subnet mask:
if '/' not in net:  # only consider networks with a mask
    continue
This led the orchestrator to reject all of our routes:
root@ceph101:/# ip -6 ro ls | grep '::cef'
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512
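A short standalone sketch (simplified from the cache-filling code, not cephadm's exact implementation; the fe80::/64 kernel route line is an assumed typical entry, not taken from the output above) shows how that guard drops every one of these BGP host routes, since `ip -6 ro ls` prints them without a mask:

```python
# Illustrative parse of `ip -6 ro ls` output: the guard that skips
# entries without '/' drops every BGP host route, so only the
# (irrelevant) link-local network survives.
routes = """\
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fe80::/64 dev ipfabric0 proto kernel metric 256 pref medium
"""

kept = []
for line in routes.splitlines():
    net = line.split()[0]
    if '/' not in net:  # only consider networks with a mask
        continue
    kept.append(net)

print(kept)  # ['fe80::/64']
```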
To sum up:
cephadm assumes that all mon daemons are on the same layer 2 domain [1], namely the public_network [2].
This assumption makes some network setups, like our own, unable to deploy mon daemons on different subnets.
Frankly, I can't see why cephadm should care about the underlying network topology when it comes to mon daemons.
The PR [1] that introduced this check makes it seem like some kind of safeguard to avoid deploying mon daemons on unwanted hosts, but I believe that is the responsibility of the placement spec [11].
It would be nice to either change the logic to not rely on the system routes, or add a flag like mgr/cephadm/skip-mon-network-checks to bypass this restrictive behavior.
[1]: https://github.com/ceph/ceph/pull/33952/
[2]: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#confval-public_network
[3]: https://datatracker.ietf.org/doc/html/rfc8273
[4]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/schedule.py#L307
[5]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L553
[6]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L539
[7]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L225
[8]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L312
[9]: https://en.wikipedia.org/wiki/Link-local_address
[10]: https://github.com/ceph/ceph/blob/v16.2.4/src/cephadm/cephadm#L4602
[11]: https://docs.ceph.com/en/latest/cephadm/service-management/#placement-specification
Updated by Sebastian Wagner almost 3 years ago
I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.
What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?
Updated by Aggelos Avgerinos almost 3 years ago
Thanks for the quick reply.
Sebastian Wagner wrote:
I see this as a valid bug that needs a fix. Aggelos, by far the fastest way to fix this would be for you to create a pull request.
Yes, I'll take a look as soon as possible.
What's important is that you properly set public_network. Note that public_network can be a comma-separated list of networks. If you set public_network and would create a PR to properly fill the cache of networks, would that solve your issue?
That's the plan: detect the `/32` and `/128` networks correctly, and then use a comma-separated list in public_network.
My only concern is that this mechanism now relies on the output of `ip route ls` instead of `ip addr`, which would make more sense to me.
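A hedged sketch of that direction (the function name and signature are hypothetical, not the eventual fix): once the `/128` networks are cached correctly, matching a host address against a comma-separated public_network is straightforward with the stdlib ipaddress module:

```python
import ipaddress

def in_public_network(addr, public_network):
    # type: (str, str) -> bool
    """Check addr against a comma-separated list of networks
    (hypothetical helper, illustrating the matching only)."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(net.strip())
               for net in public_network.split(','))

nets = 'fd42:abcd::cef:101/128,fd42:abcd::cef:102/128'
print(in_public_network('fd42:abcd::cef:101', nets))  # True
print(in_public_network('fd42:abcd::beef:1', nets))   # False
```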
Updated by Jarad Olson about 2 years ago
To everyone monitoring this issue:
I'd been banging my head on this problem for a couple of days now and stumbled on this issue. Wanted to see if there'd been a pull request to fix it yet. If not, I can at least document what I've done so far to identify the problem in the code and what I've done to work around this issue.
There are actually two problems here:
1. The regex used to parse the output of `ip route ls` doesn't account for routes with a `via` keyword
That's easy enough to fix:
Before:
route_p = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')
After:
route_p = re.compile(r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$')
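The difference can be verified against a synthetic route line (the sample line below is illustrative; routes in the setup from this issue may also lack the dev/pref fields entirely, which neither pattern handles):

```python
import re

# Jarad's before/after patterns, copied from the comment above.
old_p = re.compile(r'^(\S+) dev (\S+) proto (\S+) metric (\S+) .*pref (\S+)$')
new_p = re.compile(r'^(\S+) (?:via \S+)? ?dev (\S+) (?:proto (\S+))? ?metric (\S+) .*pref (\S+)$')

# An illustrative route with a "via" hop.
line = 'fd42:abcd::/64 via fe80::1 dev ipfabric0 proto bird metric 512 pref medium'

print(old_p.match(line))           # None: "via fe80::1" breaks the old pattern
print(new_p.match(line).groups())  # ('fd42:abcd::/64', 'ipfabric0', 'bird', '512', 'medium')
```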
Ok. Now it will at least read the routes. Now for the next problem:
2. How the function matches the routes with interface addresses
Like Aggelos said, the function makes the assumption that all the mon daemons are on the same network. That's a naive assumption to make, at best, and doesn't make a lot of sense for Ceph to care about. The assumption artificially limits Ceph's use cases. I don't know how to fix this, and there doesn't seem to be a flag that would allow me to bypass it either.
Aggelos, did you ever make a pull request for this?
Thanks!
Updated by Redouane Kachach Elhichou about 2 years ago
- Status changed from New to Need More Info
Updated by Redouane Kachach Elhichou about 2 years ago
- Related to Bug #53496: cephadm: list-networks swallows /128 networks, breaking the orchestrator ("Filtered out host mon1: does not belong to mon public_network") added
Updated by Redouane Kachach Elhichou about 2 years ago
- Assignee set to Redouane Kachach Elhichou
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from Need More Info to In Progress
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 46202
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Redouane Kachach Elhichou almost 2 years ago
- Backport set to quincy,pacific
Updated by Redouane Kachach Elhichou almost 2 years ago
- Status changed from Pending Backport to Resolved