Bug #51257
mgr/cephadm: Cannot add managed (ceph apply) mon daemons on different subnets
Description
In our network setup we have an IP (layer 3) fabric down to the servers, using /128
IPv6 addresses [3] and BGP to each server, so there is no notion of a layer 2 domain in our infrastructure.
After bootstrapping a cluster we tried to add mon daemons with $ ceph orch apply mon label:mon
only to get the following messages [4] in the mgr daemon logs:
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph101: could not verify host allowed virtual ips
Jun 14 18:18:28 ceph101 conmon[87926]: debug 2021-06-14T15:18:28.755+0000 7fd61fa18700 0 log_channel(cephadm) log [INF] : Filtered out host ceph102: could not verify host allowed virtual ips
We took a look at cephadm's code a little bit: the cephadm manager module performs a check [5], only when the deployed service is a MON daemon, to verify that the host's network matches the public_network.
This check calls the matches_network function [6], which is where things break for our setup.
Taking a closer look at the matches_network function [6] we can see that:
def matches_network(host):
    # type: (str) -> bool
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that
    # host
    return len(self.mgr.cache.networks[host].get(public_network, [])) > 0
1) It will always return False if the public_network is unset.
2) It searches a cache [7] inside the manager daemon for at least one IP address on the defined public_network, and fails if it doesn't find any.
However, even when we tried adding each node's separate /128 prefix to the public_network variable, we still couldn't get the mon daemons to spin up, and the same message appeared in the logs.
We took a deeper look at the code to find out why this still would not work even with a matching public_network:
the aforementioned cache [7] fetches the networks_and_interfaces key for the affected host from the KV store [8], and for our hosts the desired addresses are not correctly matched and stored there.
Let's take ceph101 for example:
root@ceph101:/# ip -6 a show dev ipfabric0
8: ipfabric0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    inet6 fd42:abcd::cef:101/128 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::18fa:ceff:fef8:7502/64 scope link
       valid_lft forever preferred_lft forever
root@ceph101:/# ceph config-key get mgr/cephadm/host.ceph101 | jq '.networks_and_interfaces'
{
  "fe80::/64": {
    "ens1f0": [
      "fe80::ae1f:6bff:fef8:de4e"
    ],
    "ens1f1": [
      "fe80::ae1f:6bff:fef8:de4f"
    ],
    "ipfabric0": [
      "fe80::18fa:ceff:fef8:7502"
    ]
  }
}
As seen above, the ipfabric0 interface has a valid global IPv6 address, fd42:abcd::cef:101/128, which is never detected by the tool that fills the cache.
Instead, the only address stored for it is a link-local [9] one, which is irrelevant here.
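To make the failure concrete, here is a minimal sketch (standalone Python, not cephadm code) that reproduces the matches_network lookup against the cached data shown above; the cache dict simply mirrors the networks_and_interfaces blob for ceph101, and the public_network value is the per-host /128 prefix we tried:

# Minimal reproduction of the matches_network lookup (sketch, not cephadm code).
# The dict below mirrors the networks_and_interfaces blob cached for ceph101.
networks = {
    "fe80::/64": {
        "ens1f0": ["fe80::ae1f:6bff:fef8:de4e"],
        "ens1f1": ["fe80::ae1f:6bff:fef8:de4f"],
        "ipfabric0": ["fe80::18fa:ceff:fef8:7502"],
    }
}

public_network = "fd42:abcd::cef:101/128"  # the per-host prefix we tried

def matches_network(host_networks, public_network):
    if not public_network:
        return False
    # same lookup as cephadm: any entry under the public_network key
    return len(host_networks.get(public_network, [])) > 0

print(matches_network(networks, public_network))  # False -- only fe80::/64 is cached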
To find out why our /128 IPv6 addresses got rejected by this tool, we checked the specific function [10] responsible for filling this cache. Under the hood it calls _parse_ipv6_route, passing it all the routes and IP addresses it found on the system with ip -6 route ls and ip -6 addr ls respectively.
The routes are used to decide which networks this host is connected to, and the snippet below rejects every route without a subnet mask:
if '/' not in net:  # only consider networks with a mask
    continue
This leads the orchestrator to reject all of our routes:
root@ceph101:/# ip -6 ro ls | grep '::cef'
fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512
fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512
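For illustration, applying that same mask-only filter to the routes above (a standalone sketch, not the actual _parse_ipv6_route code) shows that every one of our host routes is dropped, so no global network ever makes it into the cache:

# Sketch: apply the "only consider networks with a mask" filter to our routes.
routes = [
    "fd42:abcd::cef:102 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:103 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:104 proto bird src fd42:abcd::cef:101 metric 512",
    "fd42:abcd::cef:105 proto bird src fd42:abcd::cef:101 metric 512",
]

kept = []
for line in routes:
    net = line.split()[0]
    if '/' not in net:  # only consider networks with a mask
        continue        # all of our /128 host routes are dropped here
    kept.append(net)

print(kept)  # [] -- nothing is left to populate the cache from the routing table

Treating a bare IPv6 host route as an implicit /128, or falling back to the addresses reported by ip -6 addr ls, would presumably let setups like ours through.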
To sum up:
cephadm assumes that all mon daemons are on the same layer 2 domain [1], namely the public_network [2].
This assumption makes it impossible for some network setups, like our own, to deploy mon daemons on different subnets.
Frankly, I can't see why cephadm should care about the underlying network topology when it comes to mon daemons.
The PR [1] that introduced this check makes it look like a safeguard against deploying mon daemons on unwanted hosts, but I believe that is the responsibility of the placement spec [11].
It would be nice to either change the logic so it doesn't rely on the system routes, or to add a flag like mgr/cephadm/skip-mon-network-checks to bypass this restrictive behavior.
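For completeness, here is a rough sketch of what such an opt-out could look like inside matches_network; the skip_mon_network_checks option is purely hypothetical and does not exist in cephadm today:

def matches_network(host):
    # type: (str) -> bool
    # Hypothetical opt-out: trust the placement spec and skip the route-derived
    # network check entirely (no such option exists in cephadm today).
    if getattr(self.mgr, 'skip_mon_network_checks', False):
        return True
    if not public_network:
        return False
    # make sure we have 1 or more IPs for that network on that host
    return len(self.mgr.cache.networks[host].get(public_network, [])) > 0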
[1]: https://github.com/ceph/ceph/pull/33952/
[2]: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#confval-public_network
[3]: https://datatracker.ietf.org/doc/html/rfc8273
[4]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/schedule.py#L307
[5]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L553
[6]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/serve.py#L539
[7]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L225
[8]: https://github.com/ceph/ceph/blob/v16.2.4/src/pybind/mgr/cephadm/inventory.py#L312
[9]: https://en.wikipedia.org/wiki/Link-local_address
[10]: https://github.com/ceph/ceph/blob/v16.2.4/src/cephadm/cephadm#L4602
[11]: https://docs.ceph.com/en/latest/cephadm/service-management/#placement-specification