Bug #61852
open
Ceph NFS "HAProxy_Hosts" configuration issue
Added by Goutham Pacha Ravi 11 months ago.
Updated 9 months ago.
Description
Ceph's Reef release added support for deploying the ceph-nfs service and the ceph-ingress service with a "haproxy-protocol" ingress mode [1][2].
There's a config issue that happens when attempting this on ceph nodes that have multiple IP addresses.
NFS cluster creation and some debug info: https://paste.openstack.org/show/bVyG6N1E876PY1G30fZo/
NFS cluster configuration (`/etc/ganesha/ganesha.conf`) is extracted from one of the nfs containers created; it is identical across the nodes.
With this configuration, NFS mounts fail because HAProxy PROXY protocol parsing fails on NFS-Ganesha:
Mount failure on the client:
# mount -vvvt nfs 192.168.130.21:/volumes/_nogroup/399c3d6a-b50d-4e33-87d7-8848c1680e3b/406f33bf-26ab-4e04-aa27-0be027af4d24 /mnt
mount.nfs: timeout set for Wed Jun 28 17:29:34 2023
mount.nfs: trying text-based options 'vers=4.2,addr=192.168.130.21,clientaddr=192.168.130.28'
mount.nfs: mount(2): Input/output error
mount.nfs: mount system call failed
Logs from an NFS-Ganesha service when mount failed: https://paste.openstack.org/show/belF8k02E7HmmFAGhuNN/
[1] https://docs.ceph.com/en/latest/mgr/nfs/#create-nfs-ganesha-cluster
[2] https://github.com/ceph/ceph/pull/50614
Related issues: 1 (1 open, 0 closed)
> NFS cluster creation and some debug info: https://paste.openstack.org/show/bVyG6N1E876PY1G30fZo/
I don't see the corresponding HAProxy configuration file. Could you please share that as well?
> NFS cluster configuration (`/etc/ganesha/ganesha.conf`) is extracted from one of the nfs containers created; it is identical across the nodes.
> With this configuration, NFS mounts fail because HAProxy PROXY protocol parsing fails on NFS-Ganesha:
I don't think that's quite right. The parsing of the configured IP addresses is OK, and Ganesha likely parses the haproxy protocol message OK as well. The issue is that HAProxy connects from a different (local) IP address than the one in the Ganesha config: Ganesha fails to see that IP in the list of hosts permitted to send proxy protocol info and rejects the connection.
What I need to know is whether cephadm/mgr is generating an inconsistent set of configuration files, telling HAProxy to connect to NFS over one IP address while configuring the permitted IP to be something else, OR whether the configuration is not inconsistent but merely vague, and HAProxy chose a valid but unexpected IP; in that case the fix would be to cast a wider net when configuring Ganesha.
Thanks!
Thanks for sharing that file. Adam reproduced it yesterday and we debugged the issue for a bit.
Here's a simplified summary of the problem:
When cephadm configures HAProxy it adds a list of back-end servers to the HAProxy conf file. The addresses stored in this file are the "primary" addresses that cephadm associates with each host.

When we are configured to use the haproxy protocol we put a list of IP addresses that are permitted to send proxy protocol messages to Ganesha. Remember that these are the addresses of the clients. In our simple example we would list the following addresses:
- 192.168.12.10
- 192.168.12.11
- 10.2.2.112
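For illustration, a list like the one above would land in Ganesha's config roughly as follows (values taken from this example; the exact file cephadm generates may differ):

```
NFS_CORE_PARAM {
    # Hosts permitted to send PROXY protocol headers to Ganesha
    # (illustrative values from the simplified example above)
    HAProxy_Hosts = 192.168.12.10, 192.168.12.11, 10.2.2.112;
}
```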
Similarly, in our simplified model, we can have just one NFS Ganesha server running.
Thus we would add the following backend servers to the HAProxy config:
- 10.2.2.112
In this simple example we run only one Ganesha server, but the principle applies even if there are more servers set up for the HAProxy backend.
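As an HAProxy config fragment this could look roughly like the following (an illustrative sketch, not the file cephadm actually generates; `send-proxy-v2` is HAProxy's standard server option for emitting PROXY protocol v2 headers to a backend):

```
backend nfs-backend
    mode tcp
    # single Ganesha backend from the example, with PROXY protocol enabled
    server nfs.hostc 10.2.2.112:2049 send-proxy-v2
```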
The following block diagram shows how an NFS connection initiated by "client" first reaches HAProxy running on HostA via the VIP (Virtual IP). HAProxy then makes a connection to the Ganesha server running on HostC.
                           +-----------------------+
 --------                  | eth0: 192.168.12.10   |
 |client| ---------------->| eth1: 10.2.2.110      |-------\
 --------   srcIP=xxx      |                       |       |
            dstIP=VIP      | VIP, HAProxy          |       |
                           +-----------------------+       |
                             HostA                         |
                                                           |
                           +-----------------------+       |
                           | eth0: 192.168.12.11   |       |
                           | eth1: 10.2.2.111      |       |
                           |                       |       |
                           | HAProxy               |       |
                           +-----------------------+       |
                             HostB       srcIP=10.2.2.110  |
                                         dstIP=10.2.2.112  |
                           +-----------------------+       |
                           | eth0: 192.168.12.12   |       |
                           | eth1: 10.2.2.112      |<------/
                           |                       |
                           | HAProxy, NFSGanesha   |
                           +-----------------------+
                             HostC
The srcIP for the connection made by HAProxy on HostA to NFSGanesha on HostC is '10.2.2.110'; from Ganesha's perspective this is the IP of the client. '10.2.2.110' is not in the list of clients allowed to use the HAProxy protocol, therefore the connection is rejected.
The issue arises because the IPs are not all on the same subnet, so even though we configured the "same list" of IPs in the ganesha and haproxy configuration, the source address ends up being different from the allowed source address for that host in the ganesha allowed-proxy-protocol hosts list.
We do not have a simple and obvious solution at this time. Currently we have the following ideas:
- add all known ip addresses for all hosts into the list
- require all the addresses to be on the same subnet, if not then the configuration will fail (and raise a health warning)
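The first idea can be sketched in Python (assumed data shapes and a hypothetical helper name, not the actual cephadm code; the per-host address lists are the example values from the diagram above):

```python
# Sketch of the "add all known IP addresses for all hosts" idea:
# put every address cephadm knows for each ingress host into Ganesha's
# allow list, instead of only each host's "primary" address.
hosts = {
    "HostA": ["192.168.12.10", "10.2.2.110"],
    "HostB": ["192.168.12.11", "10.2.2.111"],
    "HostC": ["192.168.12.12", "10.2.2.112"],
}

def haproxy_hosts(host_addrs):
    """Flatten every known address into the HAProxy_Hosts list, so the
    source address HAProxy ends up using is allowed no matter which
    local IP the kernel picks for the backend connection."""
    return sorted(addr for addrs in host_addrs.values() for addr in addrs)

print(", ".join(haproxy_hosts(hosts)))
```

With this approach 10.2.2.110 (the source address HAProxy on HostA actually used in the failure above) is present in the allow list, so that connection would be accepted.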
- Pull request ID set to 52410
- Status changed from New to Fix Under Review
Thank you; you folks definitely know the internals better and are best placed to design this, but I had an opinion to offer:
> - add all known ip addresses for all hosts into the list
++
> - require all the addresses to be on the same subnet, if not then the configuration will fail (and raise a health warning)
This would be hard to do, in my opinion; multiple network subnets could exist on a host for other reasons, and NFS traffic may be consumed from any of them.
I had a further question: what if an interface has multiple IPs configured? Isn't this then an n*m set of IPs we need to collate and set as "HAProxy_Hosts"?
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef
- Copied to Backport #62463: reef: Ceph NFS "HAProxy_Hosts" configuration issue added
- Tags set to backport_processed