multisite option to enable keepalive
We have a multisite setup with firewalls between the two sites. The firewalls silently drop all connections that are idle for longer than one hour. /proc/sys/net/ipv4/tcp_keepalive_time is configured correctly, but it seems Ceph does not enable keepalive on its sockets at all, which results in frequent drops. netstat shows the keepalive timer as off for every connection:
[adminuser@node1 ~]$ netstat -no | grep :443
tcp 0 0 10.10.11.11:443 10.10.12.13:54666 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:6838 10.10.11.12:44344 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:443 10.10.12.12:38604 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:54960 10.10.12.11:443 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:34454 10.10.12.13:443 ESTABLISHED off (0.00/0/0)
It would be nice to be able to enable keepalive for the beast frontend, so that at least the multisite replication uses keepalive on its TCP connections.
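For context, the kernel's tcp_keepalive_time only applies to sockets that have opted in with SO_KEEPALIVE, which is why the timer column above reads "off" even though the sysctl is set. A minimal sketch of that opt-in, in Python purely for illustration (this is not Ceph code, and the helper name `keepalive_enabled` is made up for the example):

```python
import socket


def keepalive_enabled(sock: socket.socket) -> bool:
    """Return True if SO_KEEPALIVE is set on the given socket."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # A freshly created TCP socket does not have keepalive enabled,
    # regardless of the tcp_keepalive_* sysctl values.
    default_state = keepalive_enabled(sock)

    # The application has to opt in explicitly; only then do the
    # kernel-wide keepalive timers apply to this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    enabled_state = keepalive_enabled(sock)
finally:
    sock.close()

print("default:", default_state, "after setsockopt:", enabled_state)
```

A socket that has opted in would be expected to show a keepalive timer in the netstat -o output instead of "off".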
#4 Updated by Casey Bodley 3 months ago
- Status changed from New to Need More Info
Hi Dieter, I saw your comment about this on https://github.com/ceph/ceph/pull/41824 requesting a frontend option for tcp keepalive.
Can you help me understand why this is an issue with multisite? Does sync have issues reestablishing these connections after they're dropped? I would expect sync to poll more than once per hour - but if it doesn't, then it seems entirely reasonable to drop the connection.
I also don't think TCP keepalive is a good idea for server sockets, because the setting applies to all of its connections. As a server, we want to reclaim the resources of idle client connections - and you pointed at the tracker issue for the beast request timeouts as a possible solution. Does that feature not resolve your issue?
If not, we might instead consider enabling CURLOPT_TCP_KEEPALIVE on the multisite client side.
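To illustrate what client-side keepalive amounts to at the socket level, here is a rough sketch in Python (not Ceph or libcurl code). libcurl exposes CURLOPT_TCP_KEEPIDLE and CURLOPT_TCP_KEEPINTVL alongside CURLOPT_TCP_KEEPALIVE; the sketch sets the corresponding per-socket options directly. The function name and the 600/60/5 values are assumptions chosen only so that probes start well before the one-hour firewall timeout from the report.

```python
import socket

# One-hour idle timeout of the firewall, as described in the report.
FIREWALL_IDLE_TIMEOUT = 3600


def enable_client_keepalive(sock, idle=600, interval=60, probes=5):
    """Enable keepalive on one client socket with per-socket timers.

    The default values are illustrative assumptions, not Ceph settings.
    """
    if idle >= FIREWALL_IDLE_TIMEOUT:
        raise ValueError("idle time must be shorter than the firewall timeout")

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the tcp_keepalive_* sysctls. These option
    # names are platform-specific, so skip any that are not available.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    enable_client_keepalive(client)
    keepalive_on = client.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
    idle_opt = getattr(socket, "TCP_KEEPIDLE", None)
    idle_value = (
        client.getsockopt(socket.IPPROTO_TCP, idle_opt) if idle_opt is not None else None
    )
finally:
    client.close()
```

Because the options are set per socket, only the outgoing connections are affected and the server-side listener keeps its current behaviour.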
#5 Updated by Dieter Roels 3 months ago
The problem we are having is with the firewalls that sit between the clusters in our multisite setup. The firewalls silently drop unused connections after one hour. A few months ago we overloaded the firewalls with big spikes in replication traffic, which impacted other production applications. While debugging that issue we noticed that the rgws frequently try to use connections that the firewall has already dropped because of inactivity. This causes a lot of dropped packets and reconnects, which further aggravates the firewall overload. If these connections used TCP keepalive, the firewall would not silently close them.
If there is an option to enable it only on connections from the multisite client, that would fix this problem with less impact.