Feature #48402

multisite option to enable keepalive

Added by Dieter Roels about 1 year ago. Updated about 1 month ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: pacific
Reviewed:
Affected Versions:
Pull request ID:

Description

We have a multisite setup with firewalls between the two sites. The firewalls silently drop all connections that are idle for longer than one hour. /proc/sys/net/ipv4/tcp_keepalive_time is configured correctly, but it seems Ceph does not use keepalive at all, which results in frequent connection drops.

[adminuser@node1 ~]$ netstat -no | grep :443
tcp 0 0 10.10.11.11:443 10.10.12.13:54666 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:6838 10.10.11.12:44344 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:443 10.10.12.12:38604 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:54960 10.10.12.11:443 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:34454 10.10.12.13:443 ESTABLISHED off (0.00/0/0)
...

It would be nice to be able to enable keepalive for the beast frontend, so that at least the multisite replication uses keepalive on its TCP connections.
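
For context: the `off` timer column in the netstat output above suggests that SO_KEEPALIVE is not set on those sockets, and the kernel's tcp_keepalive_time only takes effect for sockets that opt in. The following is a minimal Python sketch of that opt-in on a plain socket. It is an illustration only, not Ceph code, and the helper name `keepalive_enabled` is hypothetical.

```python
import socket


def keepalive_enabled(sock: socket.socket) -> bool:
    """Return True if SO_KEEPALIVE is set on the socket (hypothetical helper)."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A freshly created socket does not opt in to keepalive by default.
before = keepalive_enabled(sock)

# Opting in is a per-socket setting; the sysctl only supplies the timer values.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
after = keepalive_enabled(sock)

sock.close()
```

Without the `setsockopt` call, the value in tcp_keepalive_time is never consulted for that connection, which would match the behaviour described above.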

History

#1 Updated by Dieter Roels about 1 year ago

I wonder if https://tracker.ceph.com/issues/47961 would fix my issue as well... If the connections are closed after a certain time the keepalive is not needed.

#2 Updated by Or Friedmann 6 months ago

  • Assignee set to Or Friedmann

#3 Updated by Or Friedmann 6 months ago

  • Pull request ID set to 41824

#4 Updated by Casey Bodley 3 months ago

  • Status changed from New to Need More Info

Hi Dieter, I saw your comment about this on https://github.com/ceph/ceph/pull/41824 requesting a frontend option for tcp keepalive.

Can you help me understand why this is an issue with multisite? Does sync have issues reestablishing these connections after they're dropped? I would expect sync to poll more than once per hour - but if it doesn't, then it seems entirely reasonable to drop the connection.

I also don't think tcp keepalive is a good idea for server sockets, because the setting applies to all of its connections. As a server, we want to reclaim the resources of idle client connections - and you pointed at the tracker issue for the beast request timeouts as a possible solution. Does that feature not resolve your issue?

If not, we might instead consider enabling CURLOPT_TCP_KEEPALIVE on the multisite client side.
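
At the socket level, a client-side-only approach like this amounts to setting keepalive on outbound connections and leaving the listening sockets alone. The sketch below shows the idea in Python under some assumptions: the helper name `enable_client_keepalive` and the timer values are hypothetical, the per-socket timer options are platform-specific (hence the `getattr` guard), and this is neither the libcurl nor the Ceph implementation.

```python
import socket

# One hour, the firewall idle timeout given in the description.
FIREWALL_IDLE_TIMEOUT_S = 3600


def enable_client_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Enable keepalive on a single outbound socket (hypothetical helper)."""
    # Probes only help if they start before the firewall drops the connection.
    if idle_s >= FIREWALL_IDLE_TIMEOUT_S:
        raise ValueError("idle time must be below the firewall idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timers override the system-wide defaults where supported.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_client_keepalive(client)
client_keepalive = client.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
client.close()
```

Because the option is applied per connection, server sockets keep their current behaviour and idle client connections to the frontend can still be reclaimed.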

#5 Updated by Dieter Roels 3 months ago

Hi Casey,

The problem we are having is with the firewalls that sit between our clusters in our multisite setup. The firewalls silently drop unused connections after one hour. A few months ago we overloaded the firewalls with big spikes in replication traffic, which impacted other production applications. While debugging that issue, we noticed that the rgws frequently try to use connections that the firewall has already dropped because of inactivity. This causes a lot of dropped packets and reconnects, which further complicated the firewall overload. If these connections used tcp keepalive, the firewall would not silently close them.

If there is an option to enable it only on connections from the multisite client, that would fix this problem with less impact.

#6 Updated by Casey Bodley about 2 months ago

  • Subject changed from beast frontend option to enable keepalive to multisite option to enable keepalive
  • Status changed from Need More Info to Fix Under Review
  • Pull request ID changed from 41824 to 43609

#7 Updated by Ken Dreyer about 1 month ago

  • Backport set to pacific
