Feature #48402: multisite option to enable keepalive - rgw - Ceph

Feature #48402

closed

multisite option to enable keepalive

Added by Dieter Roels over 3 years ago. Updated 5 days ago.

Status:

Resolved

Priority:

Normal

Assignee:

Or Friedmann

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

pacific

Reviewed:

Affected Versions:

Pull request ID:

43609

Description

We have a multisite setup with firewalls between the two sites. The firewalls silently drop all connections that are idle for longer than one hour. /proc/sys/net/ipv4/tcp_keepalive_time is configured correctly, but Ceph does not seem to enable keepalive on its sockets at all, so connections are dropped frequently.

[adminuser@node1 ~]$ netstat -no | grep :443
tcp 0 0 10.10.11.11:443 10.10.12.13:54666 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:6838 10.10.11.12:44344 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:443 10.10.12.12:38604 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:54960 10.10.12.11:443 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.10.11.11:34454 10.10.12.13:443 ESTABLISHED off (0.00/0/0)
...

It would be nice to be able to enable keepalive for the beast frontend, so that at least multisite replication uses keepalive on its TCP connections.
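
The `off (0.00/0/0)` timer column in the netstat output above indicates that no keepalive timer is armed on those sockets. On Linux, `tcp_keepalive_time` only supplies the default idle time for sockets that have opted in with `SO_KEEPALIVE`; it does not switch keepalive on by itself. A minimal Python sketch of that per-socket opt-in (illustrative only, not RGW code; the helper name `keepalive_enabled` is made up for this example):

```python
import socket


def keepalive_enabled(sock: socket.socket) -> bool:
    """Return True if SO_KEEPALIVE is set on this socket (hypothetical helper)."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# A freshly created socket has keepalive off, whatever the sysctl says.
default_state = keepalive_enabled(sock)

# The application has to ask for it explicitly.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
enabled_state = keepalive_enabled(sock)

sock.close()
```

Once a connected socket has the option set, `netstat -no` should report a `keepalive (...)` timer for it instead of `off`.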

Updated by Dieter Roels over 3 years ago

I wonder if https://tracker.ceph.com/issues/47961 would fix my issue as well... If the connections are closed after a certain time the keepalive is not needed.

Updated by Or Friedmann almost 3 years ago

Assignee set to Or Friedmann

Updated by Or Friedmann almost 3 years ago

Pull request ID set to 41824

Updated by Casey Bodley over 2 years ago

Status changed from New to Need More Info

Hi Dieter, I saw your comment about this on https://github.com/ceph/ceph/pull/41824 requesting a frontend option for tcp keepalive.

Can you help me understand why this is an issue with multisite? Does sync have issues reestablishing these connections after they're dropped? I would expect sync to poll more than once per hour - but if it doesn't, then it seems entirely reasonable to drop the connection.

I also don't think tcp keepalive is a good idea for server sockets, because the setting applies to all of its connections. As a server, we want to reclaim the resources of idle client connections - and you pointed at the tracker issue for the beast request timeouts as a possible solution. Does that feature not resolve your issue?

If not, we might instead consider enabling CURLOPT_TCP_KEEPALIVE on the multisite client side.
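
As a rough illustration of the client-side approach, the sketch below enables keepalive only on the connecting socket, using plain Python sockets over loopback rather than libcurl. With libcurl the corresponding options would be `CURLOPT_TCP_KEEPALIVE`, `CURLOPT_TCP_KEEPIDLE` and `CURLOPT_TCP_KEEPINTVL`. The helper name and the idle, interval and probe values are assumptions chosen for the example; this is not how RGW implements it.

```python
import socket

FIREWALL_IDLE_TIMEOUT = 3600  # seconds; the one-hour firewall limit from the report


def enable_client_keepalive(sock, idle=600, interval=60, probes=5):
    """Turn on keepalive for one outgoing socket (hypothetical helper).

    idle/interval/probes are example values; idle must stay well below
    FIREWALL_IDLE_TIMEOUT so a probe is sent before the firewall forgets
    the connection.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the sysctl defaults; not every platform has them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


# Stand-in for the remote frontend: a listener that does NOT enable keepalive.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# Stand-in for the multisite client: keepalive is enabled on this socket only.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_client_keepalive(client)
client.connect(listener.getsockname())
server_side, _ = listener.accept()

client_keepalive = client.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
server_keepalive = server_side.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0

for s in (client, server_side, listener):
    s.close()
```

Only the client end carries the option here, so the server stays free to time out its other idle connections, while the client's probes keep its own connection from looking idle to a firewall in between.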

Updated by Dieter Roels over 2 years ago

Hi Casey,

The problem we are having is with the firewalls that sit between the clusters in our multisite setup. The firewalls silently drop unused connections after one hour. A few months ago we overloaded the firewalls with big spikes in replication traffic, which impacted other production applications. While debugging that issue, we noticed that the rgws frequently start using connections that the firewall has already dropped for inactivity. This causes a lot of dropped packets and reconnects, which further aggravates the firewall overload. If these connections used tcp keepalive, the firewall would not silently close them.

If there is an option to enable it only on connections from the multisite client, that would fix this problem with less impact.

Updated by Casey Bodley over 2 years ago

Subject changed from beast frontend option to enable keepalive to multisite option to enable keepalive
Status changed from Need More Info to Fix Under Review
Pull request ID changed from 41824 to 43609