Project

General

Profile

Bug #57116

TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures

Added by Kefu Chai over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[ RUN      ] TestMockMigrationHttpClient.OpenCloseHttp
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenCloseHttp (127 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenCloseHttps
[       OK ] TestMockMigrationHttpClient.OpenCloseHttps (131 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenHttpsHandshakeFail
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenHttpsHandshakeFail (99 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenInvalidUrl
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenInvalidUrl (109 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenResolveFail
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenResolveFail (90 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenConnectFail
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenConnectFail (87 ms)
[ RUN      ] TestMockMigrationHttpClient.IssueHead
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssueHead (83 ms)
[ RUN      ] TestMockMigrationHttpClient.IssueGet
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssueGet (89 ms)
[ RUN      ] TestMockMigrationHttpClient.IssueSendFailed
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssueSendFailed (82 ms)
[ RUN      ] TestMockMigrationHttpClient.IssueReceiveFailed
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssueReceiveFailed (78 ms)
[ RUN      ] TestMockMigrationHttpClient.IssueResetFailed
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssueResetFailed (78 ms)
[ RUN      ] TestMockMigrationHttpClient.IssuePipelined
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.IssuePipelined (76 ms)
[ RUN      ] TestMockMigrationHttpClient.IssuePipelinedRestart
[       OK ] TestMockMigrationHttpClient.IssuePipelinedRestart (104 ms)
[ RUN      ] TestMockMigrationHttpClient.ShutdownInFlight
[       OK ] TestMockMigrationHttpClient.ShutdownInFlight (105 ms)
[ RUN      ] TestMockMigrationHttpClient.GetSize
[       OK ] TestMockMigrationHttpClient.GetSize (127 ms)
[ RUN      ] TestMockMigrationHttpClient.GetSizeError
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.GetSizeError (190 ms)
[ RUN      ] TestMockMigrationHttpClient.Read
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.Read (202 ms)

see https://jenkins.ceph.com/job/ceph-pull-requests/101149/consoleFull

i did some preliminary investigation. but still no clues. as we bind on port 0. and that means we let the os to pick a random usable port for us.

Ilya, could you take a look please?

timed_wait.log.bz2 - output of netstat (73.8 KB) Kefu Chai, 09/09/2022 05:39 PM


Related issues

Related to Ceph - Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures New
Duplicated by rbd - Bug #57227: run-rbd-unit-tests failing on make check Duplicate
Copied to Ceph - Backport #57478: quincy: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures Resolved
Copied to Ceph - Backport #57479: pacific: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures Resolved
Copied to Ceph - Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures New

History

#1 Updated by Kefu Chai over 1 year ago

  • Description updated (diff)

#2 Updated by Kefu Chai over 1 year ago

please note the failures always exhibit themselves as

The following tests FAILED:
     33 - run-rbd-unit-tests-109.sh (Failed)
     34 - run-rbd-unit-tests-127.sh (Failed)

#3 Updated by Kefu Chai over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47855

#4 Updated by Kefu Chai over 1 year ago

  • Assignee changed from Ilya Dryomov to Kefu Chai

#5 Updated by Christopher Hoffman over 1 year ago

  • Duplicates Bug #57227: run-rbd-unit-tests failing on make check added

#6 Updated by Christopher Hoffman over 1 year ago

  • Status changed from Fix Under Review to Duplicate

#7 Updated by Ilya Dryomov over 1 year ago

  • Status changed from Duplicate to New
  • Pull request ID deleted (47855)

#8 Updated by Ilya Dryomov over 1 year ago

  • Duplicates deleted (Bug #57227: run-rbd-unit-tests failing on make check)

#9 Updated by Ilya Dryomov over 1 year ago

  • Duplicated by Bug #57227: run-rbd-unit-tests failing on make check added

#10 Updated by Ilya Dryomov over 1 year ago

Moved back to New as Kefu's attempted fix in https://github.com/ceph/ceph/pull/47855 didn't pan out and I don't think we are any closer to nailing the root cause of these transient failures.

#11 Updated by Kefu Chai over 1 year ago

  • Assignee deleted (Kefu Chai)

#12 Updated by Kefu Chai over 1 year ago

[ RUN      ] HTTPManager.ReadTruncated
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in the test body.
[  FAILED  ] HTTPManager.ReadTruncated (26 ms)
[ RUN      ] HTTPManager.Head
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in the test body.
[  FAILED  ] HTTPManager.Head (9 ms)

unittest_http_manager suffers from this issue also. see https://jenkins.ceph.com/job/ceph-pull-requests/102806/testReport/junit/projectroot.src.test/rgw/unittest_http_manager/

#13 Updated by Ilya Dryomov over 1 year ago

  • Subject changed from TestMockMigrationHttpClient.* fails to TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures

Yeah, it was pretty clear that it has nothing to do with RBD from the beginning.

#14 Updated by Ilya Dryomov over 1 year ago

  • Tags set to test-failure
  • Project changed from rbd to Ceph

#15 Updated by Kefu Chai over 1 year ago

Ilya Dryomov wrote:

Yeah, it was pretty clear that it has nothing to do with RBD from the beginning.

right. i agreed. just thought we needed a developer to take care of this. and the failure happens to be RBD tests. think the other way, if dashboard uses some third party python module, and it fails to work, shall the core team fix it?

#16 Updated by Radoslaw Zarzynski over 1 year ago

A few of our tests use ephemeral ports and the same time we run tests that specify SO_REUSEADDR:

[rzarzynski@o06 build]$ grep -r REUSE ../src/test
../src/test/msgr/test_async_driver.cc:  int r = ::setsockopt(listen_sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
../src/test/rgw/rgw_multi/tests_ps.py:        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
../src/test/rgw/bucket_notification/test_bn.py:        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

There are also another ones that are using the `boost::asio` interface:

[rzarzynski@o06 build]$ grep -r -A5 tcp::endpoint ../src/test
../src/test/librbd/migration/test_mock_HttpClient.cc:                       boost::asio::ip::tcp::endpoint(
../src/test/librbd/migration/test_mock_HttpClient.cc-                         boost::asio::ip::tcp::v4(), m_server_port), reuse);
../src/test/librbd/migration/test_mock_HttpClient.cc-    m_server_port = m_acceptor->local_endpoint().port();
../src/test/librbd/migration/test_mock_HttpClient.cc-  }
../src/test/librbd/migration/test_mock_HttpClient.cc-
../src/test/librbd/migration/test_mock_HttpClient.cc-  std::string get_local_url(UrlScheme url_scheme) {
Binary file ../src/test/librbd/migration/.test_mock_HttpClient.cc.swp matches
Binary file ../src/test/rgw/.test_http_manager.cc.swp matches
--
../src/test/rgw/test_http_manager.cc:  tcp::endpoint endpoint(tcp::v4(), 0);
../src/test/rgw/test_http_manager.cc-  boost::asio::io_context ioctx;
../src/test/rgw/test_http_manager.cc-  tcp::acceptor acceptor(ioctx);
../src/test/rgw/test_http_manager.cc-  acceptor.open(endpoint.protocol());
../src/test/rgw/test_http_manager.cc-  acceptor.bind(endpoint);
../src/test/rgw/test_http_manager.cc-  acceptor.listen();
--
../src/test/rgw/test_http_manager.cc:  tcp::endpoint endpoint(tcp::v4(), 0);
../src/test/rgw/test_http_manager.cc-  boost::asio::io_context ioctx;
../src/test/rgw/test_http_manager.cc-  tcp::acceptor acceptor(ioctx);
../src/test/rgw/test_http_manager.cc-  acceptor.open(endpoint.protocol());
../src/test/rgw/test_http_manager.cc-  acceptor.bind(endpoint);
../src/test/rgw/test_http_manager.cc-  acceptor.listen();

TestMockMigrationHttpClient and HTTPManager use the Boost's acceptor:

template <typename Protocol, typename Executor>
class basic_socket_acceptor
  : public socket_base
{
public:
  // ...
  basic_socket_acceptor(const executor_type& ex,
      const endpoint_type& endpoint, bool reuse_addr = true)
    : impl_(0, ex)
  {
    boost::system::error_code ec;
    const protocol_type protocol = endpoint.protocol();
    impl_.get_service().open(impl_.get_implementation(), protocol, ec);
    boost::asio::detail::throw_error(ec, "open");
    if (reuse_addr)
    {
      impl_.get_service().set_option(impl_.get_implementation(),
          socket_base::reuse_address(true), ec);
      boost::asio::detail::throw_error(ec, "set_option");
    }
    impl_.get_service().bind(impl_.get_implementation(), endpoint, ec);
    boost::asio::detail::throw_error(ec, "bind");
    impl_.get_service().listen(impl_.get_implementation(),
        socket_base::max_listen_connections, ec);
    boost::asio::detail::throw_error(ec, "listen");
  }

The point here is that `bind()` is decoupled from `listen()`. I bet those abstractions boil down into separated syscalls. If so, then it's possible that a race condition with anything that makes use of SO_REUSEADDR arises.

(...) when an ephemeral port is allocated, SO_REUSEADDR enables the kernel to reuse any other non-listening ephemeral port.

The important point here is that the kernel doesn’t check whether there is an opened socket for an ephemeral port, it only checks whether there is a socket in the listening state for that port.

This means that the kernel is free to reuse an ephemeral port of any opened UDP socket (because listen is not used for datagram sockets) and any opened TCP socket for which listen was not called yet.

Note that when some application uses bind to allocate an ephemeral port for a TCP socket, and then immediately calls listen, there is still a short period of time when the socket is in the non-listening state.

Thus, to prevent the probability of stealing a port of a random running application, take care not to accidentally enable SO_REUSEADDR when using ephemeral ports, both for UDP and TCP sockets.

[See: https://gavv.net/articles/ephemeral-port-reuse/]

This following one can nicely stress the problem out:

TEST_P(EventDriverTest, NetworkSocketTest) {
  int listen_sd = ::socket(AF_INET, SOCK_STREAM, 0);
  ASSERT_TRUE(listen_sd > 0);
  int on = 1;
  int r = ::setsockopt(listen_sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
  ASSERT_EQ(r, 0);
  r = set_nonblock(listen_sd);
  ASSERT_EQ(r, 0);
  struct sockaddr_in sa;
  long port = 0;
  for (port = 38788; port < 40000; port++) {
    memset(&sa,0,sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    sa.sin_addr.s_addr = htonl(INADDR_ANY);

    r = ::bind(listen_sd, (struct sockaddr *)&sa, sizeof(sa));
    if (r == 0) {
      break;
    }
  }
  ASSERT_EQ(r, 0);
  r = listen(listen_sd, 511);

#17 Updated by Ilya Dryomov over 1 year ago

The point here is that `bind()` is decoupled from `listen()`. I bet those abstractions boil down into separated syscalls. If so, then it's possible that a race condition with anything that makes use of SO_REUSEADDR arises.

The thing is, RBD's TestMockMigrationHttpClient goes back to pacific and RGW's HTTPManager all the way back to 2016. However, these tests started sporadically failing only very recently -- it feels like this coincided with the C++20 related upgrades all over the place, although I could be wrong. For a bit I suspected the Boost version bump in main (1.75 -> 1.79), but quincy also seems to be affected and I don't think this version bump was backported there. But then I recall that Jenkins nodes use a custom-built libboost package, so I could be wrong again.

right. i agreed. just thought we needed a developer to take care of this. and the failure happens to be RBD tests. think the other way, if dashboard uses some third party python module, and it fails to work, shall the core team fix it?

And that is the reason I moved it out of RBD -- it is much more widespread and I'm simply lacking most of the context pertaining to what changed/was upgraded/was backported in the last few weeks.

#18 Updated by Kefu Chai over 1 year ago

Thank you Radek for your analysis,

Radoslaw Zarzynski wrote:

A few of our tests use ephemeral ports and the same time we run tests that specify SO_REUSEADDR:

src/test/msgr/test_async_driver.cc is not exercised by "make check", neither is src/test/rgw/rgw_multi/tests_ps.py or src/test/rgw/bucket_notification/test_bn.py. ceph_test_async_driver is tested by qa/suites/rados/singleton-nomsgr/all/msgr.yaml.

but it seems rgw_asio_frontend.cc does set SO_REUSEADDR though:

$ git grep tcp::acceptor::reuse_address
src/rgw/rgw_asio_frontend.cc:    l.acceptor.set_option(tcp::acceptor::reuse_address(true));

#19 Updated by Kefu Chai over 1 year ago

based on the analysis, i am proposing https://github.com/ceph/ceph/pull/47962 in hope to alleviate the symptom. kinda tired of rolling the dice.

#20 Updated by Ilya Dryomov over 1 year ago

  • Status changed from New to Fix Under Review
  • Assignee set to Kefu Chai
  • Pull request ID set to 47962

#21 Updated by Casey Bodley over 1 year ago

Ilya Dryomov wrote:

For a bit I suspected the Boost version bump in main (1.75 -> 1.79), but quincy also seems to be affected and I don't think this version bump was backported there. But then I recall that Jenkins nodes use a custom-built libboost package, so I could be wrong again.

according to https://github.com/ceph/ceph/blob/quincy/install-deps.sh#L168-L169, jenkins should still be using 1.75 packages on quincy

#22 Updated by Casey Bodley over 1 year ago

Ilya Dryomov wrote:

Moved back to New as Kefu's attempted fix in https://github.com/ceph/ceph/pull/47855 didn't pan out

this seems to have ruled out a misconfigured net.ipv4.ip_local_port_range as the cause, so i can only guess that some process running on the jenkins nodes is opening a ton of sockets and leaving them in TIME_WAIT - a netstat -ntp there might show us which

#23 Updated by Kefu Chai over 1 year ago

  • Status changed from Fix Under Review to Resolved

#24 Updated by Kefu Chai over 1 year ago

  • Status changed from Resolved to New

#25 Updated by Kefu Chai over 1 year ago

  • Assignee deleted (Kefu Chai)
  • Pull request ID deleted (47962)

#26 Updated by Kefu Chai over 1 year ago

[ RUN      ] TestMockMigrationHttpClient.OpenCloseHttp
unknown file: Failure
C++ exception with description "bind: Address already in use" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenCloseHttp (9285 ms)
[ RUN      ] TestMockMigrationHttpClient.OpenCloseHttps
unknown file: Failure
C++ exception with description "bind: Address already in use" thrown in SetUp().
[  FAILED  ] TestMockMigrationHttpClient.OpenCloseHttps (9271 ms)

two failures in a row spotted in pacific backport.

https://jenkins.ceph.com/job/ceph-pull-requests/103266/consoleFull

#27 Updated by Kefu Chai over 1 year ago

Casey Bodley wrote:

this seems to have ruled out a misconfigured net.ipv4.ip_local_port_range as the cause, so i can only guess that some process running on the jenkins nodes is opening a ton of sockets and leaving them in TIME_WAIT - a netstat -ntp there might show us which

very likely. as the wait-and-retry approach actually failed also. it might help a little bit though.

#28 Updated by Ilya Dryomov over 1 year ago

Yeah, I think it helps -- the frequency is definitely reduced. Here is one for the RGW case (all RBD tests passed on that particular run):

[ RUN      ] HTTPManager.ReadTruncated
unknown file: Failure
C++ exception with description "bind: Address already in use [system:98]" thrown in the test body.
[  FAILED  ] HTTPManager.ReadTruncated (9134 ms)

We could increase the number of retries or the duration of the sleep of course, but it would be good to try to catch it in action on one of the Jenkins nodes. Something changed around the end of July and we still don't know what it is.

#29 Updated by Laura Flores over 1 year ago

  • Backport set to quincy,pacific

#30 Updated by Laura Flores over 1 year ago

  • Backport deleted (quincy,pacific)

Never mind, this is environmental on the Jenkins nodes, so we don't need to backport.

#31 Updated by Laura Flores over 1 year ago

Ilya Dryomov wrote:

We could increase the number of retries or the duration of the sleep of course, but it would be good to try to catch it in action on one of the Jenkins nodes. Something changed around the end of July and we still don't know what it is.

Affected Jenkins nodes I've seen include: Jenkins nodes that are passing:

I've also been looking into https://en.cppreference.com/w/cpp/20 for any clues on what might have changed with C++20.

#32 Updated by Laura Flores over 1 year ago

What do we think about temporarily disabling run-rbd-unit-tests while we work on a solution?

#33 Updated by Ilya Dryomov over 1 year ago

Laura Flores wrote:

I've also been looking into https://en.cppreference.com/w/cpp/20 for any clues on what might have changed with C++20.

I don't think it's related to C++20 per se. A whole bunch of things changed/were upgraded together with that -- Ubuntu Jammy, Boost 1.79, etc.

What do we think about temporarily disabling run-rbd-unit-tests while we work on a solution?

I'd rather not because there is nothing more permanent than something temporary ;) Also, it's not just RBD but also a couple of tests in RGW.

That said, we can definitely make the workaround cruder. I opened https://github.com/ceph/ceph/pull/48014 to increase the cumulative duration of the sleep to 1 minute.

#34 Updated by Mark Kogan over 1 year ago

this came up in today's bug scrub, myb:

sudo sysctl -w net.ipv4.tcp_tw_reuse=1

ref: https://sysctl-explorer.net/net/ipv4/tcp_tw_reuse/ -- quote:

Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.

#35 Updated by Laura Flores over 1 year ago

Ilya Dryomov wrote:

I'd rather not because there is nothing more permanent than something temporary ;) Also, it's not just RBD but also a couple of tests in RGW.

That said, we can definitely make the workaround cruder. I opened https://github.com/ceph/ceph/pull/48014 to increase the cumulative duration of the sleep to 1 minute.

I understand. I'm glad we now have some solution, even if it is cruder! From what I understand, your fix will reduce, but not eliminate the failures, In that case, I'd want to ensure that the issue is somehow tracked, either in this Tracker or in a new one.

#36 Updated by Laura Flores over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 48014

#37 Updated by Ilya Dryomov over 1 year ago

  • Backport set to pacific,quincy

Laura Flores wrote:

From what I understand, your fix will reduce, but not eliminate the failures, In that case, I'd want to ensure that the issue is somehow tracked, either in this Tracker or in a new one.

It is really Kefu's fix^Wworkaround, I'm just massaging it. Since you chose to link it as a PR here and therefore to stop treating this as an environmental issue, I'm re-populating Backport field.

#38 Updated by Ilya Dryomov over 1 year ago

  • Assignee set to Ilya Dryomov

#39 Updated by Ilya Dryomov over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#40 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57478: quincy: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added

#41 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57479: pacific: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added

#42 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#43 Updated by Ilya Dryomov over 1 year ago

  • Status changed from Pending Backport to Resolved

#45 Updated by Laura Flores over 1 year ago

  • Copied to Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added

#46 Updated by Laura Flores over 1 year ago

  • Related to Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added

#47 Updated by Laura Flores over 1 year ago

Created https://tracker.ceph.com/issues/57491 to document Kefu's extra fix.

Also available in: Atom PDF