Bug #57116
TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures
0%
Description
[ RUN ] TestMockMigrationHttpClient.OpenCloseHttp unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenCloseHttp (127 ms) [ RUN ] TestMockMigrationHttpClient.OpenCloseHttps [ OK ] TestMockMigrationHttpClient.OpenCloseHttps (131 ms) [ RUN ] TestMockMigrationHttpClient.OpenHttpsHandshakeFail unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenHttpsHandshakeFail (99 ms) [ RUN ] TestMockMigrationHttpClient.OpenInvalidUrl unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenInvalidUrl (109 ms) [ RUN ] TestMockMigrationHttpClient.OpenResolveFail unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenResolveFail (90 ms) [ RUN ] TestMockMigrationHttpClient.OpenConnectFail unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenConnectFail (87 ms) [ RUN ] TestMockMigrationHttpClient.IssueHead unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssueHead (83 ms) [ RUN ] TestMockMigrationHttpClient.IssueGet unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssueGet (89 ms) [ RUN ] TestMockMigrationHttpClient.IssueSendFailed unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssueSendFailed (82 ms) [ RUN ] TestMockMigrationHttpClient.IssueReceiveFailed unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssueReceiveFailed (78 ms) [ RUN ] TestMockMigrationHttpClient.IssueResetFailed unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssueResetFailed (78 ms) [ RUN ] TestMockMigrationHttpClient.IssuePipelined unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.IssuePipelined (76 ms) [ RUN ] TestMockMigrationHttpClient.IssuePipelinedRestart [ OK ] TestMockMigrationHttpClient.IssuePipelinedRestart (104 ms) [ RUN ] TestMockMigrationHttpClient.ShutdownInFlight [ OK ] TestMockMigrationHttpClient.ShutdownInFlight (105 ms) [ RUN ] TestMockMigrationHttpClient.GetSize [ OK ] TestMockMigrationHttpClient.GetSize (127 ms) [ RUN ] TestMockMigrationHttpClient.GetSizeError unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.GetSizeError (190 ms) [ RUN ] TestMockMigrationHttpClient.Read unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.Read (202 ms)
see https://jenkins.ceph.com/job/ceph-pull-requests/101149/consoleFull
i did some preliminary investigation. but still no clues. as we bind on port 0. and that means we let the os to pick a random usable port for us.
Ilya, could you take a look please?
Related issues
History
#1 Updated by Kefu Chai over 1 year ago
- Description updated (diff)
#2 Updated by Kefu Chai over 1 year ago
please note the failures always exhibit themselves as
The following tests FAILED: 33 - run-rbd-unit-tests-109.sh (Failed) 34 - run-rbd-unit-tests-127.sh (Failed)
#3 Updated by Kefu Chai over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 47855
#4 Updated by Kefu Chai over 1 year ago
- Assignee changed from Ilya Dryomov to Kefu Chai
#5 Updated by Christopher Hoffman over 1 year ago
- Duplicates Bug #57227: run-rbd-unit-tests failing on make check added
#6 Updated by Christopher Hoffman over 1 year ago
- Status changed from Fix Under Review to Duplicate
#7 Updated by Ilya Dryomov over 1 year ago
- Status changed from Duplicate to New
- Pull request ID deleted (
47855)
#8 Updated by Ilya Dryomov over 1 year ago
- Duplicates deleted (Bug #57227: run-rbd-unit-tests failing on make check)
#9 Updated by Ilya Dryomov over 1 year ago
- Duplicated by Bug #57227: run-rbd-unit-tests failing on make check added
#10 Updated by Ilya Dryomov over 1 year ago
Moved back to New as Kefu's attempted fix in https://github.com/ceph/ceph/pull/47855 didn't pan out and I don't think we are any closer to nailing the root cause of these transient failures.
#11 Updated by Kefu Chai over 1 year ago
- Assignee deleted (
Kefu Chai)
#12 Updated by Kefu Chai over 1 year ago
[ RUN ] HTTPManager.ReadTruncated unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in the test body. [ FAILED ] HTTPManager.ReadTruncated (26 ms) [ RUN ] HTTPManager.Head unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in the test body. [ FAILED ] HTTPManager.Head (9 ms)
unittest_http_manager suffers from this issue also. see https://jenkins.ceph.com/job/ceph-pull-requests/102806/testReport/junit/projectroot.src.test/rgw/unittest_http_manager/
#13 Updated by Ilya Dryomov over 1 year ago
- Subject changed from TestMockMigrationHttpClient.* fails to TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures
Yeah, it was pretty clear that it has nothing to do with RBD from the beginning.
#14 Updated by Ilya Dryomov over 1 year ago
- Tags set to test-failure
- Project changed from rbd to Ceph
#15 Updated by Kefu Chai over 1 year ago
Ilya Dryomov wrote:
Yeah, it was pretty clear that it has nothing to do with RBD from the beginning.
right. i agreed. just thought we needed a developer to take care of this. and the failure happens to be RBD tests. think the other way, if dashboard uses some third party python module, and it fails to work, shall the core team fix it?
#16 Updated by Radoslaw Zarzynski over 1 year ago
A few of our tests use ephemeral ports and the same time we run tests that specify SO_REUSEADDR
:
[rzarzynski@o06 build]$ grep -r REUSE ../src/test ../src/test/msgr/test_async_driver.cc: int r = ::setsockopt(listen_sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)); ../src/test/rgw/rgw_multi/tests_ps.py: self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1) ../src/test/rgw/bucket_notification/test_bn.py: self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
There are also another ones that are using the `boost::asio` interface:
[rzarzynski@o06 build]$ grep -r -A5 tcp::endpoint ../src/test ../src/test/librbd/migration/test_mock_HttpClient.cc: boost::asio::ip::tcp::endpoint( ../src/test/librbd/migration/test_mock_HttpClient.cc- boost::asio::ip::tcp::v4(), m_server_port), reuse); ../src/test/librbd/migration/test_mock_HttpClient.cc- m_server_port = m_acceptor->local_endpoint().port(); ../src/test/librbd/migration/test_mock_HttpClient.cc- } ../src/test/librbd/migration/test_mock_HttpClient.cc- ../src/test/librbd/migration/test_mock_HttpClient.cc- std::string get_local_url(UrlScheme url_scheme) { Binary file ../src/test/librbd/migration/.test_mock_HttpClient.cc.swp matches Binary file ../src/test/rgw/.test_http_manager.cc.swp matches -- ../src/test/rgw/test_http_manager.cc: tcp::endpoint endpoint(tcp::v4(), 0); ../src/test/rgw/test_http_manager.cc- boost::asio::io_context ioctx; ../src/test/rgw/test_http_manager.cc- tcp::acceptor acceptor(ioctx); ../src/test/rgw/test_http_manager.cc- acceptor.open(endpoint.protocol()); ../src/test/rgw/test_http_manager.cc- acceptor.bind(endpoint); ../src/test/rgw/test_http_manager.cc- acceptor.listen(); -- ../src/test/rgw/test_http_manager.cc: tcp::endpoint endpoint(tcp::v4(), 0); ../src/test/rgw/test_http_manager.cc- boost::asio::io_context ioctx; ../src/test/rgw/test_http_manager.cc- tcp::acceptor acceptor(ioctx); ../src/test/rgw/test_http_manager.cc- acceptor.open(endpoint.protocol()); ../src/test/rgw/test_http_manager.cc- acceptor.bind(endpoint); ../src/test/rgw/test_http_manager.cc- acceptor.listen();
TestMockMigrationHttpClient
and HTTPManager
use the Boost's acceptor
:
template <typename Protocol, typename Executor>
class basic_socket_acceptor
: public socket_base
{
public:
// ...
basic_socket_acceptor(const executor_type& ex,
const endpoint_type& endpoint, bool reuse_addr = true)
: impl_(0, ex)
{
boost::system::error_code ec;
const protocol_type protocol = endpoint.protocol();
impl_.get_service().open(impl_.get_implementation(), protocol, ec);
boost::asio::detail::throw_error(ec, "open");
if (reuse_addr)
{
impl_.get_service().set_option(impl_.get_implementation(),
socket_base::reuse_address(true), ec);
boost::asio::detail::throw_error(ec, "set_option");
}
impl_.get_service().bind(impl_.get_implementation(), endpoint, ec);
boost::asio::detail::throw_error(ec, "bind");
impl_.get_service().listen(impl_.get_implementation(),
socket_base::max_listen_connections, ec);
boost::asio::detail::throw_error(ec, "listen");
}
The point here is that `bind()` is decoupled from `listen()`. I bet those abstractions boil down into separated syscalls. If so, then it's possible that a race condition with anything that makes use of SO_REUSEADDR
arises.
(...) when an ephemeral port is allocated, SO_REUSEADDR enables the kernel to reuse any other non-listening ephemeral port.
The important point here is that the kernel doesn’t check whether there is an opened socket for an ephemeral port, it only checks whether there is a socket in the listening state for that port.
This means that the kernel is free to reuse an ephemeral port of any opened UDP socket (because listen is not used for datagram sockets) and any opened TCP socket for which listen was not called yet.
Note that when some application uses bind to allocate an ephemeral port for a TCP socket, and then immediately calls listen, there is still a short period of time when the socket is in the non-listening state.
Thus, to prevent the probability of stealing a port of a random running application, take care not to accidentally enable SO_REUSEADDR when using ephemeral ports, both for UDP and TCP sockets.
This following one can nicely stress the problem out:
TEST_P(EventDriverTest, NetworkSocketTest) {
int listen_sd = ::socket(AF_INET, SOCK_STREAM, 0);
ASSERT_TRUE(listen_sd > 0);
int on = 1;
int r = ::setsockopt(listen_sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
ASSERT_EQ(r, 0);
r = set_nonblock(listen_sd);
ASSERT_EQ(r, 0);
struct sockaddr_in sa;
long port = 0;
for (port = 38788; port < 40000; port++) {
memset(&sa,0,sizeof(sa));
sa.sin_family = AF_INET;
sa.sin_port = htons(port);
sa.sin_addr.s_addr = htonl(INADDR_ANY);
r = ::bind(listen_sd, (struct sockaddr *)&sa, sizeof(sa));
if (r == 0) {
break;
}
}
ASSERT_EQ(r, 0);
r = listen(listen_sd, 511);
#17 Updated by Ilya Dryomov over 1 year ago
The point here is that `bind()` is decoupled from `listen()`. I bet those abstractions boil down into separated syscalls. If so, then it's possible that a race condition with anything that makes use of SO_REUSEADDR arises.
The thing is, RBD's TestMockMigrationHttpClient goes back to pacific and RGW's HTTPManager all the way back to 2016. However, these tests started sporadically failing only very recently -- it feels like this coincided with the C++20 related upgrades all over the place, although I could be wrong. For a bit I suspected the Boost version bump in main (1.75 -> 1.79), but quincy also seems to be affected and I don't think this version bump was backported there. But then I recall that Jenkins nodes use a custom-built libboost package, so I could be wrong again.
right. i agreed. just thought we needed a developer to take care of this. and the failure happens to be RBD tests. think the other way, if dashboard uses some third party python module, and it fails to work, shall the core team fix it?
And that is the reason I moved it out of RBD -- it is much more widespread and I'm simply lacking most of the context pertaining to what changed/was upgraded/was backported in the last few weeks.
#18 Updated by Kefu Chai over 1 year ago
Thank you Radek for your analysis,
Radoslaw Zarzynski wrote:
A few of our tests use ephemeral ports and the same time we run tests that specify
SO_REUSEADDR
:
src/test/msgr/test_async_driver.cc
is not exercised by "make check", neither is src/test/rgw/rgw_multi/tests_ps.py
or src/test/rgw/bucket_notification/test_bn.py
. ceph_test_async_driver is tested by qa/suites/rados/singleton-nomsgr/all/msgr.yaml
.
but it seems rgw_asio_frontend.cc
does set SO_REUSEADDR
though:
$ git grep tcp::acceptor::reuse_address src/rgw/rgw_asio_frontend.cc: l.acceptor.set_option(tcp::acceptor::reuse_address(true));
#19 Updated by Kefu Chai over 1 year ago
based on the analysis, i am proposing https://github.com/ceph/ceph/pull/47962 in hope to alleviate the symptom. kinda tired of rolling the dice.
#20 Updated by Ilya Dryomov over 1 year ago
- Status changed from New to Fix Under Review
- Assignee set to Kefu Chai
- Pull request ID set to 47962
#21 Updated by Casey Bodley over 1 year ago
Ilya Dryomov wrote:
For a bit I suspected the Boost version bump in main (1.75 -> 1.79), but quincy also seems to be affected and I don't think this version bump was backported there. But then I recall that Jenkins nodes use a custom-built libboost package, so I could be wrong again.
according to https://github.com/ceph/ceph/blob/quincy/install-deps.sh#L168-L169, jenkins should still be using 1.75 packages on quincy
#22 Updated by Casey Bodley over 1 year ago
Ilya Dryomov wrote:
Moved back to New as Kefu's attempted fix in https://github.com/ceph/ceph/pull/47855 didn't pan out
this seems to have ruled out a misconfigured net.ipv4.ip_local_port_range
as the cause, so i can only guess that some process running on the jenkins nodes is opening a ton of sockets and leaving them in TIME_WAIT
- a netstat -ntp
there might show us which
#23 Updated by Kefu Chai over 1 year ago
- Status changed from Fix Under Review to Resolved
#24 Updated by Kefu Chai over 1 year ago
- Status changed from Resolved to New
#25 Updated by Kefu Chai over 1 year ago
- Assignee deleted (
Kefu Chai) - Pull request ID deleted (
47962)
#26 Updated by Kefu Chai over 1 year ago
[ RUN ] TestMockMigrationHttpClient.OpenCloseHttp unknown file: Failure C++ exception with description "bind: Address already in use" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenCloseHttp (9285 ms) [ RUN ] TestMockMigrationHttpClient.OpenCloseHttps unknown file: Failure C++ exception with description "bind: Address already in use" thrown in SetUp(). [ FAILED ] TestMockMigrationHttpClient.OpenCloseHttps (9271 ms)
two failures in a row spotted in pacific backport.
https://jenkins.ceph.com/job/ceph-pull-requests/103266/consoleFull
#27 Updated by Kefu Chai over 1 year ago
Casey Bodley wrote:
this seems to have ruled out a misconfigured
net.ipv4.ip_local_port_range
as the cause, so i can only guess that some process running on the jenkins nodes is opening a ton of sockets and leaving them inTIME_WAIT
- anetstat -ntp
there might show us which
very likely. as the wait-and-retry approach actually failed also. it might help a little bit though.
#28 Updated by Ilya Dryomov over 1 year ago
Yeah, I think it helps -- the frequency is definitely reduced. Here is one for the RGW case (all RBD tests passed on that particular run):
[ RUN ] HTTPManager.ReadTruncated unknown file: Failure C++ exception with description "bind: Address already in use [system:98]" thrown in the test body. [ FAILED ] HTTPManager.ReadTruncated (9134 ms)
We could increase the number of retries or the duration of the sleep of course, but it would be good to try to catch it in action on one of the Jenkins nodes. Something changed around the end of July and we still don't know what it is.
#29 Updated by Laura Flores over 1 year ago
- Backport set to quincy,pacific
#30 Updated by Laura Flores over 1 year ago
- Backport deleted (
quincy,pacific)
Never mind, this is environmental on the Jenkins nodes, so we don't need to backport.
#31 Updated by Laura Flores over 1 year ago
Ilya Dryomov wrote:
Affected Jenkins nodes I've seen include:We could increase the number of retries or the duration of the sleep of course, but it would be good to try to catch it in action on one of the Jenkins nodes. Something changed around the end of July and we still don't know what it is.
- braggi10: https://jenkins.ceph.com/job/ceph-pull-requests/103237/, https://jenkins.ceph.com/job/ceph-pull-requests/103237/
- braggi12: https://jenkins.ceph.com/job/ceph-pull-requests/103266/
I've also been looking into https://en.cppreference.com/w/cpp/20 for any clues on what might have changed with C++20.
#32 Updated by Laura Flores over 1 year ago
What do we think about temporarily disabling run-rbd-unit-tests while we work on a solution?
#33 Updated by Ilya Dryomov over 1 year ago
Laura Flores wrote:
I've also been looking into https://en.cppreference.com/w/cpp/20 for any clues on what might have changed with C++20.
I don't think it's related to C++20 per se. A whole bunch of things changed/were upgraded together with that -- Ubuntu Jammy, Boost 1.79, etc.
What do we think about temporarily disabling run-rbd-unit-tests while we work on a solution?
I'd rather not because there is nothing more permanent than something temporary ;) Also, it's not just RBD but also a couple of tests in RGW.
That said, we can definitely make the workaround cruder. I opened https://github.com/ceph/ceph/pull/48014 to increase the cumulative duration of the sleep to 1 minute.
#34 Updated by Mark Kogan over 1 year ago
this came up in today's bug scrub, myb:
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
ref: https://sysctl-explorer.net/net/ipv4/tcp_tw_reuse/ -- quote:
Allow to reuse TIME-WAIT sockets for new connections when it is safe from protocol viewpoint. Default value is 0. It should not be changed without advice/request of technical experts.
#35 Updated by Laura Flores over 1 year ago
Ilya Dryomov wrote:
I'd rather not because there is nothing more permanent than something temporary ;) Also, it's not just RBD but also a couple of tests in RGW.
That said, we can definitely make the workaround cruder. I opened https://github.com/ceph/ceph/pull/48014 to increase the cumulative duration of the sleep to 1 minute.
I understand. I'm glad we now have some solution, even if it is cruder! From what I understand, your fix will reduce, but not eliminate the failures, In that case, I'd want to ensure that the issue is somehow tracked, either in this Tracker or in a new one.
#36 Updated by Laura Flores over 1 year ago
- Status changed from New to Fix Under Review
- Pull request ID set to 48014
#37 Updated by Ilya Dryomov over 1 year ago
- Backport set to pacific,quincy
Laura Flores wrote:
From what I understand, your fix will reduce, but not eliminate the failures, In that case, I'd want to ensure that the issue is somehow tracked, either in this Tracker or in a new one.
It is really Kefu's fix^Wworkaround, I'm just massaging it. Since you chose to link it as a PR here and therefore to stop treating this as an environmental issue, I'm re-populating Backport field.
#38 Updated by Ilya Dryomov over 1 year ago
- Assignee set to Ilya Dryomov
#39 Updated by Ilya Dryomov over 1 year ago
- Status changed from Fix Under Review to Pending Backport
#40 Updated by Backport Bot over 1 year ago
- Copied to Backport #57478: quincy: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added
#41 Updated by Backport Bot over 1 year ago
- Copied to Backport #57479: pacific: TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added
#42 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed
#43 Updated by Ilya Dryomov over 1 year ago
- Status changed from Pending Backport to Resolved
#44 Updated by Kefu Chai over 1 year ago
- File timed_wait.log.bz2 added
confirmed, it is TIME_WAIT. see
$ grep TIME_WAIT timed_wait.log | wc -l 28943
i am reviving https://github.com/ceph/ceph/pull/48018
#45 Updated by Laura Flores over 1 year ago
- Copied to Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added
#46 Updated by Laura Flores over 1 year ago
- Related to Bug #57491: Remaining issues involving TestMockMigrationHttpClient and HTTPManager "bind: Address already in use [system:98]" failures added
#47 Updated by Laura Flores over 1 year ago
Created https://tracker.ceph.com/issues/57491 to document Kefu's extra fix.