Bug #13002

Accepter::bind won't work correctly in some exception cases

Added by xie xingguo over 3 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 09/09/2015
Due date:
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I have recently been testing Ceph as our backend storage system.
While checking the log of an OSD that occasionally goes down, I found the following:

2015-09-07 11:09:33.007255 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300 on any port in range 6800-7300: (99) Cannot assign requested address
2015-09-07 11:09:33.007280 7fba19fb1700 -1 accepter.accepter.bind was unable to bind. Trying again in 5 seconds
2015-09-07 11:09:34.428961 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:30.928096 (cutoff 2015-09-07 11:09:14.428960)
2015-09-07 11:09:38.007389 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300: (99) Cannot assign requested address
2015-09-07 11:09:38.007410 7fba19fb1700 -1 accepter.accepter.bind was unable to bind. Trying again in 5 seconds
2015-09-07 11:09:40.329310 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:34.428796 (cutoff 2015-09-07 11:09:20.329309)
2015-09-07 11:09:42.029559 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:40.329161 (cutoff 2015-09-07 11:09:22.029558)
2015-09-07 11:09:43.007510 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300: (99) Cannot assign requested address

It seems that in our system all ports in the configured range (ms_bind_port_min to ms_bind_port_max) were temporarily unavailable, so the first attempt, which tries the specific address 111.111.111.245 with every port in the range 6800-7300, failed. On each following attempt (every 5 seconds), the bind procedure sticks to that address and to the last port it tried (7300 here) and retries the single combination "111.111.111.245:7300" again and again, instead of searching [ms_bind_port_min, ms_bind_port_max] again as it did the first time. Unfortunately, by coincidence, that particular combination "111.111.111.245:7300" is permanently occupied by another OSD colocated with this one, so all the remaining attempts are futile.

When I checked the code of the bind procedure, things became clear:
int Accepter::bind(const entity_addr_t &bind_addr, const set<int>& avoid_ports)
...

{
  // try a range of ports
  for (int port = msgr->cct->_conf->ms_bind_port_min; port <= msgr->cct->_conf->ms_bind_port_max; port++) {
    if (avoid_ports.count(port))
      continue;

    listen_addr.set_port(port);
    rc = ::bind(listen_sd, (struct sockaddr *) &listen_addr.ss_addr(), listen_addr.addr_size());
    if (rc == 0)
      break;
  }
  if (rc < 0) {
    lderr(msgr->cct) << "accepter.bind unable to bind to " << listen_addr.ss_addr()
                     << " on any port in range " << msgr->cct->_conf->ms_bind_port_min
                     << "-" << msgr->cct->_conf->ms_bind_port_max
                     << ": " << cpp_strerror(errno)
                     << dendl;
    r = -errno;
    continue;
  }
  ldout(msgr->cct,10) << "accepter.bind bound on random port " << listen_addr << dendl;
}

That is, if we cannot bind the specific address to any port in the range [ms_bind_port_min, ms_bind_port_max], we should clear the port before retrying; otherwise the retry will probably fail again.

My proposed fix (for your information):

  // try a range of ports
  for (int port = msgr->cct->_conf->ms_bind_port_min; port <= msgr->cct->_conf->ms_bind_port_max; port++) {
    if (avoid_ports.count(port))
      continue;
    listen_addr.set_port(port);
    rc = ::bind(listen_sd, (struct sockaddr *) &listen_addr.ss_addr(), listen_addr.addr_size());
    if (rc == 0)
      break;
  }
  if (rc < 0) {
    lderr(msgr->cct) << "accepter.bind unable to bind to " << listen_addr.ss_addr()
                     << " on any port in range " << msgr->cct->_conf->ms_bind_port_min
                     << "-" << msgr->cct->_conf->ms_bind_port_max
                     << ": " << cpp_strerror(errno)
                     << dendl;
    r = -errno;
    listen_addr.set_port(0); // clear port before retry, otherwise we shall fail again.
    continue;
  }
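
The effect of clearing the port can be illustrated with a small stand-alone model of the retry loop. This is only a sketch under assumptions, not the Ceph code: the outer retry loop and the "specific port" branch are inferred from the log messages above, and simulate_bind, report_scenario and run_report_scenario are hypothetical names invented for this illustration. A callback stands in for ::bind() so that no real sockets are needed.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of the retry loop; none of these names come from Ceph.
// try_bind(attempt, port) stands in for ::bind() and returns true on success.
// Returns the port that was bound, or -1 if every attempt failed.
int simulate_bind(int port_min, int port_max, int retries,
                  bool reset_port_on_failure,
                  const std::function<bool(int, int)>& try_bind) {
  assert(port_min <= port_max);
  int listen_port = 0;  // 0 means "no specific port requested"
  for (int attempt = 0; attempt < retries; ++attempt) {
    if (listen_port != 0) {
      // A port is already set: only that single port is tried.
      if (try_bind(attempt, listen_port))
        return listen_port;
      continue;
    }
    // No port set: scan the whole range.
    for (int port = port_min; port <= port_max; ++port) {
      listen_port = port;  // mirrors listen_addr.set_port(port)
      if (try_bind(attempt, port))
        return port;
    }
    // The whole range failed. Without the reset, listen_port keeps the
    // last port tried (port_max), so later attempts never scan again.
    if (reset_port_on_failure)
      listen_port = 0;  // mirrors the proposed listen_addr.set_port(0)
  }
  return -1;
}

// Scenario from the report: nothing can be bound on the first attempt,
// and afterwards every port except 7300 is free.
inline bool report_scenario(int attempt, int port) {
  return attempt > 0 && port != 7300;
}

// Runs three attempts over the range 6800-7300 in that scenario.
int run_report_scenario(bool reset_port_on_failure) {
  return simulate_bind(6800, 7300, 3, reset_port_on_failure, report_scenario);
}
```

In this model, run_report_scenario(false) returns -1 because every retry is stuck on port 7300, while run_report_scenario(true) returns 6800 because the second attempt scans the range again from the start.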

Associated revisions

Revision 90f1d25c (diff)
Added by xie xingguo over 3 years ago

msg/simple: start over after fails to bind a port in specified range
Fixes: #13002
Signed-off-by:

History

#1 Updated by xie xingguo over 3 years ago

All we need is the following statement:

listen_addr.set_port(0); // clear port before retry, otherwise we shall fail again.

#2 Updated by Kefu Chai over 3 years ago

  • Status changed from New to Need Review
  • Assignee set to xie xingguo

#3 Updated by Kefu Chai over 3 years ago

  • Status changed from Need Review to Resolved
