Bug #13002
Accepter::bind won't work correctly in some exception cases
Description
I have recently been testing Ceph as our backend storage system.
While checking the log of an OSD that occasionally goes down, I found the following:
2015-09-07 11:09:33.007255 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300 on any port in range 6800-7300: (99) Cannot assign requested address
2015-09-07 11:09:33.007280 7fba19fb1700 -1 accepter.accepter.bind was unable to bind. Trying again in 5 seconds
2015-09-07 11:09:34.428961 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:30.928096 (cutoff 2015-09-07 11:09:14.428960)
2015-09-07 11:09:38.007389 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300: (99) Cannot assign requested address
2015-09-07 11:09:38.007410 7fba19fb1700 -1 accepter.accepter.bind was unable to bind. Trying again in 5 seconds
2015-09-07 11:09:40.329310 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:34.428796 (cutoff 2015-09-07 11:09:20.329309)
2015-09-07 11:09:42.029559 7fba08f8f700 -1 osd.10 183299 heartbeat_check: no reply from osd.8 since back 2015-09-07 11:09:00.821019 front 2015-09-07 11:09:40.329161 (cutoff 2015-09-07 11:09:22.029558)
2015-09-07 11:09:43.007510 7fba19fb1700 -1 accepter.accepter.bind unable to bind to 111.111.111.245:7300: (99) Cannot assign requested address
It seems that in our system every port in the configured range (ms_bind_port_min to ms_bind_port_max) was temporarily unavailable, so the first attempt to bind the address 111.111.111.245 to a port in the range 6800-7300 failed. On each following attempt (every 5 seconds), the bind procedure sticks to that address and the last port it tried (7300 here) and retries the combination "111.111.111.245:7300" again and again, instead of picking a port from [ms_bind_port_min, ms_bind_port_max] as it did the first time. Unfortunately, "111.111.111.245:7300" happens to be permanently occupied by another OSD colocated with this one, so all the remaining attempts are futile.
When I checked the code of the bind procedure, things became clear:
int Accepter::bind(const entity_addr_t &bind_addr, const set<int>& avoid_ports)
...
{
    // try a range of ports
    for (int port = msgr->cct->_conf->ms_bind_port_min; port <= msgr->cct->_conf->ms_bind_port_max; port++) {
        if (avoid_ports.count(port))
            continue;
        listen_addr.set_port(port);
        rc = ::bind(listen_sd, (struct sockaddr *) &listen_addr.ss_addr(), listen_addr.addr_size());
        if (rc == 0)
            break;
    }
    if (rc < 0) {
        lderr(msgr->cct) << "accepter.bind unable to bind to " << listen_addr.ss_addr()
                         << " on any port in range " << msgr->cct->_conf->ms_bind_port_min
                         << "-" << msgr->cct->_conf->ms_bind_port_max
                         << ": " << cpp_strerror(errno)
                         << dendl;
        r = -errno;
        // listen_addr still holds the last port tried, so the next retry reuses it
        continue;
    }
    ldout(msgr->cct,10) << "accepter.bind bound on random port " << listen_addr << dendl;
}
That is, if we cannot bind the address to any port in the range [ms_bind_port_min, ms_bind_port_max], we should clear the port before retrying; otherwise the retry will most likely fail again.
My proposed fix (for your information):
// try a range of ports
for (int port = msgr->cct->_conf->ms_bind_port_min; port <= msgr->cct->_conf->ms_bind_port_max; port++) {
    if (avoid_ports.count(port))
        continue;
    listen_addr.set_port(port);
    rc = ::bind(listen_sd, (struct sockaddr *) &listen_addr.ss_addr(), listen_addr.addr_size());
    if (rc == 0)
        break;
}
if (rc < 0) {
    lderr(msgr->cct) << "accepter.bind unable to bind to " << listen_addr.ss_addr()
                     << " on any port in range " << msgr->cct->_conf->ms_bind_port_min
                     << "-" << msgr->cct->_conf->ms_bind_port_max
                     << ": " << cpp_strerror(errno)
                     << dendl;
    r = -errno;
    listen_addr.set_port(0); // clear port before retry, otherwise we shall fail again.
    continue;
}
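To make the difference easier to see, here is a minimal standalone model of the retry behaviour described above. It is only a sketch and not the Ceph code: bind_with_retry, BindFn, RetryResult and the reset_port_on_failure switch are hypothetical names, the real ::bind() call is replaced by a callback, and the "specific port" path is assumed from the log output, where the retries name only 111.111.111.245:7300.

```cpp
#include <functional>
#include <set>

// Hypothetical stand-in for ::bind(): returns true when `port` can be bound
// on the given retry attempt.
using BindFn = std::function<bool(int attempt, int port)>;

struct RetryResult {
    bool bound;    // whether any attempt succeeded
    int port;      // the bound port, or 0 if none
    int attempts;  // number of attempts made
};

// Simplified model of the retry loop discussed in this report.
// reset_port_on_failure switches the proposed fix on or off.
RetryResult bind_with_retry(int port_min, int port_max, int retry_count,
                            const std::set<int>& avoid_ports,
                            const BindFn& try_bind,
                            bool reset_port_on_failure)
{
    int listen_port = 0;  // 0 means "no specific port requested"
    for (int attempt = 0; attempt < retry_count; ++attempt) {
        if (listen_port != 0) {
            // Specific-port path: only the remembered port is tried.
            if (try_bind(attempt, listen_port))
                return {true, listen_port, attempt + 1};
            continue;
        }
        // Range path: scan [port_min, port_max], skipping avoided ports.
        for (int port = port_min; port <= port_max; ++port) {
            if (avoid_ports.count(port))
                continue;
            listen_port = port;  // remembered, like listen_addr.set_port(port)
            if (try_bind(attempt, port))
                return {true, port, attempt + 1};
        }
        if (reset_port_on_failure)
            listen_port = 0;  // the proposed fix: forget the last port tried
    }
    return {false, 0, retry_count};
}
```

For example, take a callback under which every port fails on the first attempt and the highest port in the range stays occupied forever. With reset_port_on_failure set to false, the model keeps retrying that one port and never binds. With it set to true, the second attempt scans the range again and binds the first free port.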
Updated by xie xingguo over 8 years ago
All we need is the following statement:
listen_addr.set_port(0); // clear port before retry, otherwise we shall fail again.
Updated by Kefu Chai over 8 years ago
- Status changed from New to Fix Under Review
- Assignee set to xie xingguo
Updated by Kefu Chai over 8 years ago
- Status changed from Fix Under Review to Resolved