Bug #12655


unable to bind to IP:PORT on any port in range 6800-7300: (98) Address already in use

Added by karan singh over 8 years ago. Updated over 8 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
OSD
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi there,

In my production Ceph cluster, several OSDs (~20) crashed with this error:

ceph-osd.34.log:2015-08-10 15:55:32.654670 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use

Other OSDs also crashed with the same error:

[root@node-s01 ceph]# grep -i unable *.log
ceph-osd.34.log:2015-08-10 15:55:32.654670 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.34.log:2015-08-10 15:55:32.655310 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.34.log:2015-08-10 15:55:32.655769 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.34.log:    -4> 2015-08-10 15:55:32.654670 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.34.log:    -3> 2015-08-10 15:55:32.655310 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.34.log:    -2> 2015-08-10 15:55:32.655769 7f792ea4c700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:2015-08-10 15:55:31.149299 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:2015-08-10 15:55:31.150890 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:2015-08-10 15:55:31.154187 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:    -4> 2015-08-10 15:55:31.149299 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:    -3> 2015-08-10 15:55:31.150890 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.38.log:    -2> 2015-08-10 15:55:31.154187 7f31a14bf700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.41.log:2015-08-10 07:30:49.837903 7fe825c26700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.41.log:    -2> 2015-08-10 07:30:49.837903 7fe825c26700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.45.log:2015-08-10 15:55:31.147666 7f647fb54700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.45.log:2015-08-10 15:55:31.149575 7f647fb54700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.45.log:    -3> 2015-08-10 15:55:31.147666 7f647fb54700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.45.log:    -2> 2015-08-10 15:55:31.149575 7f647fb54700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:2015-08-10 15:55:31.199112 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:2015-08-10 15:55:31.201311 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:2015-08-10 15:55:31.204034 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:    -4> 2015-08-10 15:55:31.199112 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:    -3> 2015-08-10 15:55:31.201311 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.55.log:    -2> 2015-08-10 15:55:31.204034 7fa2b02d6700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.58.log:2015-08-10 15:55:31.162660 7fd477c52700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.58.log:2015-08-10 15:55:31.166120 7fd477c52700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.58.log:    -3> 2015-08-10 15:55:31.162660 7fd477c52700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.58.log:    -2> 2015-08-10 15:55:31.166120 7fd477c52700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.8.log:2015-08-10 07:30:51.730846 7fe5e67f7700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.8.log:2015-08-10 07:30:51.731851 7fe5e67f7700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.8.log:    -3> 2015-08-10 07:30:51.730846 7fe5e67f7700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
ceph-osd.8.log:    -2> 2015-08-10 07:30:51.731851 7fe5e67f7700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
[root@node-s01 ceph]#

Here is the backtrace from one of the OSDs:

   -10> 2015-08-10 12:38:02.766359 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.33 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -9> 2015-08-10 12:38:02.766423 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.50 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -8> 2015-08-10 12:38:02.766433 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.134 ever on either front or back, first ping sent 2015-08-10 12:37:23.469422 (cutoff 2015-08-10 12:37:42.766354)
    -7> 2015-08-10 12:38:02.766446 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.200 ever on either front or back, first ping sent 2015-08-10 12:37:15.361731 (cutoff 2015-08-10 12:37:42.766354)
    -6> 2015-08-10 12:38:02.766454 7faa0abce700 -1 osd.60 39761 heartbeat_check: no reply from osd.228 ever on either front or back, first ping sent 2015-08-10 12:37:00.655566 (cutoff 2015-08-10 12:37:42.766354)
    -5> 2015-08-10 12:38:03.259647 7fa9b5b9a700  0 -- 10.100.50.2:0/82807 >> 10.100.50.4:7142/147030592 pipe(0x4ff3200 sd=399 :0 s=1 pgs=0 cs=0 l=1 c=0x44b3de0).fault
    -4> 2015-08-10 12:38:03.259682 7fa9b5594700  0 -- 10.100.50.2:0/82807 >> 10.100.50.1:7204/408026440 pipe(0xf278f00 sd=411 :0 s=1 pgs=0 cs=0 l=1 c=0x44b7bc0).fault
    -3> 2015-08-10 12:38:03.271675 7fa9ecda2700  0 log [WRN] : map e39763 wrongly marked me down
    -2> 2015-08-10 12:38:03.306073 7fa9ecda2700 -1 accepter.accepter.bind unable to bind to 10.100.50.2:7300 on any port in range 6800-7300: (98) Address already in use
    -1> 2015-08-10 12:38:03.368817 7fa9ecda2700  0 osd.60 39763 prepare_to_stop starting shutdown
     0> 2015-08-10 12:38:03.372071 7fa9ecda2700 -1 common/Mutex.cc: In function 'void Mutex::Lock(bool)' thread 7fa9ecda2700 time 2015-08-10 12:38:03.368886
common/Mutex.cc: 93: FAILED assert(r == 0)

 ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
 1: (Mutex::Lock(bool)+0x1d3) [0xa83003]
 2: (OSD::shutdown()+0x63) [0x63f3f3]
 3: (OSD::handle_osd_map(MOSDMap*)+0x1829) [0x64dff9]
 4: (OSD::_dispatch(Message*)+0x2fb) [0x6600eb]
 5: (OSD::ms_dispatch(Message*)+0x211) [0x6607b1]
 6: (DispatchQueue::entry()+0x5a2) [0xb5ac12]
 7: (DispatchQueue::DispatchThread::entry()+0xd) [0xaf23ad]
 8: /lib64/libpthread.so.0() [0x35952079d1]
 9: (clone()+0x6d) [0x3594ee89dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Environment Details
  • ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
  • Kernel : 2.6.32-431.el6.x86_64
  • CentOS release 6.5 (Final)
  • I have 4 OSD nodes, but only 2 of them have shown this error

After restarting the crashed OSDs, the error did not come back. Currently the cluster is running with the NOOUT and NODOWN flags set, and health is OK. I am not sure whether the OSDs will die again if I unset noout/nodown.

On tracker.ceph.com there are #3816 and #10494, which are not directly related but appear similar to this problem.

Need help (I don't have debug 20 logs).


Related issues 1 (0 open, 1 closed)

Has duplicate Ceph - Bug #12654: unable to bind to IP:PORT on any port in range 6800-7300: (98) Address already in use (Duplicate, 08/10/2015)

Actions #1

Updated by karan singh over 8 years ago

Hi,

Today I encountered the same error again, and another OSD crashed with it.

Need help to fix this problem.

2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
2015-08-11 16:01:19.618929 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
    -4> 2015-08-11 16:01:19.617860 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
    -3> 2015-08-11 16:01:19.618929 7f3d95219700 -1 accepter.accepter.bind unable to bind to 10.100.50.1:7300 on any port in range 6800-7300: (98) Address already in use
Actions #2

Updated by Sage Weil over 8 years ago

You can increase the range of ports we try to bind to with ms_bind_port_min (default 6800) and ms_bind_port_max (default 7300). You must have a lot of OSDs on the same box?
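In ceph.conf terms, that suggestion looks something like the following (a minimal sketch; the raised maximum of 7500 is an illustrative value, not a recommendation — the defaults are 6800 and 7300):

```ini
[osd]
# Widen the messenger bind port range (defaults: 6800-7300).
ms bind port min = 6800
ms bind port max = 7500
```

The OSD daemons need a restart for the new bind range to take effect.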

Actions #3

Updated by karan singh over 8 years ago

Thanks Sage, setting a larger port range fixed this.

Before closing this ticket, could you help me understand this?

There are 60 OSDs on the node; however, netstat says the OSDs are consuming 300 ports:

[root@node1 ~]# df -h | grep -i osd | wc -l
60
[root@node1 ~]# netstat -plunt | grep -i osd | wc -l
300

As per the Ceph documentation, each Ceph OSD Daemon on a Ceph node may use up to three ports.

So 3 × 60 = 180 ports, but in my case it's 300. Why is my Ceph node consuming more ports?

http://ceph.com/docs/master/rados/configuration/network-config-ref/#osd-ip-tables

Actions #4

Updated by Sage Weil over 8 years ago

karan singh wrote:

Thanks Sage, setting a larger port range fixed this.

Before closing this ticket, could you help me understand this?

There are 60 OSDs on the node; however, netstat says the OSDs are consuming 300 ports:

[...]

As per the Ceph documentation, each Ceph OSD Daemon on a Ceph node may use up to three ports.

So 3 × 60 = 180 ports, but in my case it's 300. Why is my Ceph node consuming more ports?

http://ceph.com/docs/master/rados/configuration/network-config-ref/#osd-ip-tables

It's actually 4 ports, which takes you to 240. Not sure why it couldn't find a free one... maybe OSDs were restarting and their ports weren't bindable yet?
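A quick way to sanity-check the per-daemon count is to group the netstat output by PID. A sketch (the here-string below stands in for real `netstat -plnt` output; the PIDs and ports are made up):

```shell
# Count LISTEN sockets per ceph-osd PID; the last netstat column is PID/name.
sample='tcp 0 0 10.100.50.1:6800 0.0.0.0:* LISTEN 1234/ceph-osd
tcp 0 0 10.100.50.1:6801 0.0.0.0:* LISTEN 1234/ceph-osd
tcp 0 0 10.100.50.1:6802 0.0.0.0:* LISTEN 1234/ceph-osd
tcp 0 0 10.100.50.1:6803 0.0.0.0:* LISTEN 1234/ceph-osd
tcp 0 0 10.100.50.1:6804 0.0.0.0:* LISTEN 5678/ceph-osd'
echo "$sample" | awk '{split($NF, a, "/"); print a[1]}' | sort | uniq -c
# On a real node, replace the sample with: netstat -plnt | grep ceph-osd
```

If each ceph-osd PID shows about 4 listening sockets, the total lines up with the 4-ports-per-OSD figure above.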

Actions #5

Updated by Sage Weil over 8 years ago

  • Status changed from New to Rejected
