Bug #17573

closed

librados doesn't properly report failure to create socket

Added by Ben England over 7 years ago. Updated about 7 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a really large Jewel cluster with > 1000 OSDs, I see rados bench hang on a pool with the command below. It starts off OK, but then it gets an error saying it couldn't create a socket: too many open files. I'm guessing it is trying to open a socket to an OSD. If I run it with "ulimit -n 131072", this doesn't happen. My point is that it should exit when it gets this error rather than just hang, so that CBT doesn't hang and we can reach some kind of conclusion about the health of the cluster.

Is this a problem bigger than the rados utility? If so, it could impact Ceph scalability with OSD count.

To reproduce in a smaller cluster, you might have to lower the open file descriptor limit drastically.
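
As a client-side workaround, the benchmark driver could pre-check (and, if the hard limit permits, raise) its own open-file limit before starting, since the client seems to need roughly one socket per OSD it talks to. The sketch below is just an illustration using POSIX getrlimit/setrlimit; the "one fd per OSD plus headroom" sizing and the helper name are my own assumptions, not anything librados documents.

// fd_preflight.cc -- hypothetical pre-flight check, illustration only
// build: g++ -o fd_preflight fd_preflight.cc
#include <sys/resource.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Assumption: one socket per OSD the client may contact, plus headroom
// for monitor sessions, log files, the admin socket, etc.
static bool ensure_nofile(rlim_t osd_count, rlim_t headroom = 512) {
  rlim_t needed = osd_count + headroom;
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
    fprintf(stderr, "getrlimit: %s\n", strerror(errno));
    return false;
  }
  if (rl.rlim_cur >= needed)
    return true;                             // soft limit already sufficient
  rl.rlim_cur = (needed <= rl.rlim_max) ? needed : rl.rlim_max;
  if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
    fprintf(stderr, "setrlimit: %s\n", strerror(errno));
    return false;
  }
  return rl.rlim_cur >= needed;              // false if the hard limit is still too low
}

int main() {
  if (!ensure_nofile(1043)) {                // 1043 = OSD count in the cluster below
    fprintf(stderr, "open-file limit too low for this cluster, aborting\n");
    return 1;
  }
  // ... start the rados bench / librados workload from here ...
  return 0;
}

A check like this fails loudly up front (the behavior being asked for) instead of letting the run hang partway through.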

workload:
/usr/bin/rados -c /etc/ceph/ceph.conf -p rados-bench-cbt bench -b 4194304 30 write --concurrent-ios 16 --run-name `hostname -s`-0 --no-cleanup
=>>ulimit -n 1024 ; /usr/bin/rados -c /etc/ceph/ceph.conf -p rados-bench-cbt bench -b 4194304 30 write --concurrent-ios 16 --run-name `hostname -s`-0 --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 30 seconds or 0 objects
Object prefix: benchmark_data_c02-h17-r620.rdu.openstack.en_48662
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 287 271 1083.94 1084 0.0389778 0.0567504
2 16 567 551 1101.88 1120 0.110503 0.0570464
3 16 846 830 1106.53 1116 0.0659647 0.057096
4 16 1127 1111 1110.86 1124 0.0575751 0.0570865
5 16 1407 1391 1112.65 1120 0.0415876 0.0571044
6 16 1687 1671 1113.84 1120 0.0728675 0.0569945
7 16 1969 1953 1115.84 1128 0.0406718 0.0570041
8 16 2250 2234 1116.84 1124 0.0374575 0.0570422
9 16 2529 2513 1116.73 1116 0.0466408 0.0570168
10 16 2813 2797 1118.64 1136 0.0933425 0.0570534
11 16 3090 3074 1117.66 1108 0.0443647 0.0570197
12 16 3370 3354 1117.84 1120 0.0398344 0.0570368
13 16 3655 3639 1119.53 1140 0.0525464 0.0570459
14 16 3934 3918 1119.27 1116 0.0366038 0.0570136
15 16 4215 4199 1119.58 1124 0.0470502 0.0570443
16 16 4496 4480 1119.84 1124 0.0656467 0.0570273
2016-10-13 18:17:35.702165 7f893ae34700 -1 -- 172.18.66.49:0/1757557866 >> 172.18.64.232:6904/113736 pipe(0x7f8cda701810 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8cda43f870).connect couldn't created socket (24) Too many open files
2016-10-13 18:17:35.702299 7f893ae34700 0 -- 172.18.66.49:0/1757557866 >> 172.18.64.232:6904/113736 pipe(0x7f8cda701810 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8cda43f870).fault
2016-10-13 18:17:35.702393 7f893ae34700 1 -- 172.18.66.49:0/1757557866 >> 172.18.64.232:6904/113736 pipe(0x7f8cda701810 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8cda43f870).connect couldn't created socket (24) Too many open files
2016-10-13 18:17:35.725879 7f893ad33700 1 -- 172.18.66.49:0/1757557866 >> 172.18.65.170:6928/218835 pipe(0x7f8cda7044c0 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8cda6f2990).connect couldn't created socket (24) Too many open files
2016-10-13 18:17:35.725962 7f893ad33700 0 -- 172.18.66.49:0/1757557866 >> 172.18.65.170:6928/218835 pipe(0x7f8cda7044c0 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f8cda6f2990).fault

There is just one rados bench process running; nothing else is going on in the cluster.
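
For what it's worth, here is a hypothetical sketch of the fail-fast behavior being asked for. This is not Ceph's messenger code, just an illustration of a connect helper that returns -errno when socket() fails, so the caller can treat EMFILE/ENFILE as fatal instead of retrying indefinitely.

// fail_fast_connect.cc -- illustration only, not Ceph's Pipe::connect()
#include <sys/socket.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Return a socket fd, or -errno instead of swallowing the failure.
static int open_osd_socket() {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0)
    return -errno;                   // e.g. -EMFILE: (24) Too many open files
  return fd;
}

int main() {
  int r = open_osd_socket();
  if (r == -EMFILE || r == -ENFILE) {
    // fd limit hit: retrying cannot succeed, so report the error and exit
    // rather than leaving the benchmark hung.
    fprintf(stderr, "cannot create socket: %s\n", strerror(-r));
    return 1;
  }
  if (r < 0) {
    fprintf(stderr, "socket error: %s\n", strerror(-r));
    return 1;
  }
  close(r);
  return 0;
}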

=>>ceph -s
cluster 6942aa60-f4c4-4954-88d7-c67e0d0a0df5
health HEALTH_OK
monmap e1: 5 mons at {c04-h33-6018r=172.18.65.121:6789/0,c05-h33-6018r=172.18.65.118:6789/0,c06-h29-6018r=172.18.64.251:6789/0,c07-h29-6018r=172.18.64.226:6789/0,c07-h30-6018r=172.18.64.238:6789/0}
election epoch 6, quorum 0,1,2,3,4 c07-h29-6018r,c07-h30-6018r,c06-h29-6018r,c05-h33-6018r,c04-h33-6018r
fsmap e6: 1/1/1 up {0=c07-h29-6018r=up:active}
osdmap e2149: 1043 osds: 1043 up, 1043 in
flags sortbitwise
pgmap v7937: 27752 pgs, 10 pools, 241 GB data, 61964 objects
1141 GB used, 1895 TB / 1896 TB avail
27752 active+clean

pool 22 'rados-bench-cbt' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 16384 pgp_num 16384 last_change 2149 flags hashpspool stripe_width 0


Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Feature #15308: LIBRADOS modify Pipe::connect() to return the error code (Resolved, 06/15/2016)
