Bug #6297

ceph osd tell * will break when FD limit reached, messenger should close pipes as necessary

Added by Brian Andrus over 10 years ago. Updated about 1 year ago.

Status:
In Progress
Priority:
Normal
Assignee:
Brad Hubbard
Category:
-
Target version:
-
% Done:

0%

Source:
Support
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In environments with a large number of OSDs (approaching or exceeding the configured file descriptor limit), ceph osd tell * will start throwing errors once the fd limit is reached.

osd.1018: filestore_wbthrottle_enable = 'false'
osd.1019: filestore_wbthrottle_enable = 'false'
2013-09-11 11:24:30.007712 7f61b4453700 -1 -- x.x.23.100:0/1030951 >> x.x.23.102:6899/92790 pipe(0x7f6618002800 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f6618002a60).connect couldn't created socket Too many open files
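
A quick way to estimate whether a cluster will hit this is to compare the OSD count against the shell's fd limit. A rough sketch in Python (assumes the ceph CLI is in PATH and that `ceph osd ls --format json` returns the list of OSD ids):

#!/usr/bin/env python3
# Hypothetical helper: compare the OSD count against the current fd soft limit
# to estimate whether "ceph tell osd.*" is likely to hit EMFILE.
import json
import resource
import subprocess

osd_ids = json.loads(subprocess.check_output(["ceph", "osd", "ls", "--format", "json"]))
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

print(f"OSDs: {len(osd_ids)}, fd soft limit: {soft_limit}")
if len(osd_ids) >= soft_limit:
    print("warning: ceph tell osd.* may fail with 'Too many open files'")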

History

#1 Updated by Samuel Just over 10 years ago

  • Tracker changed from Bug to Fix
  • Subject changed from ceph osd tell * will break when FD limit reached to ceph osd tell * will break when FD limit reached, messenger should close pipes as necessary

#2 Updated by Patrick Donnelly about 5 years ago

  • Tracker changed from Fix to Bug
  • Project changed from Ceph to RADOS
  • Status changed from New to Rejected
  • Regression set to No
  • Severity set to 3 - minor
  • Component(RADOS) OSD added

I suspect this isn't a problem anymore with systemd units allowing us to specify a larger number of file descriptors.

#3 Updated by Dan van der Ster about 5 years ago

Patrick Donnelly wrote:

I suspect this isn't a problem anymore with systemd units allowing us to specify a larger number of file descriptors.

I don't see how the systemd units are related. This ticket refers to the ceph CLI leaving fds open during a tell operation across several daemons. The problem is that the default ulimit -n (on CentOS/RHEL, at least) for a user shell is 1024. The ceph CLI runs out of fds, not the daemons.

So here is the behaviour on v12.2.8:

# ceph tell osd.* version
osd.0: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
osd.1: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
osd.2: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
...
osd.1032: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
osd.1033: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
osd.1034: {
    "version": "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)" 
}
2019-01-08 09:25:21.902345 7fb7589f5700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2019-01-08 09:25:22.103103 7fb7589f5700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2019-01-08 09:25:22.504095 7fb7589f5700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2019-01-08 09:25:23.305131 7fb7589f5700 -1 NetHandler create_socket couldn't create socket (24) Too many open files

Once the ceph CLI has sent the version command to osd.0, it should ideally close that fd before moving on to the next OSD.
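
Until the CLI does this, a crude workaround is to loop over the OSDs one at a time so each invocation's sockets are closed before the next one is opened. A rough sketch (hypothetical script, assumes the ceph CLI is in PATH; it trades speed for a bounded fd count):

#!/usr/bin/env python3
# Workaround sketch: run one "ceph tell" per OSD in its own subprocess so the
# messenger sockets of each invocation are closed before the next one starts.
import json
import subprocess

osd_ids = json.loads(subprocess.check_output(["ceph", "osd", "ls", "--format", "json"]))
for osd_id in osd_ids:
    out = subprocess.check_output(["ceph", "tell", f"osd.{osd_id}", "version"])
    print(f"osd.{osd_id}: {out.decode().strip()}")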

#4 Updated by Brad Hubbard over 2 years ago

  • Status changed from Rejected to In Progress
  • Assignee set to Brad Hubbard

This has come up again, so I am reopening this tracker to follow up on the resolution.

#5 Updated by Frank Schilder about 1 year ago

Just ran into this problem as well. I'm scraping OSD perf dumps to a file in a script and I get:

# ./devel/check_pg_log_dups
Scraping OSD perf data to /root/osd-perf.json
2023-03-07T13:14:06.305+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2023-03-07T13:14:06.506+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2023-03-07T13:14:06.907+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2023-03-07T13:14:07.708+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2023-03-07T13:14:09.309+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files
2023-03-07T13:14:12.513+0100 7f3105e3f700 -1 NetHandler create_socket couldn't create socket (24) Too many open files

The offending command in the script is

ceph tell "osd.*" perf dump --format json-pretty

We have 1260 OSDs and the errors start popping up after around 1000 OSD perf dumps have been produced. It would be great if sockets were closed after use, or if the number of concurrently open sockets were limited to a reasonable number below `ulimit -n`.

I guess the command still produces correct output, assuming the tell command is re-tried after popen fails?

Best regards,
Frank

#6 Updated by Brad Hubbard about 1 year ago

Hi Frank,

Can you confirm that increasing the file limit to some level just above 1260 (allowing for some miscellaneous connections and open files) allows the command to complete successfully?

Thanks in advance.

#7 Updated by Frank Schilder about 1 year ago

Hi Brad, yes I can. I tried with 1300 and it works fine. I added "ulimit -n 2048" to the script as a work-around.
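
For reference, the equivalent from inside a Python script would be something like this (untested sketch; 2048 mirrors the value above and should be chosen to exceed the OSD count):

#!/usr/bin/env python3
# Raise the soft RLIMIT_NOFILE (up to the hard limit) before scraping perf dumps,
# mirroring the "ulimit -n 2048" shell workaround.
import resource
import subprocess

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
wanted = 2048  # pick something above the OSD count
if soft < wanted:
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except ValueError:
        pass  # wanted exceeds the hard limit; an administrator has to raise it

subprocess.run(["ceph", "tell", "osd.*", "perf", "dump", "--format", "json-pretty"], check=True)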

Mainly, I think a customized error message when this happens would be nice, something that suggests increasing the limit to avoid the error.

If this socket create error is handled gracefully (the tell command is re-tried after some time), that should be all that's needed.

Best regards,
Frank

PS: Our version is Octopus. I guess it's present in all versions.

#8 Updated by Brad Hubbard about 1 year ago

Thanks for the confirmation Frank. I'm revisiting this.
