Bug #21356

closed

ceph-mgr admin socket starts failing after many attempts to call nonexistent command

Added by John Spray over 6 years ago. Updated about 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: ceph-mgr
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Apparently, if something repeatedly calls nonexistent commands on the mgr's admin socket, the mgr eventually hits errno 24 (too many open files) and from then on fails to service any admin socket request at all.

<nhm> oh my, my mgr log is 177G filled with "2017-09-11 10:05:43.794513 7f95f87de700 -1 asok(0x5557962761c0) AdminSocket: do_accept error: '(24) Too many open files
<jcsp> nhm: have you got something running that is hitting the admin socket?
<nhm> jcsp: yep, looks like it's due to cbt being dumb and telling it to dump_historic_ops
<nhm> jcsp: that appears to be what set it off.
<jcsp> and admin socket is now just failing for any caller?
<nhm> jcsp: yep
<jcsp> I wonder if we're leaking something in the command not found path
Actions #1

Updated by John Spray over 6 years ago

At the same time as fixing this, let's fix the code in common/admin_socket that spins in AdminSocket::entry when do_accept fails (it doesn't check the return value of do_accept). Currently the spinning means it spews out monster log files whenever there is an issue.
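
The idea of checking the accept result instead of spinning can be sketched as follows. This is a generic Python illustration, not Ceph's C++ code: the function name accept_with_backoff, its parameters, and the delay values are all hypothetical, and the real AdminSocket::entry loop has its own structure.

```python
import errno
import time


def accept_with_backoff(accept, handle, max_iterations, sleep=time.sleep,
                        base_delay=0.01, max_delay=1.0):
    """Run an accept loop that backs off on descriptor exhaustion.

    accept: callable returning a connection, or raising OSError on failure.
    handle: callable invoked with each accepted connection.
    Returns the number of failed accept attempts.
    (All names here are hypothetical; this is not the Ceph implementation.)
    """
    delay = base_delay
    failures = 0
    for _ in range(max_iterations):
        try:
            conn = accept()
        except OSError as e:
            failures += 1
            if e.errno in (errno.EMFILE, errno.ENFILE):
                # Out of descriptors: wait before retrying rather than
                # looping immediately and logging the same error each time.
                sleep(delay)
                delay = min(delay * 2, max_delay)
                continue
            raise  # Unexpected errors are not retried in this sketch.
        delay = base_delay  # A successful accept resets the backoff.
        handle(conn)
    return failures
```

With a loop of this shape, a persistent "(24) Too many open files" condition would produce at most one log line per backoff interval instead of a continuous stream.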

Actions #2

Updated by Kefu Chai over 6 years ago

  • Status changed from New to Need More Info
  • Assignee set to Mark Nelson

Mark, what log messages were filling your disk? Could you pastebin or just paste a sample of them? And what debug level were you using at the time?

i've been running a loop of

while true;do ./bin/ceph daemon mgr.x dump_historic_ops; done

for two hours. out/mgr.x.log is only 2.3MB, and the mgr is still able to serve the admin socket command, replying:

no valid command found; 10 closest matches:
mds_sessions
mds_requests
log dump
kick_stale_sessions
log reopen
log flush
get_command_descriptions
dump_mempools
help
git_version
admin_socket: invalid command
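
When running a reproduction loop like the one above, one way to tell whether descriptors are leaking is to sample the daemon's open file descriptor count over time. A minimal sketch, assuming Linux with /proc mounted; the helper name count_open_fds is hypothetical and is not part of any Ceph tooling:

```python
import os


def count_open_fds(pid="self"):
    """Return the number of open file descriptors of a process.

    Assumes a Linux /proc filesystem. For pid="self" the count includes
    the descriptor briefly opened to list the directory, so compare
    samples against each other rather than reading them as absolute.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Calling count_open_fds(pid) with the ceph-mgr process ID before and after a batch of invalid commands would show whether the count grows. Reading another process's /proc/<pid>/fd requires running as the same user or as root.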

Actions #3

Updated by John Spray over 6 years ago

I think it was just the same "do_accept error: '(24) Too many open files'" message repeated many times, right?

Actions #4

Updated by Sage Weil about 6 years ago

  • Status changed from Need More Info to Can't reproduce