Bug #21356
closed
ceph-mgr admin socket starts failing after many attempts to call nonexistent command
Description
Apparently, if something repeatedly calls nonexistent commands on the mgr's admin socket, the mgr eventually hits errno 24 (too many open files) and from then on fails to service any admin socket request at all.
<nhm> oh my, my mgr log is 177G filled with "2017-09-11 10:05:43.794513 7f95f87de700 -1 asok(0x5557962761c0) AdminSocket: do_accept error: '(24) Too many open files'"
<jcsp> nhm: have you got something running that is hitting the admin socket?
<nhm> jcsp: yep, looks like it's due to cbt being dumb and telling it to dump_historic_ops
* excelle08 has quit (Ping timeout: 480 seconds)
<nhm> jcsp: that appears to be what set it off.
<jcsp> and admin socket is now just failing for any caller?
<nhm> jcsp: yep
<jcsp> I wonder if we're leaking something in the command not found path
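One way to check the leak theory is to watch the daemon's open descriptor count while the unknown command is being issued. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/fd lists one entry per open descriptor; the function names are made up for illustration and are not part of Ceph.

```python
import os
import time


def open_fd_count(pid=None):
    """Return the number of open file descriptors for a process.

    Assumes Linux, where /proc/<pid>/fd has one entry per open descriptor.
    Inspecting a process owned by another user usually requires root.
    """
    pid = os.getpid() if pid is None else pid
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_growth(pid, samples=5, interval=1.0, sleep=time.sleep):
    """Sample the descriptor count several times and return the readings.

    A count that keeps climbing while the unknown command is being issued
    would support the leak theory; a flat count would not.
    """
    readings = []
    for i in range(samples):
        readings.append(open_fd_count(pid))
        if i < samples - 1:
            sleep(interval)
    return readings
```

Calling fd_growth with the mgr's process ID while the offending client runs gives a quick indication of whether descriptors are accumulating.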
Updated by John Spray over 6 years ago
At the same time as fixing this, let's fix the code in common/admin_socket that spins in AdminSocket::entry when do_accept fails (it doesn't check the return value of do_accept). Currently the spinning means it spews out monster log files whenever there is an issue.
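The idea of checking the accept result and not spinning can be sketched as follows. This is an illustration in Python, not the Ceph C++ code or the eventual fix; accept_loop, handle, should_stop and the backoff values are all assumptions made for the example.

```python
import errno
import socket
import time


def accept_loop(server, handle, should_stop, max_backoff=1.0, sleep=time.sleep):
    """Accept connections until should_stop() returns True.

    When accept() fails (for example with EMFILE), sleep with an
    exponentially growing delay instead of retrying immediately, so a
    persistent error cannot turn into a busy loop that floods the log.
    Returns the number of accept errors seen.
    """
    initial_backoff = 0.01
    backoff = initial_backoff
    errors = 0
    while not should_stop():
        try:
            conn, _ = server.accept()
        except socket.timeout:
            # Only relevant if the listening socket has a timeout set.
            continue
        except OSError as e:
            errors += 1
            if e.errno not in (errno.EMFILE, errno.ENFILE, errno.EINTR):
                # Hypothetical policy: unexpected errors are still retried,
                # but a real daemon might choose to stop here instead.
                pass
            sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
            continue
        backoff = initial_backoff  # a successful accept resets the delay
        try:
            handle(conn)
        finally:
            conn.close()  # always release the descriptor, even on error paths
    return errors
```

The finally clause also covers the other half of the report: whatever the handler does with an unknown command, the connection's descriptor is released.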
Updated by Kefu Chai over 6 years ago
- Status changed from New to Need More Info
- Assignee set to Mark Nelson
Mark, what log messages were filling your disk? Could you pastebin or just paste a sample of them? And what debug level were you using at the time?
I've been running a loop of
while true; do ./bin/ceph daemon mgr.x dump_historic_ops; done
for two hours. out/mgr.x.log is 2.3MB, and the mgr is still able to serve the asok command, replying with
no valid command found; 10 closest matches:
mds_sessions
mds_requests
log dump
kick_stale_sessions
log reopen
log flush
get_command_descriptions
dump_mempools
help
git_version
admin_socket: invalid command
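A bounded variant of the loop above, which stops at the first reply that is not the usual "invalid command" response, might look like the sketch below. It assumes the same ./bin/ceph path and mgr.x name used in the loop; probe, run_once and the "expected" marker are names chosen for this example, and what the CLI prints once the socket is broken is not assumed.

```python
import subprocess

# Same command as the shell loop above.
CMD = ["./bin/ceph", "daemon", "mgr.x", "dump_historic_ops"]


def run_once(cmd=CMD):
    """Run the command once and return its combined stdout and stderr."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.stdout + proc.stderr


def probe(iterations, run=run_once, expected="invalid command"):
    """Repeat the call and report the first unexpected reply.

    Returns (iteration_index, output) for the first reply that does not
    contain `expected`, or None if every reply looked normal.
    """
    for i in range(iterations):
        out = run()
        if expected not in out:
            return i, out
    return None
```

Running probe with a large iteration count reports how many calls succeeded before the behaviour changed, which is easier to compare between setups than an open-ended loop.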
Updated by John Spray over 6 years ago
I think it was just the same "do_accept error: '(24) Too many open files'" message many times, right?
Updated by Sage Weil about 6 years ago
- Status changed from Need More Info to Can't reproduce