Bug #18184: SimpleMessenger Pipe threads are spinning when idle - Messengers - Ceph

Bug #18184

closed

SimpleMessenger Pipe threads are spinning when idle

Added by Greg Farnum over 7 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Immediate
Assignee: Sage Weil
Backport: jewel
Severity: 3 - minor

Description

See the ceph-users thread at https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34275.html
After upgrading, users are seeing their CPU usage rise sharply about 15 minutes after OSDs boot, and sock_recvmsg is at the top of the profile when using perf.

We got a hint of this issue in #14120, but didn't realize how critical the bug was or that it was a new issue rather than a rare and untested one.

The actual problem is that in https://github.com/ceph/ceph/pull/8416, we changed Pipe::tcp_read_wait() to return -errno instead of "-1" when poll() returns a value <= 0. The intention was to return the error on failure, but the actual return value spec for poll() is:

On success, a positive number is returned; this is the number of structures which have nonzero revents fields (in other words, those descriptors with events or errors reported).  A value of 0 indicates that the call timed out and no file descriptors were ready.  On error, -1 is returned, and errno is set appropriately.

This means we get 0 on timed-out sockets! Returning -errno in that case means returning -0, i.e. 0 (assuming errno was initialized; otherwise we negate a garbage value that makes the outcome positive half the time!).
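The timeout case can be reproduced outside Ceph. The following is only a minimal sketch of the logic described above, not the actual Pipe code: buggy_read_wait() and buggy_wait_on_idle_pipe() are hypothetical names, and the demo assumes errno is 0 before the call.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical stand-in for the regressed wait logic:
// any poll() result <= 0 is reported as -errno.
int buggy_read_wait(int fd, int timeout_ms) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  if (poll(&pfd, 1, timeout_ms) <= 0)
    return -errno;  // on timeout poll() returned 0 and did not set errno
  return 0;
}

// Hypothetical demo: wait on a pipe that never receives data and
// return what the caller of buggy_read_wait() would see.
int buggy_wait_on_idle_pipe() {
  int fds[2];
  if (pipe(fds) != 0)
    return -errno;
  errno = 0;  // assumption: no stale error value is pending in this thread
  int r = buggy_read_wait(fds[0], 10);
  close(fds[0]);
  close(fds[1]);
  return r;
}
```

Under that assumption, buggy_wait_on_idle_pipe() returns 0, which a caller that only checks for a negative value cannot tell apart from "data is ready".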
Then Pipe::tcp_read() checks whether tcp_read_wait()'s return value is < 0 in order to error out; otherwise it continues into tcp_read_nonblocking(), which loops on calling recv() for as long as it gets back EAGAIN or EINTR (because it expects that we have already validated there is data available to read). With nothing to read, that loop never exits, so the thread spins.
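To illustrate why such a retry loop burns CPU on an idle connection, here is a rough sketch using a local socket pair. It is not tcp_read_nonblocking() itself: count_idle_recv_retries() is a hypothetical helper, and the max_spins cap exists only so the demo terminates.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical illustration: retry a non-blocking recv() on EAGAIN/EINTR.
// Returns the number of retries performed, or -1 if socket setup failed.
long count_idle_recv_retries(long max_spins) {
  int sv[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
    return -1;
  char buf[16];
  long spins = 0;
  while (spins < max_spins) {
    ssize_t got = recv(sv[0], buf, sizeof(buf), MSG_DONTWAIT);
    if (got < 0 &&
        (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)) {
      ++spins;  // nothing to read yet: an uncapped loop would retry forever
      continue;
    }
    break;  // data arrived, the peer closed, or a real error occurred
  }
  close(sv[0]);
  close(sv[1]);
  return spins;
}
```

Because the peer never writes, every recv() call fails with EAGAIN and the loop runs until the cap is hit; without the cap it would keep calling recv() back to back.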

To fix this, we need to translate the "0" response from poll() into an error code.
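One possible shape for that translation is sketched below. This is an illustration under stated assumptions rather than the patch that was merged: fixed_read_wait() and fixed_wait_on_idle_pipe() are hypothetical names, and -EAGAIN is chosen here only as an example error code for the timeout case.

```cpp
#include <poll.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical fixed wait: distinguishes poll() failure, timeout, and readiness.
int fixed_read_wait(int fd, int timeout_ms) {
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLIN;
  pfd.revents = 0;
  int r = poll(&pfd, 1, timeout_ms);
  if (r < 0)
    return -errno;   // real poll() failure: errno is valid here
  if (r == 0)
    return -EAGAIN;  // timeout; assumption: EAGAIN is used only for illustration
  return 0;          // the descriptor reported events
}

// Hypothetical demo: wait on a pipe that never receives data.
int fixed_wait_on_idle_pipe() {
  int fds[2];
  if (pipe(fds) != 0)
    return -errno;
  int r = fixed_read_wait(fds[0], 10);
  close(fds[0]);
  close(fds[1]);
  return r;
}
```

With this mapping a timeout always yields a negative value, so a caller that checks for < 0 errors out instead of falling through to the read loop.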

Related issues 1 (0 open — 1 closed)