Bug #6492

closed

SignalHandler::entry spins the cpu and never sleeps

Added by Alan Somers over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Monitor
Target version: -
% Done: 0%
Source: Community (dev)
Severity: 3 - minor

Description

SignalHandler::entry shouldn't poll for POLLOUT, because it never actually writes to the pipes in question. Polling for POLLOUT causes poll(2) to immediately return, so the function spins the CPU and never blocks. At least, that's my experience with ceph-mon on FreeBSD. I don't know why it works on Linux. In any case, the attached patch removes POLLOUT. It fixes the problem on FreeBSD and doesn't break anything (AFAICT) on Linux. Tested on FreeBSD 9.1 amd64 and Ubuntu Server 13.04 amd64.

Signed-off-by: Alan Somers <>


Files

patch-src-global-signal_handler.cc (653 Bytes) Alan Somers, 10/08/2013 03:24 PM
Actions #1

Updated by Noah Watkins over 10 years ago

I believe this is also a bug I'm seeing on OSX.

Actions #2

Updated by Greg Farnum over 10 years ago

This is outside my area of expertise, but I'm looking at http://linux.die.net/man/3/poll and not seeing why POLLOUT should cause the behavior you're describing. Does its meaning vary on BSD systems, or can you tell me more about why it's a problem?
(I don't doubt that removing it removes the visible bug, but I want to understand the situation to make sure we aren't introducing more subtle issues!)

Actions #3

Updated by Alan Somers over 10 years ago

In signal_handler.cc at line 213, poll(2) returns immediately with POLLOUT set in revents for each of the struct pollfd entries, telling the process that each of those pipes may be written to without blocking. At lines 220 and 225 we try to read(2) from each of those pipes, but every read fails because none of them has any data available for reading. Then we loop and call poll(2) again. We didn't do anything to change the state of the pipes, so they are still available for writing, and so poll(2) returns immediately again.
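
The loop described above can be reproduced in isolation. What follows is a minimal C sketch, not the Ceph code: poll_idle_fd is a hypothetical helper, and it uses a socketpair(2) as a portable stand-in for a descriptor that is writable but has nothing to read, since the pipe itself is reported in this thread to behave differently on Linux and FreeBSD.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: poll one end of a fresh socketpair for `events`,
 * waiting at most `timeout_ms`. Nothing is ever written to the pair, so
 * the descriptor is writable but has no data to read.
 * Returns poll(2)'s result and stores the reported events in *revents. */
static int poll_idle_fd(short events, int timeout_ms, short *revents)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    struct pollfd pfd = { .fd = sv[0], .events = events, .revents = 0 };
    int r = poll(&pfd, 1, timeout_ms);
    *revents = pfd.revents;

    close(sv[0]);
    close(sv[1]);
    return r;
}
```

Called with POLLIN | POLLOUT, the helper should return 1 right away with POLLOUT set in revents, which is the condition that makes a poll/read loop spin. Called with POLLIN alone, it should return 0 once the timeout expires, which is the blocking behaviour expected after POLLOUT is removed.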

I don't know why the code works on Linux. Perhaps it's due to a difference in the way pipe buffering works? If FreeBSD allows a process to write to a pipe before any process has opened that pipe for reading, but Linux requires the pipe first be opened for reading, then that would explain the discrepancy. But that's just a guess, and not a very good one.

Actions #4

Updated by Noah Watkins over 10 years ago

I found that the daemons on OSX were spinning not because they were not blocking in poll, but because the unnamed semaphore used in CephContextServiceThread is borked on OSX. Namely, sem_init() always returns an error. To fix, I used the Mach semaphores. Not sure if this is a problem on FreeBSD, or not.
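
A failure like this is easy to miss when the return value of sem_init() is ignored. Here is a minimal sketch of a check, assuming only the POSIX semaphore API; probe_unnamed_semaphore is a hypothetical name and this is not the code from CephContextServiceThread.

```c
#include <errno.h>
#include <semaphore.h>

/* Hypothetical probe: try to create an unnamed, process-private semaphore
 * with an initial value of 0.
 * Returns 0 if sem_init() succeeds, otherwise the errno it reported. */
static int probe_unnamed_semaphore(void)
{
    sem_t sem;
    if (sem_init(&sem, 0, 0) != 0)
        return errno;   /* unnamed semaphores are not usable here */
    sem_destroy(&sem);
    return 0;
}
```

A nonzero result means any later wait on that semaphore cannot be trusted to block, so the caller needs a different primitive on that platform.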

Actions #5

Updated by Greg Farnum over 10 years ago

  • Status changed from New to Pending Backport
  • Assignee set to Ian Colle

Thanks Alan, that looks good to me, and going over the issue again I'm not sure why it works on Linux to begin with! Googling turned up a few people seeing similar strange behavior when using POLLOUT code across Linux and FreeBSD, but nothing directly analogous or with a clear solution/explanation.
But since, as you say, we never write to these FDs, and the selection looks to have been pretty random when switching from select() to poll() (I asked Yehuda, who reviewed the patch in question), I think we're good just ditching them. Patch applied to master in 6641273b19914a3af098bb3005724bec481e6ce3. Thanks!

Sage/Ian, as a FreeBSD fix is this something we want to backport anywhere?

Actions #6

Updated by Alan Somers over 10 years ago

Recent Ceph checkouts don't even compile on FreeBSD. Only the wip-port branch can compile. So I don't think there's any need to backport the fix to older Ceph versions.

Actions #7

Updated by Greg Farnum over 10 years ago

  • Status changed from Pending Backport to Resolved
  • Assignee deleted (Ian Colle)

Good enough.
