Bug #2596
mds: spinning on restart
Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Description
from ML:
On Fri, 15 Jun 2012, Amon Ott wrote:
> Hello all,
>
> I have seen this for a long time, but never investigated further. After stabl$
> test runs for several days, this is our last known show stopper before using
> Ceph in production. We are running 0.47.2 on 32 Bit.
>
> If we restart MDS (or all ceph daemons) on all nodes, one after another or al$
> together, they first recover and then the active one starts to spin with full
> cpu and does not answer any more. After a while, the next takes over, starts
> to spin, etc., until the whole cluster is unusable. This is completely
> reproducible and happens even without any active client.
>
> As expected, ceph -w shows lots of
> "2012-06-15 11:35:28.588775 mds e959: 1/1/1 up {0=3=up:active(laggy or
> crashed)}"
>
> It does not help to stop all services on all nodes for minutes or longer and
> to restart them - MDS will restart spinning. But: If we reboot the whole
> cluster, everything goes back to work.
>
> Today's MDS log is available at
> https://download.m-privacy.de/homeuser-mds.0.log.gz
>
> Is this a known problem? It has been with us for a looong time now, but since
> rebooting used to help, we never tracked it down.
History
#1 Updated by Amon Ott over 11 years ago
gdb is not helpful here; the process seems to be spinning in a syscall:
(gdb) thread apply all bt
Thread 1 (process 14820):
#0 0x55ba0422 in __kernel_vsyscall ()
#1 0x5574df4b in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
strace shows:
futex(0x16b44d1c, FUTEX_WAIT_PRIVATE, 1, NULL
So it seems that a futex deadlock occurred. It might be that it only happens if a client had been connected before.
Will try to get a full strace with spin tomorrow.
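A thread blocked in FUTEX_WAIT sleeps and uses no CPU, so the thread shown above may not be the one that is spinning. Both the backtrace and the strace output cover only a single thread. As a rough sketch for narrowing this down, the following script reads per-thread CPU time from /proc on Linux and lists the busiest threads first. The function names are made up for illustration and are not part of Ceph; the field positions follow the documented /proc/<pid>/task/<tid>/stat layout (utime is field 14, stime is field 15).

```python
#!/usr/bin/env python3
"""List the threads of a process by accumulated CPU time (Linux /proc only)."""
import os
import sys


def parse_stat_line(line):
    """Parse one /proc/<pid>/task/<tid>/stat line.

    Returns (comm, state, utime_ticks, stime_ticks). The comm field is
    wrapped in parentheses and may itself contain spaces or ')', so we
    split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # fields[0] is stat field 3 (state); utime and stime are fields 14 and 15.
    return comm, fields[0], int(fields[11]), int(fields[12])


def thread_cpu(pid):
    """Return [(tid, comm, state, cpu_seconds)], busiest thread first."""
    ticks_per_sec = os.sysconf("SC_CLK_TCK")
    task_dir = f"/proc/{pid}/task"
    rows = []
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                comm, state, utime, stime = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir() and the read
        rows.append((int(tid), comm, state, (utime + stime) / ticks_per_sec))
    return sorted(rows, key=lambda r: r[3], reverse=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 thread_cpu.py <pid of ceph-mds>
    for tid, comm, state, secs in thread_cpu(int(sys.argv[1])):
        print(f"{tid:>8} {state} {secs:10.2f}s {comm}")
```

Running it twice a few seconds apart and comparing the CPU seconds should show which thread ID is actually burning CPU. That ID can then be passed to strace -p, or strace -f -p can be given the process ID to attach to all threads at once.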
#2 Updated by Sage Weil about 11 years ago
- Status changed from New to Can't reproduce
#3 Updated by John Spray about 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.