Bug #14716

" 143: FAILED assert(status == 0)" in fs-hammer---basic-smithi

Added by Yuri Weinstein about 5 years ago. Updated over 4 years ago.

Won't Fix
Target version:
% Done:


3 - minor
Affected Versions:
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):



2016-02-09T14:17:25.520 INFO:tasks.ceph.mds.a.mira035.stderr:common/ In function 'int Thread::join(void**)' thread 7f1f3372e700 time 2016-02-09 22:17:25.504583
2016-02-09T14:17:25.520 INFO:tasks.ceph.mds.a.mira035.stderr:common/ 143: FAILED assert(status == 0)
2016-02-09T14:17:25.520 INFO:tasks.ceph.mds.a.mira035.stderr: ceph version 0.94.5-305-g7abb6ae (7abb6ae8f3cba67009bd022aaeee0a87cdfc6477)
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x94c72b]
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 2: (Thread::detach()+0) [0x938730]
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 3: (Finisher::stop()+0x9d) [0x87e6cd]
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 4: (MDS::suicide()+0x85) [0x59fa35]
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 5: (C_MDL_WriteError::finish(int)+0x65) [0x7da085]
2016-02-09T14:17:25.521 INFO:tasks.ceph.mds.a.mira035.stderr: 6: (MDSIOContextBase::complete(int)+0x81) [0x7c7971]
2016-02-09T14:17:25.522 INFO:tasks.ceph.mds.a.mira035.stderr: 7: (Finisher::finisher_thread_entry()+0x1a0) [0x87f0e0]
2016-02-09T14:17:25.522 INFO:tasks.ceph.mds.a.mira035.stderr: 8: (()+0x8182) [0x7f1f3b4b2182]
2016-02-09T14:17:25.522 INFO:tasks.ceph.mds.a.mira035.stderr: 9: (clone()+0x6d) [0x7f1f39c2147d]
2016-02-09T14:17:25.522 INFO:tasks.ceph.mds.a.mira035.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


#1 Updated by Yuri Weinstein about 5 years ago

  • Related to Bug #14697: mds: assert in SafeTimer while suiciding added

#2 Updated by Greg Farnum about 5 years ago

  • Related to deleted (Bug #14697: mds: assert in SafeTimer while suiciding)

#3 Updated by Greg Farnum about 5 years ago

This one's odd. The problem in #14697 is different; it's actually calling timer.shutdown() twice there. Here, that isn't happening. Moreover, the assert it's hitting on line 143 in Thread::join() doesn't match the backtrace, which claims to be in Thread::detach().

But both instances are happening when the Journaler gets ENOSPC. I wonder if MDS::suicide() is getting invoked twice? The finisher is the first one that gets join()ed in that process.

#4 Updated by Greg Farnum about 5 years ago

Nope, that's not it directly, we only propagate one at a time thanks to Journaler::handle_write_error().

This is apparently pretty easy to have recurring now? But the earliest failure I can find is The only commit between a passing run1 and that one that's obviously part of the FS stuff is the fsx qa script, which I don't think caused it. ;) There is also 2817ffcf4e57f92551b86388681fc0fe70c386ec in ReplicatedPG which changes the full behavior some; I wonder if that's broken the semantics or activated a new path in the MDS which is causing issues?


#6 Updated by Greg Farnum about 5 years ago

Well at least it's consistent. Can you also try commit:2c8e57934284dae0ae92d1aa0839a87092ec7c51 against smithi/mira?
If that passes, a commit bisect should tell us which patch broke stuff.

#7 Updated by Yuri Weinstein about 5 years ago

Greg the test passed on that commit.

Here is what I did for memory and make sure that it was good.


git clone
git checkout 2c8e57934284dae0ae92d1aa0839a87092ec7c51
git checkout -b wip-test-14716-2
git push origin wip-test-14716-2



filter="fs/recovery/{clusters/2-remote-clients.yaml debug/mds_client.yaml mounts/ceph-fuse.yaml tasks/mds-full.yaml}" 

#8 Updated by Yuri Weinstein about 5 years ago

Tests on 2817ffcf4e57f92551b86388681fc0fe70c386ec in ReplicatedPG commit failed all in similar way => .

2016-02-18T17:23:08.607 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603875 7f10be93c700 -1 mds.0.journaler(rw) _finish_flush got (28) No space left on device
2016-02-18T17:23:08.607 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603894 7f10be93c700 -1 mds.0.journaler(rw) handle_write_error (28) No space left on device
2016-02-18T17:23:08.608 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603918 7f10be93c700 -1 mds.0.log unhandled error (28) No space left on device, shutting down...
2016-02-18T17:23:08.609 INFO:tasks.ceph.mds.a.smithi051.stderr:common/ In function 'int Thread::join(void**)' thread 7f10be93c700 time 2016-02-19 01:23:08.603946
2016-02-18T17:23:08.609 INFO:tasks.ceph.mds.a.smithi051.stderr:common/ 143: FAILED assert(status == 0)
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: ceph version 0.94.5-243-g2817ffc (2817ffcf4e57f92551b86388681fc0fe70c386ec)
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x94c72b]
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: 2: (Thread::detach()+0) [0x938730]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 3: (Finisher::stop()+0x9d) [0x87e6cd]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 4: (MDS::suicide()+0x85) [0x59fa35]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 5: (C_MDL_WriteError::finish(int)+0x65) [0x7da085]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 6: (MDSIOContextBase::complete(int)+0x81) [0x7c7971]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 7: (Finisher::finisher_thread_entry()+0x1a0) [0x87f0e0]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 8: (()+0x8182) [0x7f10c66c0182]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 9: (clone()+0x6d) [0x7f10c4e2f47d]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#10 Updated by Yuri Weinstein about 5 years ago

  • Project changed from CephFS to Ceph

#11 Updated by Yuri Weinstein about 5 years ago

  • Project changed from Ceph to CephFS

corresponding issue #14824

#12 Updated by Greg Farnum about 5 years ago

  • Status changed from New to Won't Fix

This was a result of the OSD full handling changes, which got partly backported. I think the resolution we ended up at was "too bad"?

Also available in: Atom PDF