CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=656952016-02-09T23:59:10ZYuri Weinsteinyweinste@redhat.com
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-3 priority-4 priority-default closed" href="/issues/14697">Bug #14697</a>: mds: assert in SafeTimer while suiciding</i> added</li></ul> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=658072016-02-11T23:39:56ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Related to</strong> deleted (<i><a class="issue tracker-1 status-3 priority-4 priority-default closed" href="/issues/14697">Bug #14697</a>: mds: assert in SafeTimer while suiciding</i>)</li></ul> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=658092016-02-11T23:48:31ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>This one's odd. The problem in <a class="issue tracker-1 status-3 priority-4 priority-default closed" title="Bug: mds: assert in SafeTimer while suiciding (Resolved)" href="https://tracker.ceph.com/issues/14697">#14697</a> is different; it's actually calling timer.shutdown() twice there. Here, that isn't happening. Moreover, the assert it's hitting on line 143 in Thread::join() doesn't match the backtrace, which claims to be in Thread::detach().</p>
<p>But both instances are happening when the Journaler gets ENOSPC. I wonder if MDS::suicide() is getting invoked twice? The finisher is the first one that gets join()ed in that process.</p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=658112016-02-12T00:17:17ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>Nope, that's not it directly, we only propagate one at a time thanks to Journaler::handle_write_error().</p>
<p>This apparently reproduces pretty easily now? But the earliest failure I can find is <a class="external" href="http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-02-07_17:00:02-fs-hammer---basic-openstack/19350/">http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-02-07_17:00:02-fs-hammer---basic-openstack/19350/</a>. The only commit between a passing run<sup><a href="#fn1">1</a></sup> and that one that's obviously part of the FS stuff is the fsx qa script, which I don't think caused it. ;) There is also 2817ffcf4e57f92551b86388681fc0fe70c386ec in ReplicatedPG, which changes the full-handling behavior somewhat; I wonder if that's broken the semantics or activated a new path in the MDS that is causing issues?</p>
<p>[1]:http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-01-31_17:00:01-fs-hammer---basic-openstack/14718/</p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=659632016-02-16T17:31:54ZYuri Weinsteinyweinste@redhat.com
<ul></ul><p>Tried on different machines; all failed in a similar fashion:<br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-15_16:52:37-fs-hammer---basic-smithi/">http://pulpito.ceph.com/teuthology-2016-02-15_16:52:37-fs-hammer---basic-smithi/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-15_14:34:12-fs-hammer---basic-vps/">http://pulpito.ceph.com/teuthology-2016-02-15_14:34:12-fs-hammer---basic-vps/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-15_13:41:15-fs-hammer---basic-mira/">http://pulpito.ceph.com/teuthology-2016-02-15_13:41:15-fs-hammer---basic-mira/</a></p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=660432016-02-18T01:11:34ZGreg Farnumgfarnum@redhat.com
<ul></ul><p>Well at least it's consistent. Can you also try commit:2c8e57934284dae0ae92d1aa0839a87092ec7c51 against smithi/mira?<br />If that passes, a commit bisect should tell us which patch broke stuff.</p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=660962016-02-18T22:31:11ZYuri Weinsteinyweinste@redhat.com
<ul></ul><p>Greg, the test passed on that commit.</p>
<p><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-18_13:42:45-fs-wip-test-14716-2---basic-smithi/">http://pulpito.ceph.com/teuthology-2016-02-18_13:42:45-fs-wip-test-14716-2---basic-smithi/</a></p>
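<p>The commit bisect suggested above is a binary search over the commits between a known-good and a known-bad build. A minimal sketch follows; the function name, the commit labels, and the <code>is_bad</code> predicate are all hypothetical stand-ins (in practice the predicate would be "build this commit and run the failing teuthology job"), and it assumes every commit after the first bad one also fails.</p>

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad(commit) is true.

    Assumptions: commits are ordered oldest to newest, commits[0] is
    known good, commits[-1] is known bad, and once a commit is bad
    every later commit is bad too.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return commits[hi]


# Hypothetical history of eight commits where "c5" introduced the failure.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
```

<p>Each step halves the candidate range, so a range of N commits needs roughly log2(N) test runs rather than N.</p>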
<p>For the record, here is what I did to make sure the test run was valid.</p>
<p>Git:<br /><pre>
git clone https://github.com/ceph/ceph/
cd ceph
git checkout 2c8e57934284dae0ae92d1aa0839a87092ec7c51
git checkout -b wip-test-14716-2
git push origin wip-test-14716-2
</pre></p>
<p>Gitbuilders:</p>
<pre>
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-rpm-centos7-amd64-basic/rebuild.cgi?log=2c8e57934284dae0ae92d1aa0839a87092ec7c51
http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-deb-trusty-amd64-basic/rebuild.cgi?log=2c8e57934284dae0ae92d1aa0839a87092ec7c51
</pre>
<p>teuthology:<br /><pre>
filter="fs/recovery/{clusters/2-remote-clients.yaml debug/mds_client.yaml mounts/ceph-fuse.yaml tasks/mds-full.yaml}"
</pre></p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=661132016-02-19T16:01:51ZYuri Weinsteinyweinste@redhat.com
<ul></ul><p>Tests on commit 2817ffcf4e57f92551b86388681fc0fe70c386ec (the ReplicatedPG change) all failed in a similar way:<br /><pre>
2016-02-18T17:23:08.607 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603875 7f10be93c700 -1 mds.0.journaler(rw) _finish_flush got (28) No space left on device
2016-02-18T17:23:08.607 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603894 7f10be93c700 -1 mds.0.journaler(rw) handle_write_error (28) No space left on device
2016-02-18T17:23:08.608 INFO:tasks.ceph.mds.a.smithi051.stderr:2016-02-19 01:23:08.603918 7f10be93c700 -1 mds.0.log unhandled error (28) No space left on device, shutting down...
2016-02-18T17:23:08.609 INFO:tasks.ceph.mds.a.smithi051.stderr:common/Thread.cc: In function 'int Thread::join(void**)' thread 7f10be93c700 time 2016-02-19 01:23:08.603946
2016-02-18T17:23:08.609 INFO:tasks.ceph.mds.a.smithi051.stderr:common/Thread.cc: 143: FAILED assert(status == 0)
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: ceph version 0.94.5-243-g2817ffc (2817ffcf4e57f92551b86388681fc0fe70c386ec)
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x94c72b]
2016-02-18T17:23:08.610 INFO:tasks.ceph.mds.a.smithi051.stderr: 2: (Thread::detach()+0) [0x938730]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 3: (Finisher::stop()+0x9d) [0x87e6cd]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 4: (MDS::suicide()+0x85) [0x59fa35]
2016-02-18T17:23:08.611 INFO:tasks.ceph.mds.a.smithi051.stderr: 5: (C_MDL_WriteError::finish(int)+0x65) [0x7da085]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 6: (MDSIOContextBase::complete(int)+0x81) [0x7c7971]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 7: (Finisher::finisher_thread_entry()+0x1a0) [0x87f0e0]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 8: (()+0x8182) [0x7f10c66c0182]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: 9: (clone()+0x6d) [0x7f10c4e2f47d]
2016-02-18T17:23:08.612 INFO:tasks.ceph.mds.a.smithi051.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre></p>
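<p>One possible reading of the backtrace above: frame 7 is Finisher::finisher_thread_entry() and frame 3 is Finisher::stop(), so a finisher thread appears to be stopping a finisher from inside one of its own callbacks. If that is the same Finisher instance (an assumption; the trace alone does not prove it), the join in stop() targets the calling thread itself. With glibc, pthread_join() on the calling thread returns EDEADLK, which would trip assert(status == 0). The toy model below is not Ceph code; the class and function names are hypothetical, and it uses Python's threading module, which rejects a self-join with RuntimeError in the analogous situation.</p>

```python
import queue
import threading


class ToyFinisher:
    """Hypothetical stand-in for a finisher: one worker runs queued callbacks."""

    def __init__(self):
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self.errors = []

    def start(self):
        self._thread.start()

    def queue_callback(self, fn):
        self._queue.put(fn)

    def stop(self):
        # Ask the worker to exit, then wait for it to finish.
        self._queue.put(None)
        self._thread.join()  # raises RuntimeError if called on the worker itself

    def wait(self):
        # Join from outside the worker; this is the safe case.
        self._thread.join()

    def _run(self):
        while True:
            fn = self._queue.get()
            if fn is None:
                return
            try:
                fn()
            except RuntimeError as exc:
                # Record the failure instead of aborting, so it can be inspected.
                self.errors.append(str(exc))


def run_self_stop():
    """Queue a callback that stops the finisher from its own worker thread."""
    finisher = ToyFinisher()
    finisher.start()
    # Plays the role of an error handler that shuts everything down,
    # including the finisher that is currently running it.
    finisher.queue_callback(finisher.stop)
    finisher.wait()
    return finisher.errors
```

<p>Calling run_self_stop() returns a one-element list holding the self-join error. Where Python raises an exception, an unchecked nonzero return code followed by an assert aborts the process.</p>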
<p><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-18_17:17:37-fs-wip-test-14716-3---basic-smithi/">http://pulpito.ceph.com/teuthology-2016-02-18_17:17:37-fs-wip-test-14716-3---basic-smithi/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-18_16:58:42-fs-wip-test-14716-3---basic-vps/">http://pulpito.ceph.com/teuthology-2016-02-18_16:58:42-fs-wip-test-14716-3---basic-vps/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-18_16:55:48-fs-wip-test-14716-3---basic-vps/">http://pulpito.ceph.com/teuthology-2016-02-18_16:55:48-fs-wip-test-14716-3---basic-vps/</a></p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=661202016-02-19T16:49:05ZYuri Weinsteinyweinste@redhat.com
<ul></ul><p>Next, tried commit 951339103d35bc8ee2de880f77aada40d15b592a:</p>
<p>passed</p>
<p><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-19_10:07:15-fs-wip-test-14716-4---basic-smithi/">http://pulpito.ceph.com/teuthology-2016-02-19_10:07:15-fs-wip-test-14716-4---basic-smithi/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-19_10:06:33-fs-wip-test-14716-4---basic-vps/">http://pulpito.ceph.com/teuthology-2016-02-19_10:06:33-fs-wip-test-14716-4---basic-vps/</a><br /><a class="external" href="http://pulpito.ceph.com/teuthology-2016-02-19_10:41:01-fs-wip-test-14716-4---basic-mira/">http://pulpito.ceph.com/teuthology-2016-02-19_10:41:01-fs-wip-test-14716-4---basic-mira/</a></p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=661322016-02-19T19:39:56ZYuri Weinsteinyweinste@redhat.com
<ul><li><strong>Project</strong> changed from <i>CephFS</i> to <i>Ceph</i></li></ul> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=661332016-02-19T19:40:31ZYuri Weinsteinyweinste@redhat.com
<ul><li><strong>Project</strong> changed from <i>Ceph</i> to <i>CephFS</i></li></ul><p>corresponding issue <a class="issue tracker-9 status-6 priority-6 priority-high2 closed" title="Backport: hammer: rbd and pool quota do not go well together (Rejected)" href="https://tracker.ceph.com/issues/14824">#14824</a></p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=692382016-04-14T22:28:16ZGreg Farnumgfarnum@redhat.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Won't Fix</i></li></ul><p>This was a result of the OSD full handling changes, which got partly backported. I think the resolution we ended up at was "too bad"?</p> CephFS - Bug #14716: "Thread.cc: 143: FAILED assert(status == 0)" in fs-hammer---basic-smithihttps://tracker.ceph.com/issues/14716?journal_id=769482016-08-19T16:18:40ZYuri Weinsteinyweinste@redhat.com
<ul></ul><p>Same in hammer 0.94.8<br /><a class="external" href="http://qa-proxy.ceph.com/teuthology/yuriw-2016-08-18_20:11:00-fs-master---basic-smithi/373246/teuthology.log">http://qa-proxy.ceph.com/teuthology/yuriw-2016-08-18_20:11:00-fs-master---basic-smithi/373246/teuthology.log</a></p>