Ceph https://tracker.ceph.com/ 2013-06-24T09:38:26Z
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=23892 2013-06-24T09:38:26Z Ian Colle icolle@redhat.com
<ul><li><strong>Priority</strong> changed from <i>Normal</i> to <i>High</i></li></ul>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=23895 2013-06-24T09:45:52Z Josh Durgin
<ul><li><strong>Category</strong> set to <i>47</i></li></ul><p>If you look at the message from the first exception, it says the mds failed:</p>
<pre>
CommandFailedError: Command failed on 10.214.133.24 with status 1: '/home/ubuntu/cephtest/38877/enable-coredump ceph-coverage /home/ubuntu/cephtest/38877/archive/coverage sudo /home/ubuntu/cephtest/38877/daemon-helper kill ceph-mds -f -i b-s-a'
</pre>
<p>The bad object dereference might be a bug in the mds_thrash task, but the root cause here is an MDS crash.</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=23897 2013-06-24T10:05:46Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>Happened again<br /><pre>2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:joining mds_thrashers
2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:join thrasher for failure group [a, b-s-a]
2013-06-23T04:10:12.179 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x235f650>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 317, in task
    thrashers[t].do_join()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 107, in do_join
    self.thread.get()
  File "/home/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
TypeError: 'NoneType' object has no attribute '__getitem__'
</pre><br />/a/teuthology-2013-06-23_01:00:46-fs-master-testing-basic/43375/teuthology.log</p>
<p>(I think I didn't have the root error last time, as I see similar output farther down in this log — although no reference to the CommandFailedError.)</p>
<p>There aren't any core dumps, although there are MDS logs.</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=23898 2013-06-24T10:08:01Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>Josh, I went back and looked at the first instance (/a/teuthology-2013-06-18_01\:00\:37-fs-next-testing-basic/38877/) and I do see an MDS core dump there. That was caused by the assert in standby_trim_segments that we just fixed, so that may be a clue, but the NoneType issue is recurring without that problem.<br />At a quick guess, it's being thrown because the thrasher task has an empty list that it assumes has contents, probably from the mds map dump or something?</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=23902 2013-06-24T10:44:24Z Greg Farnum gfarnum@redhat.com
<ul></ul><p><a class="issue tracker-1 status-3 priority-5 priority-high3 closed" title="Bug: mds: segfault in MDLog::standby_trim_segments (Resolved)" href="https://tracker.ceph.com/issues/5333">#5333</a> is what I was referring to. There's a whole string of failures which are hitting both that and this.</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=24081 2013-06-28T15:24:07Z Josh Durgin
<ul></ul><p>I think this is just a symptom of the mds_thrasher crashing without logging the exception: the join happens before the mds_thrasher thread is run again, which triggers this bug.</p>
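<p>A minimal sketch of how a crash like this could be made visible at the point where it happens, rather than only when the thread is joined. This is not the actual mds_thrash code: <code>LoggedWorker</code> and <code>broken_thrash</code> are hypothetical names, and the sketch assumes the standard-library <code>threading</code> module in place of gevent so that it runs on its own.</p>

```python
import logging
import threading

log = logging.getLogger("mds_thrash_sketch")  # hypothetical logger name


class LoggedWorker(object):
    """Hypothetical stand-in for a thrasher: runs `target` in a thread,
    logs any exception where it occurs, and re-raises it on join."""

    def __init__(self, target):
        self._target = target
        self.exception = None
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        try:
            self._target()
        except Exception as exc:
            # Record the full traceback immediately, so the root cause is
            # in the log even if the join happens much later.
            log.exception("worker crashed")
            self.exception = exc

    def start(self):
        self._thread.start()

    def do_join(self):
        self._thread.join()
        if self.exception is not None:
            raise self.exception


def broken_thrash():
    # Assumption for illustration: a status lookup returned None.
    mds_map = None
    return mds_map["info"]  # subscripting None raises TypeError
```

<p>With this shape, joining a worker that ran <code>broken_thrash</code> still raises the <code>TypeError</code>, but the traceback from inside the worker has already been logged.</p>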
<p>If you add a bunch of logging to the mds_thrasher you might be able to find the root cause.</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=24082 2013-06-28T15:28:44Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>Yeah, I or somebody else will need to spend some time digging into this when we have some time free. There's another issue with the thrasher not turning off that I'm seeing too.<br />What led you to think the thrasher thread might be crashing?</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=24083 2013-06-28T17:32:38Z Josh Durgin
<ul></ul><p>IME that's what this kind of error from gevent/eventlet etc. means - once the thread exits in a certain abnormal way, it's no longer in the internal list of linked threads, so joining it fails.</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=28789 2013-10-21T22:33:03Z Greg Farnum gfarnum@redhat.com
<ul></ul><p>Still seeing this sometimes, for the record: /a/teuthology-2013-10-20_19:01:21-fs-dumpling-testing-basic-plana/61470/</p>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=31633 2014-02-03T06:51:52Z Sage Weil sage@newdream.net
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Resolved</i></li></ul>
CephFS - Bug #5411: teuthology: bad object dereference https://tracker.ceph.com/issues/5411?journal_id=74594 2016-07-13T05:51:51Z Greg Farnum gfarnum@redhat.com
<ul><li><strong>Component(FS)</strong> <i>MDS</i> added</li></ul>