Bug #5411
teuthology: bad object dereference
0%
Description
2013-06-18T04:21:39.336 INFO:teuthology.task.ceph:Checking for errors in any valgrind logs...
2013-06-18T04:21:39.337 DEBUG:teuthology.orchestra.run:Running [10.214.133.30]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq"
2013-06-18T04:21:39.340 DEBUG:teuthology.orchestra.run:Running [10.214.133.24]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq"
2013-06-18T04:21:39.376 DEBUG:teuthology.orchestra.run:Running [10.214.133.27]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq"
2013-06-18T04:21:39.379 INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump chdir-coredump valgrind.supp kcon_most...
2013-06-18T04:21:39.379 DEBUG:teuthology.orchestra.run:Running [10.214.133.30]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.389 DEBUG:teuthology.orchestra.run:Running [10.214.133.27]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.450 DEBUG:teuthology.orchestra.run:Running [10.214.133.24]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.456 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1dd1810>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-next/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 1100, in task
    yield
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/contextutil.py", line 35, in nested
    if exit(*exc):
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 908, in run_daemon
    teuthology.stop_daemons_of_type(ctx, type_)
  File "/home/teuthworker/teuthology-next/teuthology/misc.py", line 864, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 35, in stop
    run.wait([self.proc])
  File "/home/teuthworker/teuthology-next/teuthology/orchestra/run.py", line 281, in wait
    proc.exitstatus.get()
  File "/home/teuthworker/teuthology-next/virtualenv/local/lib/python2.7/site-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.133.24 with status 1: '/home/ubuntu/cephtest/38877/enable-coredump ceph-coverage /home/ubuntu/cephtest/38877/archive/coverage sudo /home/ubuntu/cephtest/38877/daemon-helper kill ceph-mds -f -i b-s-a'
2013-06-18T04:21:39.456 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1dd1450>
2013-06-18T04:21:39.456 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-next/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/home/teuthworker/teuthology-next/teuthology/task/install.py", line 735, in task
    yield
  File "/home/teuthworker/teuthology-next/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-next/teuthology/task/mds_thrash.py", line 317, in task
    thrashers[t].do_join()
  File "/home/teuthworker/teuthology-next/teuthology/task/mds_thrash.py", line 107, in do_join
    self.thread.get()
  File "/home/teuthworker/teuthology-next/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
TypeError: 'NoneType' object has no attribute '__getitem__'
This is a new issue; I'm not sure if it's caused by some valgrind check problem or something else, but I have the sad suspicion that we're getting back an empty list when it shouldn't be empty.
History
#1 Updated by Ian Colle over 10 years ago
- Priority changed from Normal to High
#2 Updated by Josh Durgin over 10 years ago
- Category set to 47
If you look at the message from the first exception, it says the mds failed:
CommandFailedError: Command failed on 10.214.133.24 with status 1: '/home/ubuntu/cephtest/38877/enable-coredump ceph-coverage /home/ubuntu/cephtest/38877/archive/coverage sudo /home/ubuntu/cephtest/38877/daemon-helper kill ceph-mds -f -i b-s-a'
The bad object dereference might be a bug in the mds_thrash task, but the root cause here is an MDS crash.
#3 Updated by Greg Farnum over 10 years ago
Happened again
2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:joining mds_thrashers
2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:join thrasher for failure group [a, b-s-a]
2013-06-23T04:10:12.179 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x235f650>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 317, in task
    thrashers[t].do_join()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 107, in do_join
    self.thread.get()
  File "/home/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
TypeError: 'NoneType' object has no attribute '__getitem__'
/a/teuthology-2013-06-23_01:00:46-fs-master-testing-basic/43375/teuthology.log
(I think I didn't have the root error last time, as I see similar output farther down in this log — although no reference to the CommandFailedError.)
There aren't any core dumps, although there are MDS logs.
#4 Updated by Greg Farnum over 10 years ago
Josh, I went back and looked at the first instance (/a/teuthology-2013-06-18_01\:00\:37-fs-next-testing-basic/38877/) and I do see an MDS core dump there. That was caused by the assert in standby_trim_segments that we just fixed, so that may be a clue, but the NoneType issue is recurring without that problem.
At a quick guess it's being thrown because the thrasher task has an empty list that it is assuming has contents, probably from the mds map dump or something?
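One detail that may help narrow the guess above: in Python 2.7, `'NoneType' object has no attribute '__getitem__'` is the message produced by subscripting `None`, whereas indexing an empty list raises `IndexError` instead. So the value being indexed is probably `None` rather than an empty list. A minimal sketch of the difference follows; `first_entry` and `failure_kind` are hypothetical names for illustration, not functions from mds_thrash.py.

```python
def first_entry(entries):
    """Return the first element (hypothetical stand-in for the thrasher's lookup)."""
    return entries[0]


def failure_kind(value):
    """Report which exception, if any, indexing `value` raises."""
    try:
        first_entry(value)
    except TypeError:
        # Subscripting None lands here. Python 2.7 words the message as
        # "'NoneType' object has no attribute '__getitem__'"; Python 3 says
        # "'NoneType' object is not subscriptable".
        return "TypeError"
    except IndexError:
        # An empty list lands here instead.
        return "IndexError"
    return "ok"
```

If that holds, the thing to look for is a call that returns `None` on some path (for example a lookup that finds nothing), not a list that comes back empty.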
#5 Updated by Greg Farnum over 10 years ago
#5333 is what I was referring to. There's a whole string of failures which are hitting both that and this.
#6 Updated by Josh Durgin over 10 years ago
I think this is just a symptom of the mds_thrasher crashing but not logging the exception, since this join happens before the mds_thrasher thread is run again, triggering this bug.
If you add a bunch of logging to the mds_thrasher you might be able to find the root cause.
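One way to add that logging, sketched here with the standard library only: wrap the function the thrasher thread runs so the full traceback is logged at the point of failure, before the exception propagates to whoever joins the thread. The logger name, the `log_exceptions` decorator, and `first_active_mds` are all assumptions for illustration; none of them come from teuthology.

```python
import functools
import logging

log = logging.getLogger("mds_thrash_sketch")  # hypothetical logger name


def log_exceptions(func):
    """Log the full traceback of any exception raised by func, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # log.exception records the traceback where the error happened,
            # so it is not lost when the exception resurfaces at join time.
            log.exception("unhandled exception in %s", func.__name__)
            raise
    return wrapper


@log_exceptions
def first_active_mds(mds_map):
    # Hypothetical helper: mds_map stands in for a parsed mds map dump.
    return mds_map["info"][0]
```

Calling `first_active_mds(None)` would log the traceback pointing at the failing subscript and then raise the same `TypeError`, so the join-time error is no longer the only record of the failure.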
#7 Updated by Greg Farnum over 10 years ago
Yeah, I or somebody else will need to spend some time digging into this when we have some time free. There's another issue with the thrasher not turning off that I'm seeing too.
What led you to think the thrasher thread might be crashing?
#8 Updated by Josh Durgin over 10 years ago
IME that's what this kind of error from gevent/eventlet etc. means - once the thread exits in a certain abnormal way, it's no longer in the internal list of linked threads, so joining it fails.
#9 Updated by Greg Farnum almost 10 years ago
Still seeing this sometimes, for the record: /a/teuthology-2013-10-20_19:01:21-fs-dumpling-testing-basic-plana/61470/
#10 Updated by Sage Weil over 9 years ago
- Status changed from New to Resolved
#11 Updated by Greg Farnum about 7 years ago
- Component(FS) MDS added