Bug #5411

closed

teuthology: bad object dereference

Added by Greg Farnum almost 11 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-06-18T04:21:39.336 INFO:teuthology.task.ceph:Checking for errors in any valgrind logs...
2013-06-18T04:21:39.337 DEBUG:teuthology.orchestra.run:Running [10.214.133.30]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq" 
2013-06-18T04:21:39.340 DEBUG:teuthology.orchestra.run:Running [10.214.133.24]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq" 
2013-06-18T04:21:39.376 DEBUG:teuthology.orchestra.run:Running [10.214.133.27]: "sudo grep -r '<kind>' /var/log/ceph/valgrind | sort | uniq" 
2013-06-18T04:21:39.379 INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump chdir-coredump valgrind.supp kcon_most...
2013-06-18T04:21:39.379 DEBUG:teuthology.orchestra.run:Running [10.214.133.30]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.389 DEBUG:teuthology.orchestra.run:Running [10.214.133.27]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.450 DEBUG:teuthology.orchestra.run:Running [10.214.133.24]: 'rm -rf -- /home/ubuntu/cephtest/38877/daemon-helper /home/ubuntu/cephtest/38877/enable-coredump /home/ubuntu/cephtest/38877/chdir-coredump /home/ubuntu/cephtest/38877/valgrind.supp /home/ubuntu/cephtest/38877/kcon_most'
2013-06-18T04:21:39.456 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x1dd1810>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-next/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 1100, in task
    yield
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/contextutil.py", line 35, in nested
    if exit(*exc):
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 908, in run_daemon
    teuthology.stop_daemons_of_type(ctx, type_)
  File "/home/teuthworker/teuthology-next/teuthology/misc.py", line 864, in stop_daemons_of_type
    daemon.stop()
  File "/home/teuthworker/teuthology-next/teuthology/task/ceph.py", line 35, in stop
    run.wait([self.proc])
  File "/home/teuthworker/teuthology-next/teuthology/orchestra/run.py", line 281, in wait
    proc.exitstatus.get()
  File "/home/teuthworker/teuthology-next/virtualenv/local/lib/python2.7/site-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed on 10.214.133.24 with status 1: '/home/ubuntu/cephtest/38877/enable-coredump ceph-coverage /home/ubuntu/cephtest/38877/archive/coverage sudo /home/ubuntu/cephtest/38877/daemon-helper kill ceph-mds -f -i b-s-a'
2013-06-18T04:21:39.456 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x1dd1450>
2013-06-18T04:21:39.456 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-next/teuthology/contextutil.py", line 27, in nested
    yield vars
  File "/home/teuthworker/teuthology-next/teuthology/task/install.py", line 735, in task
    yield
  File "/home/teuthworker/teuthology-next/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-next/teuthology/task/mds_thrash.py", line 317, in task
    thrashers[t].do_join()
  File "/home/teuthworker/teuthology-next/teuthology/task/mds_thrash.py", line 107, in do_join
    self.thread.get()
  File "/home/teuthworker/teuthology-next/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
TypeError: 'NoneType' object has no attribute '__getitem__'

This is a new issue. I'm not sure whether it comes from some valgrind check problem or something else, but I have the sad suspicion that we're getting an empty list back when it shouldn't be empty.

Actions #1

Updated by Ian Colle almost 11 years ago

  • Priority changed from Normal to High

Actions #2

Updated by Josh Durgin almost 11 years ago

  • Category set to 47

If you look at the message from the first exception, it says the mds failed:

CommandFailedError: Command failed on 10.214.133.24 with status 1: '/home/ubuntu/cephtest/38877/enable-coredump ceph-coverage /home/ubuntu/cephtest/38877/archive/coverage sudo /home/ubuntu/cephtest/38877/daemon-helper kill ceph-mds -f -i b-s-a'

The bad object dereference might be a bug in the mds_thrash task, but the root cause here is an MDS crash.

Actions #3

Updated by Greg Farnum almost 11 years ago

Happened again

2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:joining mds_thrashers
2013-06-23T04:10:12.179 INFO:teuthology.task.mds_thrash:join thrasher for failure group [a, b-s-a]
2013-06-23T04:10:12.179 ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x235f650>
Traceback (most recent call last):
  File "/home/teuthworker/teuthology-master/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 317, in task
    thrashers[t].do_join()
  File "/home/teuthworker/teuthology-master/teuthology/task/mds_thrash.py", line 107, in do_join
    self.thread.get()
  File "/home/teuthworker/teuthology-master/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
TypeError: 'NoneType' object has no attribute '__getitem__'

/a/teuthology-2013-06-23_01:00:46-fs-master-testing-basic/43375/teuthology.log

(I think I didn't have the root error last time, as I see similar output farther down in this log, although with no reference to the CommandFailedError.)

There aren't any core dumps, although there are MDS logs.

Actions #4

Updated by Greg Farnum almost 11 years ago

Josh, I went back and looked at the first instance (/a/teuthology-2013-06-18_01:00:37-fs-next-testing-basic/38877/) and I do see an MDS core dump there. That was caused by the assert in standby_trim_segments that we just fixed, so that may be a clue, but the NoneType issue is recurring without that problem.
At a quick guess it's being thrown because the thrasher task has an empty list that it is assuming has contents, probably from the mds map dump or something?
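
To make the guess above concrete, here is a minimal sketch of how this kind of TypeError arises. One detail worth noting: indexing an empty list raises IndexError, so the message in the traceback points at a value that is None being subscripted, not an empty list. On Python 2.7 the message reads "'NoneType' object has no attribute '__getitem__'"; Python 3 words it as "'NoneType' object is not subscriptable". The function names and the shape of the status dict are hypothetical and are not taken from mds_thrash.py.

```python
def pick_mds_unchecked(status):
    # Hypothetical helper: assumes `status` is always a dict.
    # If a lookup returned None instead, this subscript raises TypeError.
    return status["name"]


def pick_mds_checked(status):
    # Same lookup, but a missing status is reported explicitly
    # instead of surfacing later as an opaque TypeError.
    if status is None:
        raise RuntimeError("no MDS status returned; the daemon may have crashed")
    return status["name"]


if __name__ == "__main__":
    try:
        pick_mds_unchecked(None)
    except TypeError as exc:
        print("unchecked lookup failed:", exc)
    try:
        pick_mds_checked(None)
    except RuntimeError as exc:
        print("checked lookup failed:", exc)
```

A check like this would not fix whatever returns None, but it would turn the failure into a message that says which assumption broke.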

Actions #5

Updated by Greg Farnum almost 11 years ago

#5333 is what I was referring to. There's a whole string of failures which are hitting both that and this.

Actions #6

Updated by Josh Durgin almost 11 years ago

I think this is just a symptom of the mds_thrasher crashing, but not logging the exception since this join happens before the mds_thrasher thread is run again, triggering this bug.

If you add a bunch of logging to the mds_thrasher you might be able to find the root cause.
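
One way to add that logging, as a rough sketch only: wrap the thrasher's entry point so that any exception is logged with its traceback at the point where it happens, then re-raised. The decorator name, the logger name, and the `do_thrash_sketch` function are all hypothetical stand-ins, not the real mds_thrash code.

```python
import functools
import logging

# Hypothetical logger name, chosen for this sketch only.
log = logging.getLogger("mds_thrash_sketch")


def log_exceptions(func):
    """Log any exception raised by `func` with its traceback, then re-raise."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # log.exception records the message at ERROR level plus the traceback.
            log.exception("unhandled exception in %s", func.__name__)
            raise
    return wrapper


@log_exceptions
def do_thrash_sketch(status):
    # Stand-in for the thrasher body; fails if `status` is None.
    return status["name"]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        do_thrash_sketch(None)
    except TypeError:
        pass  # the traceback has already been written to the log
```

With this in place the log would show the line inside the thrasher that actually failed, rather than only the re-raise at the join.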

Actions #7

Updated by Greg Farnum almost 11 years ago

Yeah, I or somebody else will need to spend some time digging into this when we have some time free. There's another issue with the thrasher not turning off that I'm seeing too.
What led you to think the thrasher thread might be crashing?

Actions #8

Updated by Josh Durgin almost 11 years ago

IME that's what this kind of error from gevent/eventlet etc. means - once the thread exits in a certain abnormal way, it's no longer in the internal list of linked threads, so joining it fails.
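
The tracebacks in this ticket end in gevent's `get` with `raise self._exception`, which is consistent with the worker having died earlier and the joiner only seeing the stored exception. The sketch below shows the same pattern using the standard library's concurrent.futures as a stand-in; it is not gevent and not the teuthology code, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def thrash_once_sketch(status):
    # Stand-in for work done in the thrasher thread; fails if `status` is None.
    return status["name"]


def run_and_join(status):
    # Run the worker in another thread, then wait for its result.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(thrash_once_sketch, status)
        # result() re-raises, in the caller, whatever the worker raised.
        return future.result()


if __name__ == "__main__":
    print(run_and_join({"name": "b-s-a"}))
    try:
        run_and_join(None)
    except TypeError as exc:
        # The error surfaces at the join, not where the worker was started.
        print("join re-raised:", exc)
```

This is why the join in do_join is where the TypeError shows up, even though the faulty subscript happened inside the thrasher itself.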

Actions #9

Updated by Greg Farnum over 10 years ago

Still seeing this sometimes, for the record: /a/teuthology-2013-10-20_19:01:21-fs-dumpling-testing-basic-plana/61470/

Actions #10

Updated by Sage Weil about 10 years ago

  • Status changed from New to Resolved

Actions #11

Updated by Greg Farnum almost 8 years ago

  • Component(FS) MDS added