Analysis

When a teuthology job has failed, it needs to be analyzed and linked to an issue (in http://tracker.ceph.com/). A number of teuthology jobs inject random failures in the cluster to observe how it behaves. It is therefore not uncommon for the same job to fail on one run and succeed on another. A failed job is always a cause for concern, and if it only fails rarely it can be difficult to diagnose properly. Such jobs are sometimes scheduled a number of times to increase the chance that the problem shows up and to help with the diagnosis (that's one use of the --num option of teuthology-suite).
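
As a rough illustration of why scheduling a job several times helps, the sketch below estimates the chance of seeing a rare failure at least once. It assumes every run fails independently with the same probability p, which is a simplification (real runs may be correlated). The function names are hypothetical and not part of teuthology.

```python
import math


def chance_of_reproducing(p, n):
    """Probability that at least one of n runs fails.

    Assumes each run fails independently with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return 1.0 - (1.0 - p) ** n


def runs_needed(p, target=0.95):
    """Smallest number of runs giving at least `target` chance of one failure."""
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    # A job that fails one run in ten, scheduled ten times.
    print(round(chance_of_reproducing(0.1, 10), 3))
    # Runs needed for a 95% chance of seeing that failure at least once.
    print(runs_needed(0.1))
```

Under these assumptions, a job that fails one run in ten has roughly a 65% chance of failing at least once when scheduled ten times.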

Simple error matching

Deeper analysis

The error message that the teuthology job reports as the source of the problem is often uninformative, and deeper analysis is necessary.

  • Click All details... in the pulpito page to show the YAML file from which the job was created, then search (Control-f) for description to find the job description, which is the list of YAML files that were used to create the job. These files can be found at https://github.com/ceph/ceph-qa-suite/blob/firefly/suites (where firefly can be replaced by the stable release name).
  • Download the teuthology logs from the link provided by the pulpito page (for instance http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/teuthology.log)
  • Explore the logs and core dumps collected by teuthology. If the log is at http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/teuthology.log, the rest can be found by removing the teuthology.log part of the path, i.e. http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/
  • In the teuthology log, look for the first Traceback and examine the lines around it: this is where something first went wrong.
    2015-05-15T03:56:10.905 ERROR:teuthology.contextutil:Saw exception from nested tasks
    Traceback (most recent call last):
      File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
        yield vars
      File "/home/teuthworker/src/teuthology_master/teuthology/task/install.py", line 1298, in task
        yield
      File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 125, in run_tasks
        suppress = manager.__exit__(*exc_info)
      File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/thrashosds.py", line 183, in task
        thrash_proc.do_join()
      File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph_manager.py", line 356, in do_join
        self.thread.get()
      File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
        raise self._exception
    Exception: ceph-objectstore-tool: import failure with status 139
    
  • Examine the relevant OSD, MDS or MON logs. Developers use these logs on a daily basis to diagnose problems. They are not an easy read, but they can be relied on to contain the information needed to reconstruct the sequence of operations that led to a given problem.
  • Obtain a backtrace from the coredumps (see http://dachary.org/?p=3568 for a way to do that). This is usually not necessary because the backtrace can be found in the OSD, MDS or MON logs.
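
Two of the steps above, deriving the archive directory from the log URL and locating the first Traceback, can be sketched in Python. This is a minimal illustration, not part of teuthology: the helper names are hypothetical, and it assumes that regular log lines start with a timestamp such as 2015-05-15T03:56:10.905 (as in the excerpt above), so that a traceback ends just before the next timestamped line.

```python
import re

# Assumption: regular log lines start with a timestamp like
# "2015-05-15T03:56:10.905 ", while traceback lines do not.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+ ")
LOG_NAME = "teuthology.log"


def archive_url(log_url):
    """Return the job archive directory for a teuthology.log URL."""
    if not log_url.endswith("/" + LOG_NAME):
        raise ValueError("expected a URL ending in /" + LOG_NAME)
    return log_url[: -len(LOG_NAME)]


def first_traceback(lines, context=1):
    """Return the first traceback plus `context` lines before it.

    Returns an empty list when the log contains no traceback.
    """
    for i, line in enumerate(lines):
        if line.lstrip().startswith("Traceback (most recent call last):"):
            start = max(0, i - context)
            end = i + 1
            # Extend until the next timestamped (regular) log line.
            while end < len(lines) and not TIMESTAMP.match(lines[end].lstrip()):
                end += 1
            return lines[start:end]
    return []


if __name__ == "__main__":
    import sys

    # Usage: python first_traceback.py teuthology.log
    with open(sys.argv[1], errors="replace") as f:
        for line in first_traceback(f.read().splitlines()):
            print(line)
```

The line just before the Traceback is included by default because, as in the excerpt above, it usually carries the timestamp and the component that reported the exception.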

Tools