HOWTO forensic analysis of integration and upgrade tests » History » Version 11

Loïc Dachary, 05/15/2015 12:24 PM

h3. Analysis

When a teuthology job has failed, it needs to be analyzed and linked to an issue (in http://tracker.ceph.com/). A number of teuthology jobs inject random failures in the cluster to observe how it behaves. It is therefore not uncommon to see the same job fail on one run and succeed on another. A failed job is always a cause for concern, and if it fails only rarely it can be difficult to diagnose properly. Such jobs are sometimes scheduled a number of times to increase the chance that the problem shows up and to help with the diagnosis (that's one use of the **--num** option of **teuthology-suite**).

h4. Simple error matching

* For a given teuthology job, there is a pulpito page (for instance http://pulpito.ceph.com/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/)
* Search tracker.ceph.com for the error string to find existing issues. For instance http://pulpito.ceph.com/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/ reports *ceph-objectstore-tool: import failure with status 139* which has "a few issues associated with it":http://tracker.ceph.com/projects/ceph/search?utf8=%E2%9C%93&issues=1&q=ceph-objectstore-tool%3A+import+failure+with+status+139
* If an existing issue is found and one more occurrence looks like useful information, add a comment with a link to the failed job and the relevant quote from the logs.
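
The search link quoted above can also be built from the error string. The following is a minimal sketch, assuming the tracker keeps accepting the same query parameters as that link; *tracker_search_url* is a hypothetical helper name, not part of teuthology or of the tracker.

```python
from urllib.parse import urlencode

# Base of the search link quoted in the list above.
TRACKER_SEARCH = "http://tracker.ceph.com/projects/ceph/search"


def tracker_search_url(error_string):
    """Return an issue search URL for error_string (hypothetical helper)."""
    # Same parameters as the quoted link: utf8 check mark, issues only, query.
    query = urlencode({"utf8": "\u2713", "issues": 1, "q": error_string})
    return TRACKER_SEARCH + "?" + query


url = tracker_search_url("ceph-objectstore-tool: import failure with status 139")
print(url)
```

Pasting the printed URL in a browser should show the same search results as the link in the list.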

h4. Deeper analysis

The error message displayed by the teuthology job as the source of the problem is often uninformative, and deeper analysis is necessary.

* Click *All details...* in the pulpito page to show the YAML file from which the job was created, then use *Control-f description* to find the job description, which is the list of YAML files that were used to create the job. They can be found at https://github.com/ceph/ceph-qa-suite/blob/firefly/suites (where *firefly* can be replaced by the stable release name).
* Download the teuthology logs from the link provided by the pulpito page (for instance http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/teuthology.log)
* Explore the logs and core dumps collected by teuthology. If the log is at http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/teuthology.log the rest can be found by removing the teuthology.log part of the path, i.e. http://qa-proxy.ceph.com/teuthology/loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/888125/
* In the teuthology log, look for the first *Traceback* and read the lines around it: this is where something first went wrong.
<pre>
2015-05-15T03:56:10.905 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/home/teuthworker/src/teuthology_master/teuthology/task/install.py", line 1298, in task
    yield
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 125, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/thrashosds.py", line 183, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/src/ceph-qa-suite_firefly/tasks/ceph_manager.py", line 356, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
Exception: ceph-objectstore-tool: import failure with status 139
</pre>
* Examine the relevant OSD, MDS or MON logs. The logs are used on a daily basis by developers to figure out problems. They are not an easy read, but they can be relied on to contain the information needed to reconstruct the sequence of operations that led to a given problem.
* If the backtraces are not already in the OSD, MDS or MON logs (they usually are), obtain one from the core dumps (see http://dachary.org/?p=3568 for a way to do that).
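
Two of the steps above are mechanical and can be sketched in a few lines of Python: deriving the archive directory from the log URL, and locating the first *Traceback* in a downloaded log. This is only an illustration working on a list of log lines already in memory; *archive_url* and *first_traceback* are hypothetical helper names, and the amount of context to show is an arbitrary choice.

```python
def archive_url(log_url):
    """Return the job archive directory URL by dropping teuthology.log."""
    suffix = "teuthology.log"
    if not log_url.endswith(suffix):
        raise ValueError("expected a URL ending in " + suffix)
    return log_url[:-len(suffix)]


def first_traceback(log_lines, before=1, after=5):
    """Return the lines around the first 'Traceback', or [] if there is none."""
    for index, line in enumerate(log_lines):
        if "Traceback" in line:
            start = max(0, index - before)
            return log_lines[start:index + after + 1]
    return []


# Sample lines copied from the log excerpt above.
sample = [
    "2015-05-15T03:56:10.905 ERROR:teuthology.contextutil:Saw exception from nested tasks",
    "Traceback (most recent call last):",
    '  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested',
    "    yield vars",
]
context = first_traceback(sample, before=1, after=2)
print("\n".join(context))

log = ("http://qa-proxy.ceph.com/teuthology/"
       "loic-2015-05-13_00:58:29-rados-firefly-backports---basic-multi/"
       "888125/teuthology.log")
print(archive_url(log))
```

Keeping one line before the *Traceback* is deliberate: in the excerpt above, the timestamped *ERROR* line just before it tells when the exception was seen.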

h3. Tools

* https://github.com/jcsp/scrape/blob/master/scrape.py
** command line example:
<pre>
user@machine:~$ python ~/<scrape_dir>/scrape.py /a/<run_name>
</pre>
*** this generally works in all labs (sepia, octo, typica), since */a* exists in all of them