Bug #1744

teuthology: race with daemon shutdown?

Added by Sage Weil over 12 years ago. Updated about 12 years ago.

Status: Resolved
Priority: Normal
Assignee: Josh Durgin
Category: teuthology
Target version: -
% Done: 0%

Description


INFO:teuthology.task.cfuse:Unmounting ceph-fuse clients...
INFO:teuthology.task.cfuse.cfuse.0.err:ceph-fuse[24778]: fuse finished with error 0
INFO:teuthology.task.ceph:Shutting down mds daemons...
INFO:teuthology.task.ceph.mds.0:mds.0: Stopped
INFO:teuthology.task.ceph:Shutting down osd daemons...
INFO:teuthology.task.ceph:Shutting down mon daemons...
INFO:teuthology.task.ceph.mon.1:mon.1: Stopped
INFO:teuthology.task.ceph.mon.0:mon.0: Stopped
INFO:teuthology.task.ceph.mon.2:mon.2: Stopped
INFO:teuthology.task.ceph:Checking cluster ceph.log for badness...
INFO:teuthology.task.ceph:Cleaning ceph cluster...
INFO:teuthology.orchestra.run.err:rm: cannot remove `/tmp/cephtest/data/osd.0.data/current/meta': Directory not empty
INFO:teuthology.task.ceph:Removing ceph binaries...
INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump...
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2fa96d0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 1066, in task
    yield
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/contextutil.py", line 35, in nested
    if exit(*exc):
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 718, in cluster
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 223, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rm -rf -- /tmp/cephtest/ceph.conf /tmp/cephtest/ceph.keyring /tmp/cephtest/data /tmp/cephtest/monmap /tmp/cephtest/asok.*'
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
WARNING:teuthology.task.internal:Found coredumps on ubuntu@sepia69.ceph.dreamhost.com, flagging run as failed
INFO:teuthology.task.internal:Removing archive directory...
INFO:teuthology.task.internal:Tidying up after the test...
INFO:teuthology.orchestra.run.err:rmdir: failed to remove `/tmp/cephtest': Directory not empty
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2f08bd0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/internal.py", line 43, in base
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rmdir -- /tmp/cephtest'

Note that the osds should be stopped, but the rm -rf failed because a racing process was still creating files while it ran. /tmp/cephtest is left with:

ubuntu@sepia68:/tmp/cephtest$ find . -ls
4194311    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 .
4194522    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data
4194530    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data
4194542    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data/current
4194548    8 drwxr-xr-x   2 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data/current/meta
4194528    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.2__0_DAE19372
4194527    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.6__0_DAE19632
4194532    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.0__0_DAE19212
4194526    8 -rw-r--r--   1 ubuntu   ubuntu      61668 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.4__0_103B1A2E
4194524    4 -rw-r--r--   1 ubuntu   ubuntu      60696 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.1__0_103B043E
4194531    4 -rw-r--r--   1 ubuntu   ubuntu      68904 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.0__0_103B076E
4194523    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.1__0_DAE193A2
4194536    8 -rw-r--r--   1 ubuntu   ubuntu      74844 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.7__0_103B185E
4194533    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.5__0_DAE19162
4194534    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.3__0_DAE19002
4194537    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.7__0_DAE197C2
4194535    8 -rw-r--r--   1 ubuntu   ubuntu      74628 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.2__0_103B05CE
4194549    4 -rw-r--r--   1 ubuntu   ubuntu      59184 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.6__0_103B188E
4194541    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.4__0_DAE191D2
4194551    4 -rw-r--r--   1 ubuntu   ubuntu      61020 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.3__0_103B1A9E
4194553    4 -rw-r--r--   1 ubuntu   ubuntu      65016 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.5__0_103B1BFE

Is this because 'killall ceph-osd' was run but the process didn't actually stop immediately? Or some other reason? (Why don't we see a

INFO:teuthology.task.ceph.osd.0:osd.0: Stopped

line?)
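
For what it's worth, killall only delivers the signal and returns; it does not wait for the target to exit, so the subsequent rm -rf can race with an OSD that is still flushing pg logs. A minimal sketch of the stop-then-wait pattern (the helper name is illustrative, not teuthology's actual API):

import subprocess
import time

def stop_daemon_and_wait(name, timeout=60):
    # Send SIGTERM; killall returns as soon as the signal is
    # delivered, not when the target process exits.
    subprocess.call(['killall', '-TERM', name])
    deadline = time.time() + timeout
    while time.time() < deadline:
        # pgrep -x exits non-zero once no process with that name remains.
        if subprocess.call(['pgrep', '-x', name]) != 0:
            return True
        time.sleep(1)
    return False

# Only remove the data directory once the daemon has really exited;
# otherwise rm -rf races with the OSD still writing pg logs and
# fails with "Directory not empty".
if stop_daemon_and_wait('ceph-osd'):
    subprocess.call(['rm', '-rf', '--', '/tmp/cephtest/data'])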


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #1997: teuthology: wait for clean osd shutdown before umount (Resolved, Josh Durgin, 01/27/2012)

Actions #1

Updated by Sage Weil over 12 years ago

  • Position set to 11
Actions #2

Updated by Tommi Virtanen over 12 years ago

Was this using any one of the following?

teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py

Can I see the full config and log?

Actions #3

Updated by Sage Weil over 12 years ago

Tommi Virtanen wrote:

Was this using any one of the following?

teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py

Can I see the full config and log?

Sadly no, it was in my terminal. 90% sure it wasn't any of the above, though! YAML below:

interactive-on-error: true
roles:
- - mon.0
  - mds.0
  - osd.0
- - mon.1
  - osd.1
- - mon.2
  - client.0
tasks:
- ceph:
    conf:
      client.0:
        debug ms: 1
        debug client: 20
- cfuse: null
- workunit:
    all:
    - suites/dbench.sh

targets:
  ubuntu@sepia68.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4tEUrS50JgpGleCROSvvJmvFwbQScDxZwo5qqY9G/8+Gw92v7PMeBjWXA7Q+LTIq+pnzNKQxUeYTSpxYHDMAr1ZqaC9rfTvDMS6DQgMn+O0O23+mXDok8a3NW/S4id2bZQ3lORol4xRzkLqF7HF/YYzNlRv/RxwZifQE9iiIEt0vio17Y5Nw6c1wJBWbTjhI+dM4tHfIG9dDdOL/HmTveF1TqIRIiIJM+rtYeqj1P7eeyEU/0kwhwFgUU1JUCnv2Szf8ZtF4MWFewVl5G1sbowDMXk0NJfuOf4saFJ9gF2SdvUxXbuMUgQ1Edjgtek2HCOnBWTTw3t5+zn51S08wt
  ubuntu@sepia69.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9bW2x5nGRrB0hNpG2Cd5/94b3uheKlfjax0aF3uHFm93wLmpB6U2EcrAEATMAjV0lauR6CPppitoxgnUZX+8KyJ8yDu7TOYtUZvAyxDO09245FuxFRytGI9tyFOgfvYN7fSg2URpkLT1wWMVQ6NgzAV+oS/jsrKgAW66wrmLv61MMghKN6u7hLQnTF6iSod7RMMyvurCdxKu1gU37GEdp8wJH6y8f173MkfDOEVoojafUPycRB2X3dFn63raGYkhkXofwZB1CSEJEVU3Rnz/b+Hkc374Ij+vPd97Q7fIRaI4YE0ovEDDcG0WU++9VN7vI/wW0pc7DAiPun9SVpqM1
  ubuntu@sepia80.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCts9+jqfNcV/Be7XXh87aCtjWMf2tzSv4FFH1QuQ9QJnpwTQ9Nwk5dBnZKYGOPO0WYlY//3Be/IoCnCubH7+rCjsfluZ0o+wJVdKLazJG3KeiLP1LgJPB8J+emv4cCshTJbuZoFgeNzADdwF8C239ImnPzJqQdh2Av7aomnv8TcZFYwxjyMSBI/F4VQX4Ml8cWq2U1jnk6ft7cs1lZSe+74/glVjWpxBDcgMDFNkGUlLyPdWXxYUMdWHkayTvu8j4/AV8HQaFkbmLks/FTCjpYawCuETflkGl/WbTl00hARGHrPPJpDHmHIf6rEuwMMZ2Eh2Ggui3s9jhmRqlAyGVH

Actions #4

Updated by Anonymous over 12 years ago

Josh saw something similar; it seems the ctx.daemons data structure loses entries, or they never get added, or something. So far, reading the source & logs has just led to paradoxes.
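
One way to chase this down would be to log every mutation of the daemons table, so the run log shows exactly when an entry appears or vanishes. A rough sketch, assuming a dict-like registry (the class and logger names are illustrative, not the real ctx.daemons API):

import logging

log = logging.getLogger('teuthology.daemons')

class LoggingDaemonRegistry(dict):
    # Dict wrapper that records every add/remove, so a missing
    # "osd.0: Stopped" line can be traced to the moment the entry
    # was (or was not) registered.
    def __setitem__(self, key, value):
        log.info('registering daemon %s', key)
        dict.__setitem__(self, key, value)

    def __delitem__(self, key):
        log.info('removing daemon %s', key)
        dict.__delitem__(self, key)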

Actions #5

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #6

Updated by Sage Weil over 12 years ago

Any additional teuthology logging we can add to sort out what is happening?

Actions #7

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (70)
  • Position set to 98
Actions #8

Updated by Josh Durgin about 12 years ago

  • Status changed from New to Resolved
  • Assignee set to Josh Durgin

This turned out to be uncaught exceptions that weren't logged until later when daemons crashed. Fixed by 1364b8826f3f654fbbdddc8ba53dcec8ee038cf8.
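
For reference, the mechanism behind "weren't logged until later": each remote command's exit status lives in a gevent AsyncResult (visible in the tracebacks above, where gevent/event.py's get() re-raises self._exception), and an exception stored there stays invisible until something calls .get() on it. A minimal sketch:

from gevent.event import AsyncResult

# An exception stored on an AsyncResult is silent at the time it
# happens; nothing is printed or logged.
result = AsyncResult()
result.set_exception(RuntimeError('command failed mid-run'))

# Much later, e.g. during teardown, the stored exception finally
# surfaces when the result is collected.
try:
    result.get()
except RuntimeError as e:
    print('exception reported long after the fact: %s' % e)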
