Bug #1744: teuthology: race with daemon shutdown?
Status: Closed
Description
INFO:teuthology.task.cfuse:Unmounting ceph-fuse clients...
INFO:teuthology.task.cfuse.cfuse.0.err:ceph-fuse[24778]: fuse finished with error 0
INFO:teuthology.task.ceph:Shutting down mds daemons...
INFO:teuthology.task.ceph.mds.0:mds.0: Stopped
INFO:teuthology.task.ceph:Shutting down osd daemons...
INFO:teuthology.task.ceph:Shutting down mon daemons...
INFO:teuthology.task.ceph.mon.1:mon.1: Stopped
INFO:teuthology.task.ceph.mon.0:mon.0: Stopped
INFO:teuthology.task.ceph.mon.2:mon.2: Stopped
INFO:teuthology.task.ceph:Checking cluster ceph.log for badness...
INFO:teuthology.task.ceph:Cleaning ceph cluster...
INFO:teuthology.orchestra.run.err:rm: cannot remove `/tmp/cephtest/data/osd.0.data/current/meta': Directory not empty
INFO:teuthology.task.ceph:Removing ceph binaries...
INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump...
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2fa96d0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 1066, in task
    yield
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/contextutil.py", line 35, in nested
    if exit(*exc):
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 718, in cluster
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 223, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rm -rf -- /tmp/cephtest/ceph.conf /tmp/cephtest/ceph.keyring /tmp/cephtest/data /tmp/cephtest/monmap /tmp/cephtest/asok.*'
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
WARNING:teuthology.task.internal:Found coredumps on ubuntu@sepia69.ceph.dreamhost.com, flagging run as failed
INFO:teuthology.task.internal:Removing archive directory...
INFO:teuthology.task.internal:Tidying up after the test...
INFO:teuthology.orchestra.run.err:rmdir: failed to remove `/tmp/cephtest': Directory not empty
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2f08bd0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/internal.py", line 43, in base
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rmdir -- /tmp/cephtest'
Note that the osds should have been stopped, but the rm -rf failed because a racing process was creating files while it ran. /tmp/cephtest is left with:
ubuntu@sepia68:/tmp/cephtest$ find . -ls
4194311 4 drwxr-xr-x 3 ubuntu ubuntu  4096 Nov 20 19:13 .
4194522 4 drwxr-xr-x 3 ubuntu ubuntu  4096 Nov 20 19:13 ./data
4194530 4 drwxr-xr-x 3 ubuntu ubuntu  4096 Nov 20 19:13 ./data/osd.0.data
4194542 4 drwxr-xr-x 3 ubuntu ubuntu  4096 Nov 20 19:13 ./data/osd.0.data/current
4194548 8 drwxr-xr-x 2 ubuntu ubuntu  4096 Nov 20 19:13 ./data/osd.0.data/current/meta
4194528 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.2__0_DAE19372
4194527 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.6__0_DAE19632
4194532 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.0__0_DAE19212
4194526 8 -rw-r--r-- 1 ubuntu ubuntu 61668 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.4__0_103B1A2E
4194524 4 -rw-r--r-- 1 ubuntu ubuntu 60696 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.1__0_103B043E
4194531 4 -rw-r--r-- 1 ubuntu ubuntu 68904 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.0__0_103B076E
4194523 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.1__0_DAE193A2
4194536 8 -rw-r--r-- 1 ubuntu ubuntu 74844 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.7__0_103B185E
4194533 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.5__0_DAE19162
4194534 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.3__0_DAE19002
4194537 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.7__0_DAE197C2
4194535 8 -rw-r--r-- 1 ubuntu ubuntu 74628 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.2__0_103B05CE
4194549 4 -rw-r--r-- 1 ubuntu ubuntu 59184 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.6__0_103B188E
4194541 4 -rw-r--r-- 1 ubuntu ubuntu     8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.4__0_DAE191D2
4194551 4 -rw-r--r-- 1 ubuntu ubuntu 61020 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.3__0_103B1A9E
4194553 4 -rw-r--r-- 1 ubuntu ubuntu 65016 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.5__0_103B1BFE
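One defensive workaround for this kind of race (purely illustrative, not what teuthology actually does) is to retry the tree removal: shutil.rmtree fails with "Directory not empty" if a still-running daemon drops new files into a directory between the unlink pass and the rmdir, but a retry succeeds once the writer has stopped.

```python
import shutil
import time


def remove_tree_with_retry(path, attempts=5, delay=1.0):
    """Remove a directory tree, retrying on failure.

    If a daemon is still writing into the tree, removal can fail with
    'Directory not empty' exactly as in the log above; retrying after
    a short pause succeeds once the writer has exited.
    """
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == attempts - 1:
                raise  # still racing after all attempts; give up
            time.sleep(delay)
```

This only papers over the symptom, of course; the real question is why the osd was still running at cleanup time.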
Is this because 'killall ceph-osd' was run but the process didn't actually stop immediately? Or some other reason? (Why don't we see a
INFO:teuthology.task.ceph.osd.0:osd.0: Stopped
line?)
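For what it's worth, 'killall' only delivers the signal and returns immediately; it does not wait for the process to exit, so a straggling ceph-osd can keep flushing files into the data dir while the cleanup runs. A hypothetical helper (not teuthology code; names are made up) that waits for the process to actually be gone might look like:

```python
import time


def terminate_and_wait(proc, timeout=30.0, poll_interval=0.1):
    """Send SIGTERM to a subprocess.Popen, then poll until it has
    really exited.

    Merely sending the signal (as 'killall ceph-osd' does) returns
    at once; the daemon may keep writing for a while before it dies,
    which is enough to lose a race against 'rm -rf'.
    """
    proc.terminate()  # SIGTERM
    deadline = time.time() + timeout
    while time.time() < deadline:
        if proc.poll() is not None:  # process has actually exited
            return proc.returncode
        time.sleep(poll_interval)
    proc.kill()  # gave up waiting; escalate to SIGKILL
    return proc.wait()
```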
Updated by Sage Weil over 12 years ago
- Position set to 11
Updated by Anonymous over 12 years ago
Was this using any one of the following?
teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py
Can I see the full config and log?
Updated by Sage Weil over 12 years ago
Tommi Virtanen wrote:
Was this using any one of the following?
teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py
Can I see the full config and log?
Sadly no, it was in my terminal. 90% sure it wasn't any of the above, though! YAML below:
interactive-on-error: true
roles:
- - mon.0
  - mds.0
  - osd.0
- - mon.1
  - osd.1
- - mon.2
  - client.0
tasks:
- ceph:
    conf:
      client.0:
        debug ms: 1
        debug client: 20
- cfuse: null
- workunit:
    all:
    - suites/dbench.sh
targets:
  ubuntu@sepia68.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4tEUrS50JgpGleCROSvvJmvFwbQScDxZwo5qqY9G/8+Gw92v7PMeBjWXA7Q+LTIq+pnzNKQxUeYTSpxYHDMAr1ZqaC9rfTvDMS6DQgMn+O0O23+mXDok8a3NW/S4id2bZQ3lORol4xRzkLqF7HF/YYzNlRv/RxwZifQE9iiIEt0vio17Y5Nw6c1wJBWbTjhI+dM4tHfIG9dDdOL/HmTveF1TqIRIiIJM+rtYeqj1P7eeyEU/0kwhwFgUU1JUCnv2Szf8ZtF4MWFewVl5G1sbowDMXk0NJfuOf4saFJ9gF2SdvUxXbuMUgQ1Edjgtek2HCOnBWTTw3t5+zn51S08wt
  ubuntu@sepia69.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9bW2x5nGRrB0hNpG2Cd5/94b3uheKlfjax0aF3uHFm93wLmpB6U2EcrAEATMAjV0lauR6CPppitoxgnUZX+8KyJ8yDu7TOYtUZvAyxDO09245FuxFRytGI9tyFOgfvYN7fSg2URpkLT1wWMVQ6NgzAV+oS/jsrKgAW66wrmLv61MMghKN6u7hLQnTF6iSod7RMMyvurCdxKu1gU37GEdp8wJH6y8f173MkfDOEVoojafUPycRB2X3dFn63raGYkhkXofwZB1CSEJEVU3Rnz/b+Hkc374Ij+vPd97Q7fIRaI4YE0ovEDDcG0WU++9VN7vI/wW0pc7DAiPun9SVpqM1
  ubuntu@sepia80.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCts9+jqfNcV/Be7XXh87aCtjWMf2tzSv4FFH1QuQ9QJnpwTQ9Nwk5dBnZKYGOPO0WYlY//3Be/IoCnCubH7+rCjsfluZ0o+wJVdKLazJG3KeiLP1LgJPB8J+emv4cCshTJbuZoFgeNzADdwF8C239ImnPzJqQdh2Av7aomnv8TcZFYwxjyMSBI/F4VQX4Ml8cWq2U1jnk6ft7cs1lZSe+74/glVjWpxBDcgMDFNkGUlLyPdWXxYUMdWHkayTvu8j4/AV8HQaFkbmLks/FTCjpYawCuETflkGl/WbTl00hARGHrPPJpDHmHIf6rEuwMMZ2Eh2Ggui3s9jhmRqlAyGVH
Updated by Anonymous over 12 years ago
Josh saw something similar; it seems the ctx.daemons data structure loses entries, or they never get added, or something. So far, reading the source and logs has just led to paradoxes.
Updated by Sage Weil over 12 years ago
- Target version changed from v0.39 to v0.40
Updated by Sage Weil over 12 years ago
Is there any additional teuthology logging we can add to sort out what is happening?
Updated by Sage Weil over 12 years ago
- Target version deleted (v0.40)
- Position deleted (70)
- Position set to 98
Updated by Josh Durgin over 12 years ago
- Status changed from New to Resolved
- Assignee set to Josh Durgin
This turned out to be caused by uncaught exceptions that weren't logged until later, when the daemons crashed. Fixed by commit 1364b8826f3f654fbbdddc8ba53dcec8ee038cf8.
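For reference, the general shape of that failure mode, and the fix, is to log a background task's exception at the moment it happens instead of stashing it silently until someone calls .get() much later. The following is an illustrative sketch of that pattern using plain threads (the real teuthology code is gevent-based; class and function names here are invented):

```python
import logging
import threading

log = logging.getLogger(__name__)


class DaemonWatchdog(threading.Thread):
    """Run a daemon's wait-loop in a thread, logging any exception
    immediately rather than storing it until a later join/get call,
    which was the failure mode in this ticket."""

    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target
        self.exception = None

    def run(self):
        try:
            self.target()
        except Exception as e:
            self.exception = e  # still kept for the eventual caller
            log.exception("daemon failed")  # but surfaced right now
```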