Bug #18129: can't nuke mira097
Status: Closed
% Done: 0%
Regression: No
Severity: 3 - minor
Description
2016-12-02 17:23:03,301.301 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy
2016-12-02 17:23:03,328.328 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy
2016-12-02 17:23:03,329.329 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 93, in _purge_data
    'rm', '-rf', '--one-file-system', '--', '/var/lib/ceph',
  File "/home/yuriw/teuthology/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed on mira097 with status 1: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph"
2016-12-02 17:23:03,330.330 ERROR:teuthology.nuke:Could not nuke {u'mira097.front.sepia.ceph.com': u'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDG2dOJzrUSyviG/hX39DV044Im1qvLw+u9bdgGZCaTpJrl2up9lPP7tVKhRFQ2I9i0CRZG+7nZiWH4WicZt6+5tFaPyJpZZJQgyaS9DVd9mOdSVtFzTcMCStztr8ULWjf7MA+c385u5UbW6gxYfpAZIVTpGhIiPAUDOvFP8CNtDusGoB685pKs4vOtppLBEoiRjC78kJ3mAV1omOkTr52+38reZ22FchlaQpkmZXNJ0YM4KVNnMwXuYl5KJpnj3Z4dYcyJqDTSdVw373uoklZwQ11ogRMmQISQzaBjfIbOx8bqSrlm5iMSwvYtDC3V3VjTG0jE5japdK1HNl7g2WkH'}
Traceback (most recent call last):
  File "/home/yuriw/teuthology/teuthology/nuke/__init__.py", line 281, in nuke_one
    nuke_helper(ctx, should_unlock)
  File "/home/yuriw/teuthology/teuthology/nuke/__init__.py", line 339, in nuke_helper
    remove_ceph_data(ctx)
  File "/home/yuriw/teuthology/teuthology/nuke/actions.py", line 332, in remove_ceph_data
    install_task.purge_data(ctx)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 67, in purge_data
    p.spawn(_purge_data, remote)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 93, in _purge_data
    'rm', '-rf', '--one-file-system', '--', '/var/lib/ceph',
  File "/home/yuriw/teuthology/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed on mira097 with status 1: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph"
2016-12-02 17:23:03,331.331 ERROR:teuthology.nuke:Could not nuke the following targets:
targets:
  mira097.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDG2dOJzrUSyviG/hX39DV044Im1qvLw+u9bdgGZCaTpJrl2up9lPP7tVKhRFQ2I9i0CRZG+7nZiWH4WicZt6+5tFaPyJpZZJQgyaS9DVd9mOdSVtFzTcMCStztr8ULWjf7MA+c385u5UbW6gxYfpAZIVTpGhIiPAUDOvFP8CNtDusGoB685pKs4vOtppLBEoiRjC78kJ3mAV1omOkTr52+38reZ22FchlaQpkmZXNJ0YM4KVNnMwXuYl5KJpnj3Z4dYcyJqDTSdVw373uoklZwQ11ogRMmQISQzaBjfIbOx8bqSrlm5iMSwvYtDC3V3VjTG0jE5japdK1HNl7g2WkH
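The failure mode above is consistent with /var/lib/ceph being a mount point: a directory that a filesystem is mounted on cannot be unlinked (the kernel returns EBUSY), so `rm -rf` can clear its contents but never remove the directory itself. A minimal sketch of a mount-point check, assuming Linux's /proc/self/mounts is available; the `is_mountpoint` helper name is hypothetical, not part of teuthology:

```shell
# Hypothetical helper: succeed (exit 0) iff $1 is a mount point,
# by looking for it in the second field of /proc/self/mounts.
is_mountpoint() {
    awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' /proc/self/mounts
}

# "/" is always a mount point; an arbitrary unused path is not.
if is_mountpoint /; then
    echo "/ is a mount point; rm -rf on it would fail with EBUSY"
fi
if ! is_mountpoint /no-such-mount-point; then
    echo "/no-such-mount-point is not a mount point"
fi
```

A check like this before the `rm -rf --one-file-system` call would let cleanup umount the directory first (or skip unlinking it) instead of failing with "Device or resource busy".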
Updated by Yuri Weinstein over 7 years ago
Updated by Zack Cerza over 7 years ago
2016-12-02T11:04:12.844 ERROR:teuthology.task.internal:Host mira097 has stale /var/lib/ceph, check lock and nuke/cleanup.
[...]
2016-12-02T11:04:14.087 INFO:teuthology.nuke.actions:Unmounting ceph-fuse and killing daemons...
[...]
2016-12-02T11:07:26.999 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy
This node shouldn't have been unlocked in the first place. I do see the attempt to kill daemons as well. There's definitely a bug here, preventing the daemons from dying.
Updated by Zack Cerza over 7 years ago
After a manual nuke failed:
[root@mira097 ~]# mount | grep ceph
/dev/mapper/all-ceph on /var/lib/ceph type ext4 (rw,relatime,seclabel,data=ordered)
[root@mira097 ~]# grep ceph /etc/fstab
/dev/all/ceph /var/lib/ceph ext4 defaults 1 1
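Given that fstab entry, /var/lib/ceph is its own ext4 filesystem on this node, so unlinking the directory can never succeed while it is mounted. One hedged fix is for cleanup to empty the directory rather than remove it, which works whether or not the path is a mount point. A sketch using a stand-in path (the real teuthology purge code may differ):

```shell
# Stand-in for /var/lib/ceph so the sketch is safe to run anywhere.
TARGET="${TMPDIR:-/tmp}/fake-var-lib-ceph"
mkdir -p "$TARGET/osd/ceph-0"
touch "$TARGET/osd/ceph-0/keyring"

# Delete the *contents* depth-first but keep the directory itself;
# -mindepth 1 excludes $TARGET from deletion.
find "$TARGET" -mindepth 1 -delete

ls -A "$TARGET"    # directory still exists, but is now empty
```

On the real node this would be preceded by umounting anything mounted below the target, as the existing nuke command already attempts with `find ... -exec umount '{}' ';'`.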
Updated by Zack Cerza over 7 years ago
- Assignee changed from David Galloway to Loïc Dachary
From the paddles log:
paddles.out.log.4.gz:2016-11-28 07:29:27,685 INFO [paddles.controllers.nodes] Unlocked <Node mira097.front.sepia.ceph.com> for mgolub with description /home/teuthworker/archive/trociny-2016-11-28_07:01:17-rbd-wip-mgolub-testing-jewel---basic-mira/582025
paddles.out.log.4.gz:2016-11-28 09:30:31,828 INFO [paddles.controllers.nodes] Locked <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:02:34,636 DEBUG [paddles.controllers.nodes] Unlocking <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:02:34,636 INFO [paddles.controllers.nodes] Unlocked <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:04:10,848 INFO [paddles.controllers.nodes] Locked <Node mira097.front.sepia.ceph.com> <Node mira100.front.sepia.ceph.com> for scheduled_kchai@teuthology with description /home/teuthworker/archive/kchai-2016-12-02_07:05:23-rados-wip-kefu-testing---basic-mira/594772
So the machine was used for a job, which passed; then Loïc locked it, and the first job that picked it up after it was unlocked failed because of a stale /var/lib/ceph. At this point I think we need to know what Loïc did to it :-/
Updated by Loïc Dachary over 7 years ago
I f**ed up the machine big time, in so many ways I can't remember. My intentions were good: figuring out how ceph-disk@.service and ceph-osd@.service could race.
Updated by Zack Cerza over 7 years ago
- Status changed from New to Resolved
- Assignee changed from Loïc Dachary to Zack Cerza
Okay. I'll reimage...