Bug #18129 (closed): can't nuke mira097

Added by Yuri Weinstein over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: Zack Cerza
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

2016-12-02 17:23:03,301.301 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy
2016-12-02 17:23:03,328.328 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy
2016-12-02 17:23:03,329.329 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 93, in _purge_data
    'rm', '-rf', '--one-file-system', '--', '/var/lib/ceph',
  File "/home/yuriw/teuthology/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed on mira097 with status 1: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph" 
2016-12-02 17:23:03,330.330 ERROR:teuthology.nuke:Could not nuke {u'mira097.front.sepia.ceph.com': u'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDG2dOJzrUSyviG/hX39DV044Im1qvLw+u9bdgGZCaTpJrl2up9lPP7tVKhRFQ2I9i0CRZG+7nZiWH4WicZt6+5tFaPyJpZZJQgyaS9DVd9mOdSVtFzTcMCStztr8ULWjf7MA+c385u5UbW6gxYfpAZIVTpGhIiPAUDOvFP8CNtDusGoB685pKs4vOtppLBEoiRjC78kJ3mAV1omOkTr52+38reZ22FchlaQpkmZXNJ0YM4KVNnMwXuYl5KJpnj3Z4dYcyJqDTSdVw373uoklZwQ11ogRMmQISQzaBjfIbOx8bqSrlm5iMSwvYtDC3V3VjTG0jE5japdK1HNl7g2WkH'}
Traceback (most recent call last):
  File "/home/yuriw/teuthology/teuthology/nuke/__init__.py", line 281, in nuke_one
    nuke_helper(ctx, should_unlock)
  File "/home/yuriw/teuthology/teuthology/nuke/__init__.py", line 339, in nuke_helper
    remove_ceph_data(ctx)
  File "/home/yuriw/teuthology/teuthology/nuke/actions.py", line 332, in remove_ceph_data
    install_task.purge_data(ctx)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 67, in purge_data
    p.spawn(_purge_data, remote)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 83, in __exit__
    for result in self:
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/yuriw/teuthology/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/yuriw/teuthology/teuthology/task/install/__init__.py", line 93, in _purge_data
    'rm', '-rf', '--one-file-system', '--', '/var/lib/ceph',
  File "/home/yuriw/teuthology/teuthology/orchestra/remote.py", line 192, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 403, in run
    r.wait()
  File "/home/yuriw/teuthology/teuthology/orchestra/run.py", line 166, in wait
    label=self.label)
CommandFailedError: Command failed on mira097 with status 1: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph" 
2016-12-02 17:23:03,331.331 ERROR:teuthology.nuke:Could not nuke the following targets:
targets:
  mira097.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDG2dOJzrUSyviG/hX39DV044Im1qvLw+u9bdgGZCaTpJrl2up9lPP7tVKhRFQ2I9i0CRZG+7nZiWH4WicZt6+5tFaPyJpZZJQgyaS9DVd9mOdSVtFzTcMCStztr8ULWjf7MA+c385u5UbW6gxYfpAZIVTpGhIiPAUDOvFP8CNtDusGoB685pKs4vOtppLBEoiRjC78kJ3mAV1omOkTr52+38reZ22FchlaQpkmZXNJ0YM4KVNnMwXuYl5KJpnj3Z4dYcyJqDTSdVw373uoklZwQ11ogRMmQISQzaBjfIbOx8bqSrlm5iMSwvYtDC3V3VjTG0jE5japdK1HNl7g2WkH
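
For context when reading the failure: the quoted command runs three steps in order. It tries "rm -rf --one-file-system -- /var/lib/ceph", then, if the directory survives, runs umount on every directory one or two levels below it, then retries the rm. Nothing in that sequence ever unmounts /var/lib/ceph itself, so if that path is an active mountpoint the final rm still fails with "Device or resource busy", because a mountpoint directory cannot be unlinked. A minimal check, as a plain-Python sketch rather than anything from teuthology:

import os

# If this prints True, the purge command above cannot succeed: the
# "find ... -exec umount" pass only clears mounts *below* the directory,
# and rm(1) cannot unlink a directory that is an active mountpoint.
print(os.path.ismount("/var/lib/ceph"))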

Actions #2

Updated by Zack Cerza over 7 years ago

2016-12-02T11:04:12.844 ERROR:teuthology.task.internal:Host mira097 has stale /var/lib/ceph, check lock and nuke/cleanup.
[...]
2016-12-02T11:04:14.087 INFO:teuthology.nuke.actions:Unmounting ceph-fuse and killing daemons...
[...]
2016-12-02T11:07:26.999 INFO:teuthology.orchestra.run.mira097.stderr:rm: cannot remove ‘/var/lib/ceph’: Device or resource busy

This node shouldn't have been unlocked in the first place. I do see the attempt to kill daemons as well. There's definitely a bug here that's preventing the daemons from dying.
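
One way to test the stuck-daemon theory is to list the processes keeping the mount busy, e.g. with "sudo fuser -vm /var/lib/ceph". A rough pure-Python equivalent, as a hypothetical diagnostic rather than anything in teuthology:

import os

def mount_holders(mountpoint="/var/lib/ceph"):
    """Return PIDs whose cwd or open fds point under mountpoint.
    Run as root, or other users' processes cannot be inspected."""
    pids = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        proc = os.path.join("/proc", pid)
        try:
            paths = [os.readlink(os.path.join(proc, "cwd"))]
            fd_dir = os.path.join(proc, "fd")
            paths += [os.readlink(os.path.join(fd_dir, fd))
                      for fd in os.listdir(fd_dir)]
        except OSError:
            continue  # process already exited, or permission denied
        if any(p == mountpoint or p.startswith(mountpoint + "/")
               for p in paths):
            pids.append(int(pid))
    return pids

if __name__ == "__main__":
    print(mount_holders())

(As the next comment shows, the culprit here turned out to be the mount itself rather than a lingering daemon.)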

Actions #3

Updated by Zack Cerza over 7 years ago

After a manual nuke also failed:

[root@mira097 ~]# mount | grep ceph
/dev/mapper/all-ceph on /var/lib/ceph type ext4 (rw,relatime,seclabel,data=ordered)
[root@mira097 ~]# grep ceph /etc/fstab
/dev/all/ceph /var/lib/ceph ext4 defaults 1 1
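
That output explains the EBUSY: /var/lib/ceph is not a plain directory but a separate ext4 filesystem mounted from /dev/mapper/all-ceph, with a matching fstab entry, and the nuke command only unmounts directories below it, never the mountpoint itself. Below is a sketch of a purge sequence that handles this case, offered as an illustration under that assumption and not as the actual teuthology fix:

import os
import subprocess

def purge_mounted_ceph_data(path="/var/lib/ceph"):
    # Hypothetical helper, not part of teuthology.
    if not os.path.isdir(path):
        return
    # Clear anything mounted below the data directory, mirroring the
    # "find ... -exec umount" step of the failing nuke command.
    subprocess.call(["sudo", "find", path, "-mindepth", "1",
                     "-maxdepth", "2", "-type", "d",
                     "-exec", "umount", "{}", ";"])
    # The missing step: unmount the directory itself if it is a mount.
    if os.path.ismount(path):
        subprocess.check_call(["sudo", "umount", path])
    # Only now can the directory entry actually be removed.
    subprocess.check_call(["sudo", "rm", "-rf",
                           "--one-file-system", "--", path])

Whether such a node should keep the fstab mount and only have its contents wiped, or be rebuilt outright, is a separate question; per comment #6, reimaging is what happened here.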

Actions #4

Updated by Zack Cerza over 7 years ago

  • Assignee changed from David Galloway to Loïc Dachary

From the paddles log:

paddles.out.log.4.gz:2016-11-28 07:29:27,685 INFO  [paddles.controllers.nodes] Unlocked <Node mira097.front.sepia.ceph.com> for mgolub with description /home/teuthworker/archive/trociny-2016-11-28_07:01:17-rbd-wip-mgolub-testing-jewel---basic-mira/582025
paddles.out.log.4.gz:2016-11-28 09:30:31,828 INFO  [paddles.controllers.nodes] Locked <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:02:34,636 DEBUG [paddles.controllers.nodes] Unlocking <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:02:34,636 INFO  [paddles.controllers.nodes] Unlocked <Node mira097.front.sepia.ceph.com> for loic@dachary.org
paddles.out.log:2016-12-02 11:04:10,848 INFO  [paddles.controllers.nodes] Locked <Node mira097.front.sepia.ceph.com> <Node mira100.front.sepia.ceph.com> for scheduled_kchai@teuthology with description /home/teuthworker/archive/kchai-2016-12-02_07:05:23-rados-wip-kefu-testing---basic-mira/594772

So the machine was used for a job, which passed; then Loïc locked it, and the first job that picked it up after it was unlocked failed because of a stale /var/lib/ceph. At this point I think we need to know what Loïc did to it :-/

Actions #5

Updated by Loïc Dachary over 7 years ago

I f**ed up the machine big time, in more ways than I can remember. My intentions were good: I was trying to figure out how ceph-disk@.service and ceph-osd@.service could race.

Actions #6

Updated by Zack Cerza over 7 years ago

  • Status changed from New to Resolved
  • Assignee changed from Loïc Dachary to Zack Cerza

Okay. I'll reimage...
