Bug #53010

cephadm rm-cluster does not clean up /var/run/ceph

Added by Laura Flores about 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Category:
cephadm (binary)
Target version:
-
% Done:
0%

Source:
Tags:
low-hanging-fruit
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh']

This API test failure has been occurring across PR Jenkins builds. It shows up sporadically: some PRs pass the test while others fail, and the failure does not appear to be related to any changes in the PRs where it occurs.

Below I have copied the Python traceback along with some of the surrounding output to provide more context.

Collecting pytz
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Installing collected packages: more-itertools, pytz, jaraco.functools, tempora, repoze.lru, portend, idna, cheroot, chardet, Routes, requests, pyopenssl, PyJWT, CherryPy, ceph, bcrypt
  Attempting uninstall: idna
    Found existing installation: idna 3.3
    Uninstalling idna-3.3:
      Successfully uninstalled idna-3.3
  Attempting uninstall: requests
    Found existing installation: requests 2.26.0
    Uninstalling requests-2.26.0:
      Successfully uninstalled requests-2.26.0
  Attempting uninstall: PyJWT
    Found existing installation: PyJWT 2.3.0
    Uninstalling PyJWT-2.3.0:
      Successfully uninstalled PyJWT-2.3.0
  Running setup.py develop for ceph
  Attempting uninstall: bcrypt
    Found existing installation: bcrypt 3.2.0
    Uninstalling bcrypt-3.2.0:
      Successfully uninstalled bcrypt-3.2.0
Successfully installed CherryPy-13.1.0 PyJWT-2.0.1 Routes-2.4.1 bcrypt-3.1.4 ceph-1.0.0 chardet-4.0.0 cheroot-8.5.2 idna-2.10 jaraco.functools-3.3.0 more-itertools-4.1.0 portend-3.0.0 pyopenssl-21.0.0 pytz-2021.3 repoze.lru-0.7 requests-2.25.1 tempora-4.1.2
/tmp/tmp.mAQFsq4fJ8
Processing /home/jenkins-build/.cache/pip/wheels/d8/81/0a/fae9efd3c9c706cefa25842310896e727a46567f2dc2dac6a8/coverage-4.5.2-cp38-cp38-linux_x86_64.whl
Installing collected packages: coverage
Successfully installed coverage-4.5.2
Cannot find device "ceph-brx" 
2021-10-21 14:21:10,353.353 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-21 14:21:10,354.354 INFO:__main__:
tearing down the cluster...
rm: cannot remove '/var/run/ceph': Permission denied
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']
Traceback (most recent call last):
  File "../qa/tasks/vstart_runner.py", line 1522, in <module>
    exec_test()
  File "../qa/tasks/vstart_runner.py", line 1357, in exec_test
    teardown_cluster()
  File "../qa/tasks/vstart_runner.py", line 1091, in teardown_cluster
    remote.run(args=[os.path.join(SRC_PREFIX, "stop.sh")], timeout=60)
  File "../qa/tasks/vstart_runner.py", line 410, in run
    return self._do_run(**kwargs)
  File "../qa/tasks/vstart_runner.py", line 478, in _do_run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 221, in wait
    raise CommandFailedError(self.args, self.exitstatus)
teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh']
find: ‘/home/jenkins-build/build/workspace/ceph-api/build/out’: No such file or directory
Sample run:

Related issues

Related to Orchestrator - Bug #46655: cephadm rm-cluster: Systemd ceph.target not deleted Resolved
Related to Orchestrator - Bug #44669: cephadm: rm-cluster should clean up /etc/ceph Resolved
Related to Orchestrator - Feature #53815: cephadm rm-cluster should delete log files Resolved
Related to Orchestrator - Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) Resolved
Related to Orchestrator - Bug #54142: quincy cephadm-purge-cluster needs work Resolved

History

#1 Updated by Laura Flores about 1 year ago

The issue seems to occur during a "tearing down the cluster..." step.

Successful API test run:

2021-10-19 21:36:10,384.384 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-19 21:36:10,384.384 INFO:__main__:
tearing down the cluster...
2021-10-19 21:36:12,050.050 INFO:__main__:
ceph cluster torn down
2021-10-19 21:36:12,059.059 INFO:__main__:
running vstart.sh now...
2021-10-19 21:37:08,783.783 INFO:__main__:
vstart.sh finished running
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']

Failed API test run:

2021-10-22 02:43:37,352.352 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-22 02:43:37,353.353 INFO:__main__:
tearing down the cluster...
rm: cannot remove '/var/run/ceph': Permission denied
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/teuthology', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']

Perhaps we need to run the teardown with sudo to ensure that /var/run/ceph can be removed?
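
For illustration, a minimal sketch of what that could look like in qa/tasks/vstart_runner.py's teardown_cluster(), based only on the call shown in the traceback above; the sudo prefix is a hypothetical change for discussion, not a tested fix:

import os

# SRC_PREFIX and remote are existing module-level names in vstart_runner.py
# (see the traceback above); only the 'sudo' prefix below is new.
def teardown_cluster():
    # run stop.sh with elevated privileges so it can remove root-owned
    # leftovers such as /var/run/ceph
    remote.run(args=['sudo', os.path.join(SRC_PREFIX, 'stop.sh')], timeout=60)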

#2 Updated by Ernesto Puerta about 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to Ernesto Puerta

#3 Updated by Ernesto Puerta about 1 year ago

David found that the issue could come from leftovers from this Jenkins job: https://github.com/ceph/ceph-build/pull/1922/#issuecomment-952062596

The underlying issue could be in cephadm itself, as it seems that cephadm rm-cluster --fsid $FSID --force is not sufficient to clean up everything under /var/run/ceph.
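
A quick way to confirm this is to purge a test cluster and list what remains. A minimal sketch in Python; the fsid value is a placeholder to be replaced with the cluster actually being removed:

# Hypothetical check: run rm-cluster, then report anything left under /var/run/ceph.
import subprocess
from pathlib import Path

FSID = "00000000-0000-0000-0000-000000000000"  # placeholder fsid
subprocess.run(["cephadm", "rm-cluster", "--fsid", FSID, "--force"], check=True)

run_dir = Path("/var/run/ceph")
leftovers = sorted(run_dir.glob("*")) if run_dir.exists() else []
print("left under /var/run/ceph:", [str(p) for p in leftovers] or "nothing")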

#4 Updated by Sebastian Wagner about 1 year ago

  • Project changed from teuthology to Orchestrator
  • Subject changed from teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh'] to cephadm rm-cluster does not clean up /var/run/ceph
  • Description updated (diff)
  • Category changed from QA Suite to cephadm (binary)

#5 Updated by Sebastian Wagner about 1 year ago

It seems as if cephadm does not clean up /var/run/ceph.
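
For reference, the kind of extra cleanup rm-cluster could perform; this is a hedged sketch with a hypothetical helper name, not the actual cephadm implementation:

import shutil
from pathlib import Path

def cleanup_run_dir(fsid: str) -> None:
    # remove the per-cluster runtime directory, e.g. /var/run/ceph/<fsid>,
    # which otherwise survives the purge and trips the vstart teardown
    run_dir = Path("/var/run/ceph") / fsid
    if run_dir.exists():
        shutil.rmtree(run_dir, ignore_errors=True)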

#6 Updated by Sebastian Wagner about 1 year ago

  • Related to Bug #46655: cephadm rm-cluster: Systemd ceph.target not deleted added

#7 Updated by Sebastian Wagner 11 months ago

  • Status changed from In Progress to New
  • Assignee deleted (Ernesto Puerta)

#8 Updated by Sebastian Wagner 11 months ago

  • Related to Bug #44669: cephadm: rm-cluster should clean up /etc/ceph added

#9 Updated by Sebastian Wagner 11 months ago

  • Related to Feature #53815: cephadm rm-cluster should delete log files added

#10 Updated by Sebastian Wagner 10 months ago

  • Tags set to low-hanging-fruit

#11 Updated by Redouane Kachach Elhichou 10 months ago

  • Assignee set to Redouane Kachach Elhichou

#12 Updated by Redouane Kachach Elhichou 10 months ago

  • Related to Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) added

#13 Updated by Redouane Kachach Elhichou 10 months ago

  • Status changed from New to Fix Under Review

#15 Updated by Redouane Kachach Elhichou 10 months ago

  • Status changed from Fix Under Review to Closed

#16 Updated by Redouane Kachach Elhichou 9 months ago

  • Status changed from Closed to Resolved

#17 Updated by Redouane Kachach Elhichou 9 months ago

  • Pull request ID set to 44779

#18 Updated by Redouane Kachach Elhichou 8 months ago

  • Related to Bug #54142: quincy cephadm-purge-cluster needs work added
