Bug #53010

cephadm rm-cluster does not clean up /var/run/ceph

Added by Laura Flores about 1 year ago. Updated 9 months ago.

Status:
Resolved
Priority:
Normal
Category:
cephadm (binary)
Target version:
-
% Done:
0%

Source:
Tags:
low-hanging-fruit
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh']

This API test failure has been occurring across PR Jenkins builds. It shows up sporadically: some PRs pass the test while others fail, and the failure does not appear to be related to any changes in the PRs where it occurs.

Below I have copied the Python traceback along with some of the surrounding output to provide more context.

Collecting pytz
  Using cached pytz-2021.3-py2.py3-none-any.whl (503 kB)
Installing collected packages: more-itertools, pytz, jaraco.functools, tempora, repoze.lru, portend, idna, cheroot, chardet, Routes, requests, pyopenssl, PyJWT, CherryPy, ceph, bcrypt
  Attempting uninstall: idna
    Found existing installation: idna 3.3
    Uninstalling idna-3.3:
      Successfully uninstalled idna-3.3
  Attempting uninstall: requests
    Found existing installation: requests 2.26.0
    Uninstalling requests-2.26.0:
      Successfully uninstalled requests-2.26.0
  Attempting uninstall: PyJWT
    Found existing installation: PyJWT 2.3.0
    Uninstalling PyJWT-2.3.0:
      Successfully uninstalled PyJWT-2.3.0
  Running setup.py develop for ceph
  Attempting uninstall: bcrypt
    Found existing installation: bcrypt 3.2.0
    Uninstalling bcrypt-3.2.0:
      Successfully uninstalled bcrypt-3.2.0
Successfully installed CherryPy-13.1.0 PyJWT-2.0.1 Routes-2.4.1 bcrypt-3.1.4 ceph-1.0.0 chardet-4.0.0 cheroot-8.5.2 idna-2.10 jaraco.functools-3.3.0 more-itertools-4.1.0 portend-3.0.0 pyopenssl-21.0.0 pytz-2021.3 repoze.lru-0.7 requests-2.25.1 tempora-4.1.2
/tmp/tmp.mAQFsq4fJ8
Processing /home/jenkins-build/.cache/pip/wheels/d8/81/0a/fae9efd3c9c706cefa25842310896e727a46567f2dc2dac6a8/coverage-4.5.2-cp38-cp38-linux_x86_64.whl
Installing collected packages: coverage
Successfully installed coverage-4.5.2
Cannot find device "ceph-brx" 
2021-10-21 14:21:10,353.353 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-21 14:21:10,354.354 INFO:__main__:
tearing down the cluster...
rm: cannot remove '/var/run/ceph': Permission denied
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']
Traceback (most recent call last):
  File "../qa/tasks/vstart_runner.py", line 1522, in <module>
    exec_test()
  File "../qa/tasks/vstart_runner.py", line 1357, in exec_test
    teardown_cluster()
  File "../qa/tasks/vstart_runner.py", line 1091, in teardown_cluster
    remote.run(args=[os.path.join(SRC_PREFIX, "stop.sh")], timeout=60)
  File "../qa/tasks/vstart_runner.py", line 410, in run
    return self._do_run(**kwargs)
  File "../qa/tasks/vstart_runner.py", line 478, in _do_run
    proc.wait()
  File "../qa/tasks/vstart_runner.py", line 221, in wait
    raise CommandFailedError(self.args, self.exitstatus)
teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh']
find: ‘/home/jenkins-build/build/workspace/ceph-api/build/out’: No such file or directory
Sample run:

Related issues

Related to Orchestrator - Bug #46655: cephadm rm-cluster: Systemd ceph.target not deleted Resolved
Related to Orchestrator - Bug #44669: cephadm: rm-cluster should clean up /etc/ceph Resolved
Related to Orchestrator - Feature #53815: cephadm rm-cluster should delete log files Resolved
Related to Orchestrator - Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) Resolved
Related to Orchestrator - Bug #54142: quincy cephadm-purge-cluster needs work Resolved

History

#1 Updated by Laura Flores about 1 year ago

The issue seems to occur during a "tearing down the cluster..." step.

Successful API test run:

2021-10-19 21:36:10,384.384 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-19 21:36:10,384.384 INFO:__main__:
tearing down the cluster...
2021-10-19 21:36:12,050.050 INFO:__main__:
ceph cluster torn down
2021-10-19 21:36:12,059.059 INFO:__main__:
running vstart.sh now...
2021-10-19 21:37:08,783.783 INFO:__main__:
vstart.sh finished running
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']

Failed API test run:

2021-10-22 02:43:37,352.352 INFO:__main__:Creating cluster with 1 MDS daemons
2021-10-22 02:43:37,353.353 INFO:__main__:
tearing down the cluster...
rm: cannot remove '/var/run/ceph': Permission denied
Using guessed paths /home/jenkins-build/build/workspace/ceph-api/build/lib/ ['/home/jenkins-build/build/workspace/ceph-api/qa', '/home/jenkins-build/teuthology', '/home/jenkins-build/build/workspace/ceph-api/build/lib/cython_modules/lib.3', '/home/jenkins-build/build/workspace/ceph-api/src/pybind']

Perhaps we need to run the teardown with sudo to ensure that /var/run/ceph can be removed?
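
For illustration, a minimal sketch of what that could look like in qa/tasks/vstart_runner.py's teardown_cluster(), based only on the call shown in the traceback above; the sudo prefix is a hypothetical change for discussion, not a tested fix:

import os

# SRC_PREFIX and remote are existing module-level names in vstart_runner.py
# (see the traceback above); only the 'sudo' prefix below is new.
def teardown_cluster():
    # run stop.sh with elevated privileges so it can remove root-owned
    # leftovers such as /var/run/ceph
    remote.run(args=['sudo', os.path.join(SRC_PREFIX, 'stop.sh')], timeout=60)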

#2 Updated by Ernesto Puerta about 1 year ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to Ernesto Puerta

#3 Updated by Ernesto Puerta about 1 year ago

David found that the issue could come from leftovers from this Jenkins job: https://github.com/ceph/ceph-build/pull/1922/#issuecomment-952062596

The underlying issue could be in cephadm itself, as it seems that cephadm rm-cluster --fsid $FSID --force is not sufficient to clean up everything under /var/run/ceph.
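
A quick way to confirm this is to purge a test cluster and list what remains. A minimal sketch in Python; the fsid value is a placeholder to be replaced with the cluster actually being removed:

# Hypothetical check: run rm-cluster, then report anything left under /var/run/ceph.
import subprocess
from pathlib import Path

FSID = "00000000-0000-0000-0000-000000000000"  # placeholder fsid
subprocess.run(["cephadm", "rm-cluster", "--fsid", FSID, "--force"], check=True)

run_dir = Path("/var/run/ceph")
leftovers = sorted(run_dir.glob("*")) if run_dir.exists() else []
print("left under /var/run/ceph:", [str(p) for p in leftovers] or "nothing")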

#4 Updated by Sebastian Wagner about 1 year ago

  • Project changed from teuthology to Orchestrator
  • Subject changed from teuthology.exceptions.CommandFailedError: Command failed with status 1: ['../src/stop.sh'] to cephadm rm-cluster does not clean up /var/run/ceph
  • Description updated (diff)
  • Category changed from QA Suite to cephadm (binary)

#5 Updated by Sebastian Wagner about 1 year ago

It seems as if cephadm does not clean up /var/run/ceph.
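
For reference, the kind of extra cleanup rm-cluster could perform; this is a hedged sketch with a hypothetical helper name, not the actual cephadm implementation:

import shutil
from pathlib import Path

def cleanup_run_dir(fsid: str) -> None:
    # remove the per-cluster runtime directory, e.g. /var/run/ceph/<fsid>,
    # which otherwise survives the purge and trips the vstart teardown
    run_dir = Path("/var/run/ceph") / fsid
    if run_dir.exists():
        shutil.rmtree(run_dir, ignore_errors=True)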

#6 Updated by Sebastian Wagner about 1 year ago

  • Related to Bug #46655: cephadm rm-cluster: Systemd ceph.target not deleted added

#7 Updated by Sebastian Wagner 11 months ago

  • Status changed from In Progress to New
  • Assignee deleted (Ernesto Puerta)

#8 Updated by Sebastian Wagner 11 months ago

  • Related to Bug #44669: cephadm: rm-cluster should clean up /etc/ceph added

#9 Updated by Sebastian Wagner 11 months ago

  • Related to Feature #53815: cephadm rm-cluster should delete log files added

#10 Updated by Sebastian Wagner 10 months ago

  • Tags set to low-hanging-fruit

#11 Updated by Redouane Kachach Elhichou 10 months ago

  • Assignee set to Redouane Kachach Elhichou

#12 Updated by Redouane Kachach Elhichou 10 months ago

  • Related to Bug #54018: Suspicious behavior when deleting a cluster (by running cephadm rm-cluster) added

#13 Updated by Redouane Kachach Elhichou 10 months ago

  • Status changed from New to Fix Under Review

#15 Updated by Redouane Kachach Elhichou 10 months ago

  • Status changed from Fix Under Review to Closed

#16 Updated by Redouane Kachach Elhichou 9 months ago

  • Status changed from Closed to Resolved

#17 Updated by Redouane Kachach Elhichou 9 months ago

  • Pull request ID set to 44779

#18 Updated by Redouane Kachach Elhichou 8 months ago

  • Related to Bug #54142: quincy cephadm-purge-cluster needs work added
