Bug #25129

open

Race condition in install task

Added by Nathan Cutler over 5 years ago. Updated over 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

When the "branch" option is given to the install task, a race condition can occur, as reproduced here: http://pulpito.ceph.com/smithfarm-2018-07-26_19:04:20-smithfarm-mimic-distro-basic-smithi/ (minimal reproducer) and here: http://pulpito.ceph.com/yuriw-2018-07-24_22:40:04-upgrade:luminous-x-mimic-distro-basic-smithi/ (real-life example)

Somehow (not sure exactly by which mechanism) Teuthology adds the following override to every job:

  overrides:
    install:
      ceph:
        sha1: dd471db8bce26d29051e8c41d2dbd8a2baf5186e

This is the sha1 of the branch given via the --ceph parameter on the teuthology-suite command line. In this case, it's the tip of branch "mimic".

The test yaml further contains:

  tasks:
  - install:
      branch: luminous

When it starts, then, the install task has both "branch: luminous" and "sha1: dd471db8bce26d29051e8c41d2dbd8a2baf5186e" (tip of mimic) in its config dict, causing it to emit a message:

2018-07-26T19:20:51.219 WARNING:teuthology.packaging:More than one of ref, tag, branch, or sha1 supplied; using branch
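
The warning implies a selection step that picks one key when several are present. A minimal sketch of such a step follows. `choose_ref` and `ASSUMED_PRECEDENCE` are hypothetical names, not teuthology's actual code, and the only ordering the log confirms is that branch wins over sha1.

```python
import logging

log = logging.getLogger("packaging_sketch")

# Assumed order; the log above only confirms that "branch" beats "sha1".
ASSUMED_PRECEDENCE = ("branch", "tag", "ref", "sha1")


def choose_ref(config, precedence=ASSUMED_PRECEDENCE):
    """Return (key, value) for the first supplied key in precedence order."""
    supplied = [key for key in precedence if config.get(key)]
    if not supplied:
        raise ValueError("none of ref, tag, branch, or sha1 supplied")
    if len(supplied) > 1:
        log.warning("More than one of ref, tag, branch, or sha1 supplied; "
                    "using %s", supplied[0])
    key = supplied[0]
    return key, config[key]


# Config dict as described above: task yaml plus the --ceph override.
config = {
    "branch": "luminous",
    "sha1": "dd471db8bce26d29051e8c41d2dbd8a2baf5186e",
}
chosen = choose_ref(config)
```

With this input the sha1 from the override is ignored and the branch name alone drives the repository lookup.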

Next, teuthology queries shaman to get the repo:

2018-07-26T19:20:51.220 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F16.04%2Fx86_64&ref=luminous

Shaman returns a list of repos containing builds from any SHA1 it finds in branch "luminous", and teuthology takes the first result - https://github.com/ceph/teuthology/blob/master/teuthology/packaging.py#L847

    def _get_base_url(self):
        self.assert_result()
        return self._result.json()[0]['url']

At this point, nothing has been installed yet. This marks the beginning of the race condition.

2018-07-26T19:20:51.811 INFO:teuthology.task.install.deb:Pulling from https://2.chacra.ceph.com/r/ceph/luminous/0ce17faf47b4165587f0e717e32d469dc8c3f285/ubuntu/xenial/flavors/default/
2018-07-26T19:20:51.814 INFO:teuthology.task.install.deb:Package version is 12.2.7-18-g0ce17fa-1xenial

Now the install task gets to work - calls apt-get install, lots of dependent packages get pulled in, etc. Time passes.

When the packages finish installing, the install task checks to make sure the expected package version was really installed. If, in the meantime, Shaman finishes building another SHA1 from branch "luminous", this check will fail:

2018-07-26T19:22:11.308 WARNING:teuthology.packaging:More than one of ref, tag, branch, or sha1 supplied; using branch
2018-07-26T19:22:11.308 INFO:teuthology.packaging:ref: None
2018-07-26T19:22:11.308 INFO:teuthology.packaging:tag: None
2018-07-26T19:22:11.308 INFO:teuthology.packaging:branch: luminous
2018-07-26T19:22:11.308 INFO:teuthology.packaging:sha1: dd471db8bce26d29051e8c41d2dbd8a2baf5186e
2018-07-26T19:22:11.309 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=ubuntu%2F16.04%2Fx86_64&ref=luminous

This again returns a list of repos, but the first one (which is the only one teuthology looks at) is now different! (12.2.7-19-g3a01e5d-1xenial instead of 12.2.7-18-g0ce17fa-1xenial before apt-get was called)

But the log is silent about this anomaly. Next, the install task uses dpkg-query to see which version of the ceph packages was actually installed:

2018-07-26T19:22:11.882 INFO:teuthology.orchestra.run.smithi092:Running: "dpkg-query -W -f '${Version}' ceph" 
2018-07-26T19:22:11.915 INFO:teuthology.orchestra.run.smithi092.stdout:12.2.7-18-g0ce17fa-1xenial
2018-07-26T19:22:11.915 INFO:teuthology.packaging:The installed version of ceph is 12.2.7-18-g0ce17fa-1xenial

Now the install task reports the anomaly and aborts:

2018-07-26T19:22:11.915 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/install/__init__.py", line 250, in install
    install_packages(ctx, package_list, config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/install/__init__.py", line 127, in install_packages
    verify_package_version(ctx, config, remote)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/task/install/__init__.py", line 59, in verify_package_version
    pkg=pkg_to_check
RuntimeError: ceph version 12.2.7-19-g3a01e5d-1xenial was not installed, found 12.2.7-18-g0ce17fa-1xenial.
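
The sequence above can be reproduced in miniature. The sketch below uses a hypothetical in-memory `FakeShaman` in place of the real Shaman API, and `racy_install` is an illustrative stand-in for the install task, not teuthology's code. Only the version strings and the error text are taken from the log.

```python
class FakeShaman:
    """In-memory stand-in for the Shaman search API (hypothetical)."""

    def __init__(self, builds):
        self._builds = list(builds)  # newest build first

    def publish(self, version):
        """Simulate a new build becoming ready."""
        self._builds.insert(0, version)

    def search(self, ref):
        """Return all ready builds for ref; ref is ignored in this sketch."""
        return list(self._builds)


def racy_install(shaman, ref, install):
    """Query, install, then query again: the pattern described above."""
    wanted = shaman.search(ref)[0]      # first query, before installing
    installed = install(wanted)         # slow step; time passes
    expected = shaman.search(ref)[0]    # second query, after installing
    if installed != expected:
        raise RuntimeError(
            "ceph version %s was not installed, found %s."
            % (expected, installed))
    return installed


shaman = FakeShaman(["12.2.7-18-g0ce17fa-1xenial"])


def slow_install(version):
    # A new luminous build finishes while packages are being installed.
    shaman.publish("12.2.7-19-g3a01e5d-1xenial")
    return version


try:
    racy_install(shaman, "luminous", slow_install)
    message = None
except RuntimeError as exc:
    message = str(exc)
```

The failure depends only on timing: if no build is published during `slow_install`, the same call succeeds.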

IMO in this case the install task should see that "branch: luminous" was given and "override the override", i.e. overwrite the sha1 from --ceph with the real tip of branch "luminous".
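
One way to picture "override the override" is sketched below. It assumes flat config dicts and a caller-supplied `tip_of_branch` lookup; both the function name `override_the_override` and that lookup are hypothetical and do not exist in teuthology.

```python
def override_the_override(task_config, override, tip_of_branch):
    """Merge override and task config; an explicit branch replaces the sha1.

    tip_of_branch is a hypothetical callable mapping a branch name to the
    sha1 at its tip.
    """
    merged = dict(override)
    merged.update(task_config)
    if "branch" in task_config and "sha1" not in task_config:
        # The task asked for a branch, so the sha1 inherited from --ceph
        # no longer describes what should be installed.
        merged["sha1"] = tip_of_branch(task_config["branch"])
    return merged


# Values from the report; the tip lookup is stubbed for illustration.
override = {"sha1": "dd471db8bce26d29051e8c41d2dbd8a2baf5186e"}
task_config = {"branch": "luminous"}
tips = {"luminous": "0ce17faf47b4165587f0e717e32d469dc8c3f285"}

merged = override_the_override(task_config, override, tips.__getitem__)
```

After the merge, branch and sha1 agree, so a later version check can compare against one fixed sha1.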


Related issues 1 (0 open, 1 closed)

Related to teuthology - Feature #24760: Add ability to check/install (/from) chacra.ceph.com and/or download.ceph.com for released ceph versions (Resolved, Zack Cerza, 07/03/2018)

Actions #1

Updated by Yuri Weinstein over 5 years ago

  • Related to Feature #24760: Add ability to check/install (/from) chacra.ceph.com and/or download.ceph.com for released ceph versions added
Actions #2

Updated by Nathan Cutler over 5 years ago

  • Description updated (diff)
Actions #3

Updated by Sage Weil over 5 years ago

Why is the install task querying shaman a second time? Seems like that's the source of the race...

Actions #4

Updated by Nathan Cutler over 5 years ago

I agree with Sage that it makes sense to take the first repo/version number returned by shaman and stick to it (i.e. not query Shaman post-install, but rather compare the installed version with the result of the first Shaman query). That fixes the race, and will be less disruptive than insisting on tip-of-branch.
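
A minimal sketch of that approach, assuming the three steps are passed in as callables. `install_and_verify` and the helper functions are hypothetical and only illustrate the idea of querying once and comparing against the remembered result.

```python
def install_and_verify(query_repo_version, install, query_installed_version):
    """Query the repository once, install, then verify against that result."""
    expected = query_repo_version()   # the only repository query
    install(expected)
    installed = query_installed_version()
    if installed != expected:
        raise RuntimeError(
            "ceph version %s was not installed, found %s."
            % (expected, installed))
    return installed


# Simulated environment: a newer build appears while installing.
builds = ["12.2.7-18-g0ce17fa-1xenial"]  # newest first
state = {"installed": None, "queries": 0}


def query_repo_version():
    state["queries"] += 1
    return builds[0]


def install(version):
    state["installed"] = version
    builds.insert(0, "12.2.7-19-g3a01e5d-1xenial")


def query_installed_version():
    return state["installed"]


result = install_and_verify(query_repo_version, install,
                            query_installed_version)
```

Because the expected version is captured before installation starts, a build that lands during the install cannot change the outcome of the check.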

Actions #5

Updated by Nathan Cutler over 5 years ago

  • Description updated (diff)