Bug #45341

closed

Have provision.fog wait for rc.local to stop running

Added by David Galloway almost 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

I think what occasionally happens is that a job proceeds as soon as teuthology can SSH to the host, even though rc.local (which sets up networking by bouncing interfaces) hasn't finished. Teuthology starts its next task the moment the host is reachable, then rc.local bounces the host's NIC and breaks any process that relies on networking.

From: http://qa-proxy.ceph.com/teuthology/yuriw-2020-04-29_18:48:38-rbd-wip-yuri2-testing-2020-04-29-1652-octopus-distro-basic-smithi/4998169/teuthology.log

2020-04-29T20:29:24.291 INFO:teuthology.orchestra.run.smithi195:> sudo yum install -y kernel
2020-04-29T20:29:27.158 INFO:teuthology.orchestra.run.smithi195.stdout:CentOS-8 - AppStream                            0.0  B/s |   0  B     00:00
2020-04-29T20:29:27.159 INFO:teuthology.orchestra.run.smithi195.stderr:Failed to download metadata for repo 'CentOS-AppStream'
2020-04-29T20:29:27.159 INFO:teuthology.orchestra.run.smithi195.stderr:Error: Failed to download metadata for repo 'CentOS-AppStream'
2020-04-29T20:29:27.159 DEBUG:teuthology.orchestra.run:got remote process result: 1
2020-04-29T20:29:27.159 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/run_tasks.py", line 87, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/run_tasks.py", line 66, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/task/kernel.py", line 1250, in task
    version = need_to_install_distro(role_remote)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/task/kernel.py", line 746, in need_to_install_distro
    'sudo yum install -y kernel'
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 247, in sh
    proc=self.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 203, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 473, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi195 with status 1: 'sudo yum install -y kernel'

I think rc.local is to blame because I checked the DHCP logs and saw a DHCPREQUEST at that exact moment.

Apr 29 20:29:27 store01 dhcpd: DHCPREQUEST for 172.21.15.195 from 0c:c4:7a:88:80:81 via bond0
Apr 29 20:29:27 store01 dhcpd: DHCPACK on 172.21.15.195 to 0c:c4:7a:88:80:81 via bond0

Could we add a new optional parameter to the fog: config in teuthology.yml that makes provisioning wait until a sentinel file is present? In the Sepia lab's case, we know that once /.cephlab_net_configured is touched, the interface won't be bounced anymore and rc.local should be done doing its thing.

Maybe:

fog:
  api_token: foo
  user_token: bar
  endpoint: http://fog.front.sepia.ceph.com/fog
  machine_types: smithi
  wait_for: '/.cephlab_net_configured'
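
For illustration, the wait could look roughly like the polling loop below. This is only a sketch, not the teuthology implementation: wait_for_sentinel and its parameters are hypothetical names, and the exists callable stands in for whatever remote check the provisioner would actually use (for example, running test -e on the host over SSH).

```python
import os
import time


def wait_for_sentinel(exists, path, timeout=300.0, interval=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until exists(path) is true or `timeout` seconds have elapsed.

    `exists` is a hypothetical stand-in for a remote file check; it takes
    a path and returns True once the sentinel file is present.
    Returns True if the sentinel appeared, False if the wait timed out.
    """
    deadline = clock() + timeout
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        # Sentinel not there yet; rc.local may still be bouncing the NIC.
        sleep(interval)


if __name__ == "__main__":
    # Local demonstration only: checks this machine's filesystem once.
    print(wait_for_sentinel(os.path.exists, "/.cephlab_net_configured",
                            timeout=0, interval=0))
```

A real check would also need to tolerate the SSH connection dropping mid-poll, since the interface bounce is exactly what this is waiting out.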

This bug unnecessarily causes a bunch of job failures per day. http://sentry.ceph.com/sepia/teuthology/issues/4474/

Actions #1

Updated by Zack Cerza almost 4 years ago

In the cloud provisioner we use this method:
https://github.com/ceph/teuthology/blob/master/teuthology/provision/cloud/openstack.py#L304-L318

Does that seem like it would work here? Also, does it really need to be optional do you think?

Actions #2

Updated by David Galloway almost 4 years ago

Zack Cerza wrote:

In the cloud provisioner we use this method:
https://github.com/ceph/teuthology/blob/master/teuthology/provision/cloud/openstack.py#L304-L318

Does that seem like it would work here? Also, does it really need to be optional do you think?

Yep, that is exactly what I had in mind. I was just thinking it should be optional so we don't break other teuthology clusters in case they're not writing sentinel files or using ceph-cm-ansible.

Actions #3

Updated by Zack Cerza almost 4 years ago

Right, of course, that makes sense. I'd propose using fog.sentinel_file as the configuration item just to be slightly more descriptive. Would you mind making the fog/ansible change first, and then I can whip up a teuthology branch to test against?
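
As a rough sketch of how an optional fog.sentinel_file item could behave, the lookup below returns None when the option is absent, so clusters that don't write a sentinel file would skip the wait entirely. The function name and the plain dict standing in for the parsed teuthology.yml are assumptions for illustration, not teuthology's actual config API.

```python
def sentinel_path(config):
    """Return the configured sentinel file path, or None if it is unset.

    `config` is assumed to be a plain dict parsed from teuthology.yml;
    teuthology's real config object may look different.
    """
    fog = config.get("fog") or {}
    return fog.get("sentinel_file") or None


if __name__ == "__main__":
    with_sentinel = {"fog": {"machine_types": "smithi",
                             "sentinel_file": "/.cephlab_net_configured"}}
    without_sentinel = {"fog": {"machine_types": "smithi"}}
    for cfg in (with_sentinel, without_sentinel):
        path = sentinel_path(cfg)
        if path is None:
            print("no sentinel configured; proceeding once SSH is up")
        else:
            print("would wait for", path)
```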

Actions #4

Updated by David Galloway almost 4 years ago

Zack Cerza wrote:

Would you mind making the fog/ansible change first, and then I can whip up a teuthology branch to test against?

/.cephlab_net_configured is already the exact file we care about for this.

Actions #5

Updated by Zack Cerza almost 4 years ago

  • Status changed from New to Fix Under Review

Actions #6

Updated by Zack Cerza almost 4 years ago

  • Status changed from Fix Under Review to Resolved