Bug #45341
closed
Have provision.fog wait for rc.local to stop running
Description
I think what occasionally happens is that a job proceeds as soon as teuthology can SSH to the host, but rc.local (which sets up networking by bouncing interfaces) isn't done yet. As soon as teuthology can reach the host, it starts its next task. Then the host's NIC gets bounced, which breaks any process that relies on networking.
2020-04-29T20:29:24.291 INFO:teuthology.orchestra.run.smithi195:> sudo yum install -y kernel
2020-04-29T20:29:27.158 INFO:teuthology.orchestra.run.smithi195.stdout:CentOS-8 - AppStream 0.0 B/s | 0 B 00:00
2020-04-29T20:29:27.159 INFO:teuthology.orchestra.run.smithi195.stderr:Failed to download metadata for repo 'CentOS-AppStream'
2020-04-29T20:29:27.159 INFO:teuthology.orchestra.run.smithi195.stderr:Error: Failed to download metadata for repo 'CentOS-AppStream'
2020-04-29T20:29:27.159 DEBUG:teuthology.orchestra.run:got remote process result: 1
2020-04-29T20:29:27.159 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/run_tasks.py", line 87, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/run_tasks.py", line 66, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/task/kernel.py", line 1250, in task
    version = need_to_install_distro(role_remote)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/task/kernel.py", line 746, in need_to_install_distro
    'sudo yum install -y kernel'
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 247, in sh
    proc=self.run(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/remote.py", line 203, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 473, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi195 with status 1: 'sudo yum install -y kernel'
I think rc.local is to blame because I checked the DHCP logs and saw a DHCPREQUEST at that exact moment.
Apr 29 20:29:27 store01 dhcpd: DHCPREQUEST for 172.21.15.195 from 0c:c4:7a:88:80:81 via bond0
Apr 29 20:29:27 store01 dhcpd: DHCPACK on 172.21.15.195 to 0c:c4:7a:88:80:81 via bond0
Could we have a new optional parameter added to the fog: config in teuthology.yml that checks whether a sentinel file is present? In the Sepia lab's case, we know that once /.cephlab_net_configured is touched, the interface won't be bounced anymore and rc.local should be done doing its thing.
Maybe:
fog:
  api_token: foo
  user_token: bar
  endpoint: http://fog.front.sepia.ceph.com/fog
  machine_types: smithi
  wait_for: '/.cephlab_net_configured'
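For illustration, a minimal sketch of what such a check might look like, assuming the provisioner has some way to test for a file on the remote host. The function name wait_for_sentinel, the file_exists callable, and the timeout and interval defaults are all hypothetical and not taken from teuthology. Errors from the check are treated as "not ready yet", since the SSH connection may drop while the NIC is being bounced.

```python
import time
from typing import Callable


def wait_for_sentinel(
    file_exists: Callable[[str], bool],
    path: str,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `path` exists on the remote host or `timeout` elapses.

    `file_exists` is a hypothetical callable that runs the remote check
    (for example over SSH) and returns True once the file is present.
    Returns True if the sentinel appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if file_exists(path):
                return True
        except Exception:
            # The connection may fail while rc.local bounces the interface;
            # treat that as "not ready yet" and keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would pass its own remote check and only proceed to the next task when the function returns True.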
This bug unnecessarily causes a bunch of job failures per day. http://sentry.ceph.com/sepia/teuthology/issues/4474/
Updated by Zack Cerza almost 4 years ago
In the cloud provisioner we use this method:
https://github.com/ceph/teuthology/blob/master/teuthology/provision/cloud/openstack.py#L304-L318
Does that seem like it would work here? Also, does it really need to be optional do you think?
Updated by David Galloway almost 4 years ago
Zack Cerza wrote:
In the cloud provisioner we use this method:
https://github.com/ceph/teuthology/blob/master/teuthology/provision/cloud/openstack.py#L304-L318
Does that seem like it would work here? Also, does it really need to be optional do you think?
Yep, that is exactly what I had in mind. I was just thinking it should be optional so we don't break other teuthology clusters in case they're not writing sentinel files or using ceph-cm-ansible.
Updated by Zack Cerza almost 4 years ago
Right, of course, that makes sense. I'd propose using fog.sentinel_file as the configuration item just to be slightly more descriptive. Would you mind making the fog/ansible change first, and then I can whip up a teuthology branch to test against?
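One way to keep the setting optional is for the lookup to return nothing when the key is absent, so clusters that don't write a sentinel file skip the wait entirely. A minimal sketch, assuming the parsed teuthology.yml is available as a plain dict; the helper name get_sentinel_file is hypothetical and not teuthology's actual API:

```python
from typing import Optional


def get_sentinel_file(config: dict) -> Optional[str]:
    """Return the configured fog.sentinel_file path, or None if unset.

    `config` is assumed to be the parsed teuthology.yml as a dict.
    A missing `fog` section or a missing/empty key both mean "don't wait".
    """
    fog_conf = config.get("fog") or {}
    return fog_conf.get("sentinel_file") or None
```

With this shape, a config that has no sentinel_file entry behaves exactly as before.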
Updated by David Galloway almost 4 years ago
Zack Cerza wrote:
Would you mind making the fog/ansible change first, and then I can whip up a teuthology branch to test against?
/.cephlab_net_configured is already the exact file we care about for this.
Updated by Zack Cerza almost 4 years ago
- Status changed from New to Fix Under Review
Updated by Zack Cerza almost 4 years ago
- Status changed from Fix Under Review to Resolved