Bug #19636

Updated by Nathan Cutler almost 7 years ago

test descriptions:

* upgrade:client-upgrade/hammer-client-x/rbd/{0-cluster/start.yaml 1-install/hammer-client-x.yaml 2-workload/rbd_notification_tests.yaml}
* upgrade:client-upgrade/jewel-client-x/rbd/{0-cluster/start.yaml 1-install/jewel-client-x.yaml 2-workload/rbd_notification_tests.yaml}

test failure reproducible? YES

test URLs:

* http://pulpito.ceph.com/smithfarm-2017-04-16_18:30:46-upgrade:client-upgrade-wip-kraken-backports-distro-basic-vps/1033423/
* http://pulpito.ceph.com/smithfarm-2017-04-16_18:30:46-upgrade:client-upgrade-wip-kraken-backports-distro-basic-vps/1033424/

test cluster:

<pre>
roles:
- - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - client.1
</pre>

what seems to happen:

# hammer is installed on both nodes
# the "client.1" node is upgraded to kraken (wip-kraken-backports)
# ceph task runs
# the "hammer" version of "rbd/notify_master.sh" is run on client.0 and the "hammer" version of "rbd/notify_slave.sh" is run on client.1 (afaict the hammer and kraken versions of these scripts are identical)

<pre>
2017-04-16T18:44:20.785 INFO:tasks.workunit.client.0.vpm175.stderr:+ dirname /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/notify_master.sh
2017-04-16T18:44:20.786 INFO:tasks.workunit.client.0.vpm175.stderr:+ relpath=/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/../../../src/test/librbd
2017-04-16T18:44:20.787 INFO:tasks.workunit.client.0.vpm175.stderr:+ python /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/../../../src/test/librbd/test_notify.py master
</pre>

Three hours later, both workunits are stopped (timeout):

<pre>
2017-04-16T21:44:02.648 INFO:tasks.workunit:Stopping ['rbd/notify_master.sh'] on client.0...
2017-04-16T21:44:02.649 INFO:teuthology.orchestra.run.vpm089:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2017-04-16T21:44:02.668 INFO:tasks.workunit:Stopping ['rbd/notify_slave.sh'] on client.1...
2017-04-16T21:44:02.669 INFO:teuthology.orchestra.run.vpm101:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1'
</pre>
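The gap between the launch of test_notify.py and the "Stopping" message can be checked with a short Python sketch. The timestamps are copied from the log excerpts in this report; the helper name @elapsed@ is only illustrative.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the teuthology log excerpts quoted in this report
FMT = "%Y-%m-%dT%H:%M:%S.%f"

def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two log timestamps."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

started = "2017-04-16T18:44:20.787"  # test_notify.py launched on client.0
stopped = "2017-04-16T21:44:02.648"  # "Stopping ['rbd/notify_master.sh'] on client.0..."

gap = elapsed(started, stopped)
print(gap)
```

The result is just under three hours, which roughly matches the "timeout 3h" wrapper visible in the failing command line.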

And 1/10 of a second later the job fails with an exception, as if the 'rbd/notify_master.sh' script had just given up without actually doing anything:

<pre>
2017-04-16T21:44:02.758 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kraken-backports/qa/tasks/workunit.py", line 176, in task
    config.get('env'), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kraken-backports/qa/tasks/workunit.py", line 450, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label

CommandFailedError: Command failed (workunit test rbd/notify_master.sh) on vpm089 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=hammer TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 RBD_FEATURES=13 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/notify_master.sh'
</pre>
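Exit status 124 is what GNU @timeout@ returns when the wrapped command runs past its limit, so the error above is consistent with the script hanging until "timeout 3h" expired rather than failing on its own. A minimal Python sketch for pulling the status out of such a message follows; the regular expression and the @classify@ helper are assumptions for illustration and are not part of teuthology.

```python
import re

# Exit status GNU coreutils `timeout` uses when the command timed out
TIMEOUT_STATUS = 124

# Hypothetical pattern for the "Command failed (...) on HOST with status N" text
STATUS_RE = re.compile(
    r"Command failed \((?P<label>.*?)\) on (?P<host>\S+) with status (?P<status>\d+)"
)

def classify(message: str) -> str:
    """Summarize a CommandFailedError message, flagging timeouts."""
    m = STATUS_RE.search(message)
    if m is None:
        return "unrecognized"
    status = int(m.group("status"))
    if status == TIMEOUT_STATUS:
        return f"{m.group('label')} timed out on {m.group('host')}"
    return f"{m.group('label')} exited with status {status} on {m.group('host')}"

# Start of the message from this report; the quoted command line is left out
msg = ("CommandFailedError: Command failed (workunit test rbd/notify_master.sh) "
       "on vpm089 with status 124")
print(classify(msg))
```

Only the status number is interpreted here; it says nothing about why test_notify.py never finished.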
