Bug #19636
Updated by Nathan Cutler almost 7 years ago
test descriptions:

* upgrade:client-upgrade/hammer-client-x/rbd/{0-cluster/start.yaml 1-install/hammer-client-x.yaml 2-workload/rbd_notification_tests.yaml}
* upgrade:client-upgrade/jewel-client-x/rbd/{0-cluster/start.yaml 1-install/jewel-client-x.yaml 2-workload/rbd_notification_tests.yaml}

test failure reproducible? YES

test URLs:

* http://pulpito.ceph.com/smithfarm-2017-04-16_18:30:46-upgrade:client-upgrade-wip-kraken-backports-distro-basic-vps/1033423/
* http://pulpito.ceph.com/smithfarm-2017-04-16_18:30:46-upgrade:client-upgrade-wip-kraken-backports-distro-basic-vps/1033424/

test cluster:

<pre>
roles:
- - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - client.0
- - client.1
</pre>

what seems to happen:

# hammer is installed on both nodes
# the "client.1" node is upgraded to kraken (wip-kraken-backports)
# ceph task runs
# the "hammer" version of "rbd/notify_master.sh" is run on client.0 and the "hammer" version of "rbd/notify_slave.sh" is run on client.1 (afaict the hammer and kraken versions of these scripts are identical)

<pre>
2017-04-16T18:44:20.785 INFO:tasks.workunit.client.0.vpm175.stderr:+ dirname /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/notify_master.sh
2017-04-16T18:44:20.786 INFO:tasks.workunit.client.0.vpm175.stderr:+ relpath=/home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/../../../src/test/librbd
2017-04-16T18:44:20.787 INFO:tasks.workunit.client.0.vpm175.stderr:+ python /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/../../../src/test/librbd/test_notify.py master
</pre>

Three hours later, both workunits are stopped (timeout):

<pre>
2017-04-16T21:44:02.648 INFO:tasks.workunit:Stopping ['rbd/notify_master.sh'] on client.0...
2017-04-16T21:44:02.649 INFO:teuthology.orchestra.run.vpm089:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2017-04-16T21:44:02.668 INFO:tasks.workunit:Stopping ['rbd/notify_slave.sh'] on client.1...
2017-04-16T21:44:02.669 INFO:teuthology.orchestra.run.vpm101:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1'
</pre>

And 1/10 of a second later, as if the script just gave up without actually doing anything, the 'rbd/notify_master.sh' script fails and the job fails with an exception:

<pre>
2017-04-16T21:44:02.758 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kraken-backports/qa/tasks/workunit.py", line 176, in task
    config.get('env'), timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-kraken-backports/qa/tasks/workunit.py", line 450, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 414, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 149, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 171, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rbd/notify_master.sh) on vpm089 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=hammer TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 RBD_FEATURES=13 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rbd/notify_master.sh'
</pre>