Bug #5559 (closed)

ARM rbd command CommandFailedError in teuthology

Added by Anonymous almost 11 years ago. Updated about 10 years ago.

Status: Won't Fix
Priority: Normal
Assignee: -
Category: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description

rbd cannot connect to the cluster from a teuthology test.

This problem can be reproduced on tala002 and tala004 doing the following:

Step 1: Power cycle tala002 and tala004.

Step 2: Make sure that things are clean on these machines so that teuthology does not complain:
cd
sudo dpkg --configure -a
sudo rm -fr /var/lib/ceph
sudo rm -fr cephtest/*
sudo rm -fr /var/log/ceph
sudo mkdir /var/log/ceph

Step 3: Run teuthology with the following yaml file (an example invocation is sketched after the file):

machine_type: tala
interactive-on-error: true
roles:
- [mon.a, mon.b, mon.c, mds.a, osd.0, osd.1, osd.2, osd.3,]
- [client.0]
tasks:
- ceph:
- rbd:
    all:
- workunit:
    clients:
      all:
        - kernel_untar_build.sh
targets:
  ubuntu@tala002.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQD5mFt7raxufuhfx3dxQDJg5mzJ4N+n94rHC/pEqCFvXSp5Fly9cZZxdmn6N5vNUerXIt7/ui2AlVii/bSNjBJrXGYwi+IK+tRPpHb1e5OaS1FdNeHHIeIofeTmUVC7wzsit7sWCcN0I+FjlVqWjXs4qsjI56MbAMC+YVAepbhOUT/j8tFFLXgMN4xFKx10G4TqGWJqsMA1+WD4DLHWI8GrqccGTdokzaotSFHH3uMJIzXfTpCLts1n6yX2iogmK2ayFyD7TmMPRI9ZQ2E5yvkMsYrAOyyPp7h3RVGRRYWR47mmdrENfjuVKQcK30tBSO3tl13BXxWNl1+rfMOk9Cqz
  ubuntu@tala004.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4XQUem3ze9TfBfsJ3pL8kPm+Y98TEJDQ76rOcdjMR4Rs8mte1Q1B93hT0CI8uRjFcv9uiKaOlweiqKXSx6N/20dsPQ2LN54FtXLB346vsxDmZH0RRzg7KfHja/AilEW3pN3nlLlYkCN/9yWuId3g1sN1L6Shylyc96OL2b++O5fZhZnzbbaHSvyngU73GY/sfRWWA6bB6suXRe/QMbHA/ge/+EvcjJ74nZynenujAchjcVmY6xzpXsXYtSSpYcdgkVh+7P1H0KkfWJwH8aRvsni7TE/6Zp8AtaROelCW1v5vMaLAUjjFtz2nVy2KSViktX3jIpwHDXoFd3eJumXxT
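
Assuming the yaml above is saved as arm_rbd.yaml (a filename chosen here only for illustration), the run can be started from a bootstrapped teuthology checkout roughly as follows; exact flags vary between teuthology versions, so treat this as a sketch:

# from the root of the teuthology checkout; arm_rbd.yaml is the job file above
./virtualenv/bin/teuthology arm_rbd.yaml

Because the yaml sets interactive-on-error: true, a failing task drops the run into an interactive shell (as seen in the log below) instead of tearing everything down.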

This produces the following output (extraneous lines removed):

2013-07-09T12:17:05.821 DEBUG:teuthology.misc:Ceph health: HEALTH_OK
2013-07-09T12:17:05.822 INFO:teuthology.run_tasks:Running task rbd...
2013-07-09T12:17:05.823 DEBUG:teuthology.task.rbd:rbd config is: {'client.0': None}
2013-07-09T12:17:05.823 DEBUG:teuthology.misc:basedir: /home/ubuntu/cephtest
2013-07-09T12:17:05.823 INFO:teuthology.task.rbd:Creating image testimage.client.0 with size 10240
2013-07-09T12:17:05.824 DEBUG:teuthology.orchestra.run:Running [10.214.143.4]: '/home/ubuntu/cephtest/wu1307091216/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/wu1307091216/archive/coverage rbd -p rbd create --size 10240 testimage.client.0'
2013-07-09T12:17:06.244 INFO:teuthology.orchestra.run.err:rbd: couldn't connect to the cluster!
2013-07-09T12:17:06.245 INFO:teuthology.orchestra.run.err:2013-07-09 12:17:06.066417 b6f6a2a0  0 monclient(hunting): authenticate timed out after 1.49351e-154
2013-07-09T12:17:06.245 INFO:teuthology.orchestra.run.err:2013-07-09 12:17:06.066567 b6f6a2a0  0 librados: client.admin authentication error (110) Connection timed out
2013-07-09T12:17:07.645 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/wusui/src/teuthology/teuthology/contextutil.py", line 25, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/wusui/src/teuthology/teuthology/task/rbd.py", line 70, in create_image
    remote.run(args=args)
  File "/home/wusui/src/teuthology/teuthology/orchestra/remote.py", line 43, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/wusui/src/teuthology/teuthology/orchestra/run.py", line 266, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/wusui/src/teuthology/teuthology/orchestra/run.py", line 262, in _check_status
    raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.143.4 with status 1: '/home/ubuntu/cephtest/wu1307091216/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/wu1307091216/archive/coverage rbd -p rbd create --size 10240 testimage.client.0'
2013-07-09T12:17:07.648 ERROR:teuthology.run_tasks:Saw exception from tasks
Traceback (most recent call last):
  File "/home/wusui/src/teuthology/teuthology/run_tasks.py", line 27, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/wusui/src/teuthology/teuthology/task/rbd.py", line 605, in task
    lambda: mount(ctx=ctx, config=role_images),
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/wusui/src/teuthology/teuthology/contextutil.py", line 25, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/wusui/src/teuthology/teuthology/task/rbd.py", line 70, in create_image
    remote.run(args=args)
  File "/home/wusui/src/teuthology/teuthology/orchestra/remote.py", line 43, in run
    r = self._runner(client=self.ssh, **kwargs)
  File "/home/wusui/src/teuthology/teuthology/orchestra/run.py", line 266, in run
    r.exitstatus = _check_status(r.exitstatus)
  File "/home/wusui/src/teuthology/teuthology/orchestra/run.py", line 262, in _check_status
    raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.143.4 with status 1: '/home/ubuntu/cephtest/wu1307091216/adjust-ulimits ceph-coverage /home/ubuntu/cephtest/wu1307091216/archive/coverage rbd -p rbd create --size 10240 testimage.client.0'
2013-07-09T12:17:07.652 WARNING:teuthology.run_tasks:Saw failure, going into interactive mode...
2013-07-09T12:27:18.494 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x298d250>
2013-07-09T12:27:18.494 ERROR:teuthology.contextutil:Saw exception from nested tasks

Note that if the rbd command is run manually, it seems to work.
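
For reference, the manual re-run is simply the same command from the log, executed in an ssh session on the client node (10.214.143.4 above); this sketch assumes the wu1307091216 run directory from this particular run is still in place:

# run on the client node (10.214.143.4 in the log above)
/home/ubuntu/cephtest/wu1307091216/adjust-ulimits ceph-coverage \
    /home/ubuntu/cephtest/wu1307091216/archive/coverage \
    rbd -p rbd create --size 10240 testimage.client.0

Run this way the command appears to succeed, while the identical invocation fails under teuthology.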

Actions #1

Updated by Anonymous almost 11 years ago

  • Assignee set to Josh Durgin

Note that this could be related to other problems with the ARM kernel. The
following was output on the console while this test was run.

[  168.139917] huh, entered softirq 4 BLOCK c0262c3c preempt_count 00000100, exited with 00000000?
[  445.430452] huh, entered softirq 4 BLOCK c0262c3c preempt_count 00000100, exited with 00000000?
Actions #2

Updated by Anonymous almost 11 years ago

This could very well be one of the kernel problems that we have already detected. After a fresh reinstallation, the same rbd call triggers the "BUG: scheduling while atomic" error.

Actions #3

Updated by Anonymous almost 11 years ago

  • Assignee changed from Josh Durgin to Anonymous

The problems shown here were greatly exacerbated by mistakes in my yaml files while running individual tests. I have corrected those issues and can now execute many of the rbd kernel tests.

The problems still occur, but far less frequently than before. The kernels we build are based on 3.10 versions on gitbuilder. Rossen said they recommend using the 3.5.0-1000-highbank kernel, which appears to be several releases earlier than ours. If we are to build on that version, we will need a new gitbuilder directory for it.
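
If we do pin the test nodes to that kernel, the job yaml would also need to point teuthology's kernel task at the new build. A rough sketch, assuming a 3.5-based gitbuilder branch exists and that the kernel task accepts a branch override (both are assumptions, not verified against our teuthology branch):

# append a kernel stanza to the job yaml; the branch name below is hypothetical
cat >> arm_rbd.yaml <<'EOF'
kernel:
  branch: highbank-3.5
EOF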

Actions #4

Updated by Anonymous almost 11 years ago

  • Status changed from New to In Progress

Behavior is inconsistent between builds.

Actions #5

Updated by Anonymous about 10 years ago

  • Status changed from In Progress to Won't Fix