Bug #11297

closed

"AssertionError: fewer data and journal disks than required" in ceph-deploy-hammer-distro-basic-multi run

Added by Yuri Weinstein about 9 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Normal
Category:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.front.sepia.ceph.com/teuthology-2015-03-31_15:00:38-ceph-deploy-hammer-distro-basic-multi/
Jobs: many; 830669, for example
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-31_15:00:38-ceph-deploy-hammer-distro-basic-multi/830669/

2015-03-31T16:44:11.468 INFO:teuthology.orchestra.run.plana92:Running: 'rmdir -- /home/ubuntu/cephtest'
2015-03-31T16:44:11.471 INFO:teuthology.orchestra.run.plana95:Running: 'rmdir -- /home/ubuntu/cephtest'
2015-03-31T16:44:11.477 DEBUG:teuthology.run_tasks:Unwinding manager internal.lock_machines
2015-03-31T16:44:11.477 DEBUG:teuthology.run_tasks:Exception was not quenched, exiting: AssertionError: fewer data and journal disks than required plana17
2015-03-31T16:44:11.478 INFO:teuthology.nuke:Checking targets against current locks
2015-03-31T16:44:11.616 DEBUG:teuthology.nuke:shortname: plana95
Actions #1

Updated by Zack Cerza about 9 years ago

  • Assignee set to Travis Rhoden

Tests that need >4 disks need to be scheduled on mira or burnupi

Actions #2

Updated by Travis Rhoden about 9 years ago

  • Assignee changed from Travis Rhoden to Yuri Weinstein

Since this test suite has roles with 3 OSDs per node, and is matrixed with an override that runs tests with separate block devices for OSD data and journal, it needs nodes with at least 6 block devices.

So, as Zack pointed out, these tests need to be run on mira or burnupi.
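
For reference, here is a minimal sketch of the arithmetic behind the get_dev_for_osd assertion quoted above. The function and variable names are illustrative, not the actual teuthology code; only the error message comes from the log.

def check_devs(shortname, devs, osds_on_node, separate_journal=True):
    # With the separate-journal-disk override, each OSD needs its own data
    # disk plus its own journal disk, so the requirement doubles.
    num_devs_reqd = osds_on_node * 2 if separate_journal else osds_on_node
    assert num_devs_reqd <= len(devs), \
        'fewer data and journal disks than required ' + shortname

# 3 OSDs per node with separate journals -> 6 devices required; a node with
# only 4 spare disks (device names made up for this example) raises the
# AssertionError seen in the description:
check_devs('plana17', ['/dev/sdb', '/dev/sdc', '/dev/sdd', '/dev/sde'], 3)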

Actions #4

Updated by Yuri Weinstein about 9 years ago

  • Assignee changed from Yuri Weinstein to Travis Rhoden

Seen on a VPS run:

Run: http://pulpito.ceph.com/teuthology-2015-04-07_18:15:26-ceph-deploy:rados-hammer-distro-basic-vps/
Jobs: many, for example 840644
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-04-07_18:15:26-ceph-deploy:rados-hammer-distro-basic-vps/840644/

2015-04-07T19:01:13.445 INFO:teuthology.orchestra.run.vpm165.stderr:[vpm047][DEBUG ] detect machine type
2015-04-07T19:01:13.449 INFO:teuthology.orchestra.run.vpm165.stderr:[ceph_deploy.install][INFO  ] Distro info: Ubuntu 12.04 precise
2015-04-07T19:01:13.449 INFO:teuthology.orchestra.run.vpm165.stderr:[vpm047][INFO  ] purging data on vpm047
2015-04-07T19:01:13.451 INFO:teuthology.orchestra.run.vpm165.stderr:[vpm047][INFO  ] Running command: sudo rm -rf --one-file-system -- /var/lib/ceph
2015-04-07T19:01:13.463 INFO:teuthology.orchestra.run.vpm165.stderr:[vpm047][INFO  ] Running command: sudo rm -rf --one-file-system -- /etc/ceph/
2015-04-07T19:01:13.492 INFO:teuthology.orchestra.run.vpm165.stderr:Unhandled exception in thread started by
2015-04-07T19:01:13.495 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 28, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip_ceph_deploy/tasks/ceph_deploy.py", line 251, in build_ceph_cluster
    node_dev_list = get_dev_for_osd(ctx, config)
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip_ceph_deploy/tasks/ceph_deploy.py", line 119, in get_dev_for_osd
    assert num_devs_reqd <= len(devs), 'fewer data and journal disks than required ' + shortname
AssertionError: fewer data and journal disks than required vpm047
2015-04-07T19:01:13.497 INFO:tasks.ceph_deploy:Removing ceph-deploy ...
2015-04-07T19:01:13.497 INFO:teuthology.orchestra.run.vpm165:Running: 'rm -rf /home/ubuntu/cephtest/ceph-deploy'
2015-04-07T19:01:13.584 INFO:teuthology.task.install:Removing shipped files: /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits...
2015-04-07T19:01:13.584 INFO:teuthology.orchestra.run.vpm047:Running: 'sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits'
2015-04-07T19:01:13.588 INFO:teuthology.orchestra.run.vpm165:Running: 'sudo rm -f -- /home/ubuntu/cephtest/valgrind.supp /usr/bin/daemon-helper /usr/bin/adjust-ulimits'
2015-04-07T19:01:13.634 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 55, in run_tasks
    manager.__enter__()
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip_ceph_deploy/tasks/ceph_deploy.py", line 477, in task
    lambda: build_ceph_cluster(ctx=ctx, config=dict(
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 28, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip_ceph_deploy/tasks/ceph_deploy.py", line 251, in build_ceph_cluster
    node_dev_list = get_dev_for_osd(ctx, config)
  File "/var/lib/teuthworker/src/ceph-qa-suite_wip_ceph_deploy/tasks/ceph_deploy.py", line 119, in get_dev_for_osd
    assert num_devs_reqd <= len(devs), 'fewer data and journal disks than required ' + shortname
AssertionError: fewer data and journal disks than required vpm047
2015-04-07T19:01:13.663 ERROR:teuthology.run_tasks: Sentry event: http://sentry.ceph.com/sepia/teuthology/search?q=73589e3458924aefb870c61fe0086a93
AssertionError: fewer data and journal disks than required vpm047
2015-04-07T19:01:13.663 DEBUG:teuthology.run_tasks:Unwinding manager ssh_keys
2015-04-07T19:01:13.663 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/home/teuthworker/src/teuthology_master/teuthology/task/ssh_keys.py", line 179, in task
    yield
AssertionError: fewer data and journal disks than required vpm047
2015-04-07T19:01:13.663 INFO:teuthology.task.ssh_keys:Cleaning up SSH keys
2015-04-07T19:01:13.663 INFO:teuthology.task.ssh_keys:cleaning up keys added for testing

Actions #5

Updated by Travis Rhoden about 9 years ago

Let's discuss this during the infra stand up today. The piece that I keep missing is "what" we are trying to test. What I mean by this is the following:

Does the test require a certain number of nodes to be useful?
A certain number of disks?
Do we need more than one OSD per node?
Do we require 3 copies, or does 2 work?

The Hammer branch was running tests that used 3 nodes with 3 OSDs each. Because the test suite pulled in enable_diff_journal_disk.yaml, there were certain variations of the tests that required 6 block devices. That means the tests can't run on VPS or mira nodes.

The most recent run that Yuri did was off of wip_ceph_deploy, which reduced the node count to 2, with 2 OSDs each. However, when scheduled against VPS nodes, this will still fail:

2015-04-07T18:56:35.993 INFO:teuthology.orchestra.run.vpm046:Running: 'ls /dev/[sv]d?'
2015-04-07T18:56:36.101 WARNING:teuthology.misc:Removing root device: /dev/vda from device list
2015-04-07T18:56:37.350 DEBUG:teuthology.misc:devs=['/dev/vdb', '/dev/vdc', '/dev/vdd']

You can see that a VPS node has three disks available -- /dev/vd[bcd]. That means that if we are going to keep enable_diff_journal_disk.yaml enabled, you can have at most 1 OSD per node.
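
To make that arithmetic concrete, here is a small sketch of the device discovery shown in the log above. The glob pattern, the root-device filtering, and the resulting device list come straight from the log; the function itself is illustrative, not the real teuthology.misc code.

import glob

def spare_devs(root_dev='/dev/vda'):
    # Mirrors the "ls /dev/[sv]d?" probe from the log, then drops the root
    # device, as the teuthology.misc warning above shows.
    return [d for d in sorted(glob.glob('/dev/[sv]d?')) if d != root_dev]

# On a VPS node this yields ['/dev/vdb', '/dev/vdc', '/dev/vdd'] (3 devices).
# With separate journal disks each OSD consumes 2 devices, so:
#   2 OSDs per node -> 4 devices required > 3 available -> AssertionError
#   1 OSD per node  -> 2 devices required <= 3 available -> OK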

If the tests don't care how many OSDs there are (this isn't a performance test), it seems to me like the best solution would be to change the roles to the following:

roles:
- - mon.a
  - mds.0
  - osd.0
- - osd.1
  - mon.b
- - mon.c
  - osd.2
- - client.0

This most closely resembles the original test, but merely reduces the per-node OSD count from 3 to 1. It should then be able to run on mira, plana, burnupi, and vps nodes. We also don't have to change the osd pool default size.

Thoughts?

Actions #6

Updated by Zack Cerza about 9 years ago

Possibly useful: #10767

Actions #7

Updated by Yuri Weinstein about 9 years ago

Run off wip_ceph_deploy: all passed but two, with a 3-node, one-OSD-per-node configuration:

- - mon.a
  - mds.0
  - osd.0
- - osd.1
  - mon.b
- - mon.c
  - osd.2
  - client.0

http://pulpito.ceph.com/teuthology-2015-04-09_09:12:48-ceph-deploy:rados-hammer-distro-basic-vps/

Two failed due to a VPS clock problem (unrelated).

Also see https://github.com/ceph/ceph-qa-suite/pull/400

Run on burnupi, mira:
http://pulpito.ceph.com/teuthology-2015-04-09_14:04:38-ceph-deploy:rados-hammer-distro-basic-multi/

Actions #8

Updated by Dan Mick almost 9 years ago

  • Regression set to No

Travis, what's up with this?

Actions #9

Updated by Dan Mick almost 9 years ago

  • Assignee changed from Travis Rhoden to Yuri Weinstein
  • Priority changed from Urgent to Normal
Actions #10

Updated by Yuri Weinstein almost 9 years ago

  • Status changed from New to 7

Note to self: Is it still happening?

Actions #11

Updated by Zack Cerza almost 9 years ago

Yuri Weinstein wrote:

Note to self: Is it still happening?

From paddles:

>>> Job.query.filter(Job.failure_reason.contains("fewer data and journal disks than required")).order_by(Job.updated.desc())[0]
<Job u'teuthology-2015-04-28_10:09:12-ceph-deploy:rados-hammer-distro-basic-vps' u'866886'>

Doesn't look like it.

Actions #12

Updated by Zack Cerza over 8 years ago

  • Status changed from 7 to Need More Info
Actions #13

Updated by Dan Mick over 8 years ago

  • Status changed from Need More Info to Closed

Reopen if it occurs again.
