Bug #9027 (closed)

Failed "create unique_pool_0 16 16 erasure teuthologyprofile" in upgrade:dumpling-firefly-x-next-testing-basic-vps suite

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
% Done:
100%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Tests failed in the upgrade:dumpling-firefly-x-next-testing-basic-vps suite with erasure coding (EC) enabled.

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-08-05_19:05:01-upgrade:dumpling-firefly-x-next-testing-basic-vps/402505/

2014-08-05T22:28:52.569 INFO:teuthology.task.mon_thrash.mon_thrasher:waiting for 1.0 secs before continuing thrashing
2014-08-05T22:28:53.569 INFO:teuthology.task.mon_thrash.ceph_manager:waiting for quorum size 3
2014-08-05T22:28:53.569 INFO:teuthology.orchestra.run.vpm068:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2014-08-05T22:28:53.767 INFO:teuthology.task.mon_thrash.ceph_manager:quorum_status is {"election_epoch":232,"quorum":[0,1,2],"quorum_names":["b","a","c"],"quorum_leader_name":"b","monmap":{"epoch":1,"fsid":"75400fef-e668-48f0-aae8-ecdd8751422a","modified":"2014-08-06 02:49:06.214815","created":"2014-08-06 02:49:06.214815","mons":[{"rank":0,"name":"b","addr":"10.214.138.129:6789\/0"},{"rank":1,"name":"a","addr":"10.214.138.139:6789\/0"},{"rank":2,"name":"c","addr":"10.214.138.129:6790\/0"}]}}

2014-08-05T22:28:53.768 INFO:teuthology.task.mon_thrash.ceph_manager:quorum is size 3
2014-08-05T22:28:53.768 DEBUG:teuthology.run_tasks:Unwinding manager rados
2014-08-05T22:28:53.768 INFO:teuthology.task.rados:joining rados
2014-08-05T22:28:53.768 ERROR:teuthology.run_tasks:Manager failed: rados
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_next/teuthology/run_tasks.py", line 105, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/teuthworker/src/teuthology_next/teuthology/task/rados.py", line 200, in task
    running.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
AssertionError
2014-08-05T22:28:53.769 DEBUG:teuthology.run_tasks:Unwinding manager rados
2014-08-05T22:28:53.769 INFO:teuthology.task.rados:joining rados
2014-08-05T22:28:53.769 ERROR:teuthology.run_tasks:Manager failed: rados
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_next/teuthology/run_tasks.py", line 105, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/teuthworker/src/teuthology_next/teuthology/task/rados.py", line 200, in task
    running.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
CommandFailedError: Command failed on vpm068 with status 22: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd pool create unique_pool_0 16 16 erasure teuthologyprofile'
archive_path: /var/lib/teuthworker/archive/teuthology-2014-08-05_19:05:01-upgrade:dumpling-firefly-x-next-testing-basic-vps/402505
branch: next
description: upgrade:dumpling-firefly-x/parallel/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-workload/{rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml}
  3-firefly-upgrade/firefly.yaml 4-workload/{rados_api.yaml rados_loadgenbig.yaml
  test_rbd_api.yaml test_rbd_python.yaml} 5-upgrade-sequence/upgrade-by-daemon.yaml
  6-final-workload/{ec-readwrite.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml
  rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_s3tests.yaml rgw_swift.yaml}
  distros/ubuntu_12.04.yaml}
email: ceph-qa@ceph.com
job_id: '402505'
kernel: &id001
  kdb: true
  sha1: 967166011221589288348b893720d358150176b9
last_in_suite: false
machine_type: vps
name: teuthology-2014-08-05_19:05:01-upgrade:dumpling-firefly-x-next-testing-basic-vps
nuke-on-error: true
os_type: ubuntu
os_version: '12.04'
overrides:
  admin_socket:
    branch: next
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    sha1: dceab8dc49d4f458e3bc9fd40eb8b487f1e35948
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: dceab8dc49d4f458e3bc9fd40eb8b487f1e35948
  s3tests:
    branch: next
  workunit:
    sha1: dceab8dc49d4f458e3bc9fd40eb8b487f1e35948
owner: scheduled_teuthology@teuthology
priority: 1000
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
suite: upgrade:dumpling-firefly-x
suite_branch: master
targets:
  ubuntu@vpm068.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCu9XeLYF+bwlBoMJ5PvYBve6WQp7YUJefyhPy43SGxPwdep80xM3XievRLw+pT3D1jEjNhteWY5sNlwu3FbbcLJfymRcRTYMyzUEcraI9V+w1SH8Zcr0Xk/axGLRvfj0hXxc1GBlrIP5r25hROtGOl3xG2vIxvKtjUzLZ82SXDjbKEWSW2Vqu3fh+yHgGEEsbc4L+XlWLd5T3IQEQpJ3jikqqTIlJozO24TEd4SgTonA45Kn5zWD3F5KIIcwcBT8xZTKTmBiZ/mACMBjOGxrWWo1JQ9KNOYXCNPiP917OlXK51a1i225FNxP8JNl+S/iP7QZKQXtB24Zpo1Xpmo1LF
  ubuntu@vpm071.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+SPcXJRlWMjKtD2N72o3M8TOKhhV3zze4se57HFyW17zvRYduLgMqTgZwoSa2HP6qs/uKxaWJhBt1W5DGagIJ9p1XZpJv8c9bukpAL+JytDWTdhmPzmW/FtLevNECmog9wGhn2APSMIySjpqXvHZzLqssgDXsko6rO7YH1auYyo1ef9lm9XM2vDoo376sA3lh4T6cajL4GoK30ikk+gBR/raMWhZhlfVvP97kFGs+0Yc/ZOOOs/fikOTMEVIxsnPgV8mlIJj5Z2HvnaOHf3ny6SHVhqmhELCCd+wgdtdjy6rOSVSnT0FhmegQV4yRnaE0Gy5YFkQAImmX0RRPoe99
  ubuntu@vpm072.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEkYA0FL84PKCSYqnRqM8MafXqgSwp1sTxLU7Lbus2Xw3AKYXrrRNzHbnaGl12R5q7TfYdXCwE7gd5MT4RE5dSbZNHm361qFsRUq+9Kj7ScBjby4XY28b0Ce8Pkg84If2umsdExssdqAzvkDlJgEaO15RnxDAGKLt3N3jwCSM/y0ajqjToG9MvrRgvRfiYxfNhrU4Iwlr1nDY11nJ6eaBZIZhT1yUOsynCzRp3ubFkSHI8ZnpyInNJGdjob+DN81Up+DpXaKwRBl7tvMDyczQ8hAAiPGkpQ8Cg5mWnv4SbxTZyGzVy4FpGEpHWSh72mEsT8fFNHaZVnCxb3bUV/Bo3
tasks:
- internal.lock_machines:
  - 3
  - vps
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.serialize_remote_roles: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.sudo: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: dumpling
- print: '**** done dumpling install'
- ceph:
    fs: xfs
- parallel:
  - workload
- print: '**** done parallel'
- install.upgrade:
    client.0:
      branch: firefly
    mon.a:
      branch: firefly
    mon.b:
      branch: firefly
- print: '**** done install.upgrade'
- ceph.restart: null
- print: '**** done restart'
- parallel:
  - workload2
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- print: '**** done install.upgrade client.0 to the version from teuthology-suite
    arg'
- rados:
    clients:
    - client.0
    ec_pool: true
    objects: 500
    op_weights:
      append: 45
      delete: 10
      read: 45
      write: 0
    ops: 4000
- rados:
    clients:
    - client.1
    objects: 50
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
- workunit:
    clients:
      client.1:
      - rados/load-gen-mix.sh
- mon_thrash:
    revive_delay: 20
    thrash_delay: 1
- workunit:
    clients:
      client.1:
      - rados/test.sh
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- workunit:
    clients:
      client.1:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- rgw:
  - client.1
- s3tests:
    client.1:
      rgw_server: client.1
- swift:
    client.1:
      rgw_server: client.1
teuthology_branch: next
tube: vps
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
  - print: '**** done install.upgrade mon.a to the version from teuthology-suite arg'
  - install.upgrade:
      mon.b: null
  - print: '**** done install.upgrade mon.b to the version from teuthology-suite arg'
  - ceph.restart:
      daemons:
      - mon.a
  - sleep:
      duration: 60
  - ceph.restart:
      daemons:
      - mon.b
  - sleep:
      duration: 60
  - ceph.restart:
    - mon.c
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.0
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.1
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.2
  - sleep:
      duration: 60
  - ceph.restart:
    - osd.3
  - sleep:
      duration: 60
  - ceph.restart:
    - mds.a
  - exec:
      mon.a:
      - ceph osd crush tunables firefly
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.vps.8709
workload:
  sequential:
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/test.sh
        - cls
  - print: '**** done rados/test.sh &  cls'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rados/load-gen-big.sh
  - print: '**** done rados/load-gen-big.sh'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rbd/test_librbd.sh
  - print: '**** done rbd/test_librbd.sh'
  - workunit:
      branch: dumpling
      clients:
        client.0:
        - rbd/test_librbd_python.sh
  - print: '**** done rbd/test_librbd_python.sh'
workload2:
  sequential:
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rados/test.sh
        - cls
  - print: '**** done #rados/test.sh and cls 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rados/load-gen-big.sh
  - print: '**** done rados/load-gen-big.sh 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd.sh
  - print: '**** done rbd/test_librbd.sh 2'
  - workunit:
      branch: firefly
      clients:
        client.0:
        - rbd/test_librbd_python.sh
  - print: '**** done rbd/test_librbd_python.sh 2'
client.0-kernel-sha1: 967166011221589288348b893720d358150176b9
description: upgrade:dumpling-firefly-x/parallel/{0-cluster/start.yaml 1-dumpling-install/dumpling.yaml
  2-workload/{rados_api.yaml rados_loadgenbig.yaml test_rbd_api.yaml test_rbd_python.yaml}
  3-firefly-upgrade/firefly.yaml 4-workload/{rados_api.yaml rados_loadgenbig.yaml
  test_rbd_api.yaml test_rbd_python.yaml} 5-upgrade-sequence/upgrade-by-daemon.yaml
  6-final-workload/{ec-readwrite.yaml rados-snaps-few-objects.yaml rados_loadgenmix.yaml
  rados_mon_thrash.yaml rbd_cls.yaml rbd_import_export.yaml rgw_s3tests.yaml rgw_swift.yaml}
  distros/ubuntu_12.04.yaml}
duration: 10086.872591018677
failure_reason: ''
flavor: basic
mon.a-kernel-sha1: 967166011221589288348b893720d358150176b9
mon.b-kernel-sha1: 967166011221589288348b893720d358150176b9
owner: scheduled_teuthology@teuthology
success: false
#1

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to 12

For some reason it is trying to re-create a pool that already exists, and the command fails:

2014-08-05T21:33:16.761 INFO:teuthology.orchestra.run.vpm068.stderr:Error EINVAL: pool 'unique_pool_0' cannot change to type erasure

If it were a replicated pool the command would silently succeed, because pool creation is idempotent. But trying to re-create the pool with a different type fails instead (which is good, IMHO ;-).
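
A minimal sketch of that asymmetry, assuming a reachable test cluster and the stock ceph CLI on the path (the pool_create helper is only for illustration):

import subprocess

def pool_create(*args):
    """Run `ceph osd pool create` with the given arguments and return its exit status."""
    return subprocess.call(['ceph', 'osd', 'pool', 'create'] + list(args))

# Creating the same replicated pool twice is idempotent: the second call is
# expected to succeed even though the pool already exists.
pool_create('unique_pool_0', '16', '16')
pool_create('unique_pool_0', '16', '16')

# Re-creating it as an erasure-coded pool is rejected instead, matching the
# EINVAL above and the status 22 in the traceback.
status = pool_create('unique_pool_0', '16', '16', 'erasure', 'teuthologyprofile')
assert status == 22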

#3

Updated by Sage Weil over 9 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Urgent

ceph osd pool create unique_pool_0 hung

#4

Updated by Sage Weil over 9 years ago

    def create_pool_with_unique_name(self, pg_num=16, ec_pool=False, ec_m=1, ec_k=2):
        """ 
        Create a pool named unique_pool_X where X is unique.
        """ 
        name = "" 
        with self.lock:
            name = "unique_pool_%s" % (str(self.next_pool_id),)
            self.next_pool_id += 1
            self.create_pool(
                name,
                pg_num,
                ec_pool=ec_pool,
                ec_m=ec_m,
                ec_k=ec_k)
        return name

2014-08-05T21:33:04.619 INFO:teuthology.run_tasks:Running task rados...
2014-08-05T21:33:04.621 INFO:teuthology.task.rados:Beginning rados...
2014-08-05T21:33:04.621 INFO:teuthology.run_tasks:Running task rados...
2014-08-05T21:33:04.621 INFO:teuthology.task.rados:Beginning rados...

I think they raced, but my ignorant reading of that Python says that it should give back unique pool names?
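
One thing the locking above does not cover: names are only unique within a single CephManager instance, not across two of them. A stripped-down sketch of that distinction (FakeCephManager is a hypothetical stand-in, not the real ceph_manager.py class):

import threading

class FakeCephManager(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.next_pool_id = 0            # every instance starts counting at 0

    def create_pool_with_unique_name(self):
        with self.lock:                  # only serializes callers of this instance
            name = "unique_pool_%s" % (str(self.next_pool_id),)
            self.next_pool_id += 1
            return name

task_a = FakeCephManager()               # manager built by one rados task
task_b = FakeCephManager()               # manager built by the other rados task

# Each manager is internally consistent, yet both pick the same first name.
assert task_a.create_pool_with_unique_name() == "unique_pool_0"
assert task_b.create_pool_with_unique_name() == "unique_pool_0"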

#5

Updated by Sage Weil over 9 years ago

  • Assignee changed from Sage Weil to Loïc Dachary
#6

Updated by Loïc Dachary over 9 years ago

The two rados tasks

  - rados:
      clients:
      - client.0
      ec_pool: true
      objects: 500
      op_weights:
        append: 45
        delete: 10
        read: 45
        write: 0
      ops: 4000
  - rados:
      clients:
      - client.1
      objects: 50
      op_weights:
        delete: 50
        read: 100
        rollback: 50
        snap_create: 50
        snap_remove: 50
        write: 100
      ops: 4000

are run by calling the task method, which spawns a gevent thread that creates a CephManager object. The CephManager is later used to create a pool.

The manager is stored in the ctx object, which is common to all tasks. However, there may be a race condition since the tasks are run in parallel:

- parallel:
    - workload
    - upgrade-sequence

If the two tasks end up with different CephManager objects, each has its own next_pool_id counter and both will try to create the unique_pool_0 pool.
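
A hedged sketch of that hypothesis: if each workload draws names from its own counter (its own CephManager) while running in its own gevent greenlet, both pick unique_pool_0, and the one that tries to create the erasure-coded pool after the replicated pool exists gets the EINVAL above. The helper below is illustrative, not the real teuthology/task/rados.py:

import itertools
import gevent

def rados_workload(client):
    # A per-task counter, mimicking a CephManager whose next_pool_id starts at 0.
    counter = itertools.count()
    def run():
        return 'unique_pool_%d' % next(counter)
    return gevent.spawn(run)

running = [rados_workload(c) for c in ('client.0', 'client.1')]
gevent.joinall(running)
print([g.value for g in running])        # ['unique_pool_0', 'unique_pool_0']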

#7

Updated by Loïc Dachary over 9 years ago

  • Project changed from Ceph to teuthology
  • Status changed from 12 to Fix Under Review
  • % Done changed from 0 to 80
#8

Updated by Loïc Dachary over 9 years ago

There needs to be a real lock to protect the part of the code that changes ctx. A lock is created in ctx and used by rados.py to ensure that only one ctx.manager instance is ever created.
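
A rough sketch of the idea, assuming rados.py lazily creates ctx.manager under a lock that lives on ctx (the names init_ctx and ensure_manager are illustrative, not the merged patch):

import threading

def init_ctx(ctx):
    # Created once, before any task runs, so the lock itself cannot be raced.
    ctx.manager_lock = threading.Lock()

def ensure_manager(ctx, make_manager):
    """Return ctx.manager, building it at most once under ctx.manager_lock."""
    with ctx.manager_lock:
        if not hasattr(ctx, 'manager'):
            ctx.manager = make_manager()   # e.g. one shared CephManager
        return ctx.manager

With a single shared manager there is a single next_pool_id counter, so the two rados workloads can no longer pick the same pool name.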

#9

Updated by Loïc Dachary over 9 years ago

Alternative solution: initialize ctx.manager in ceph.py.
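
A sketch of that alternative, again with illustrative names rather than the merged code: ceph.py builds the one and only manager once the cluster is up, and later tasks reuse it instead of constructing their own.

def ceph_task(ctx, config, make_manager):
    # Runs once per job, before the parallel workloads, so this is the only
    # place a CephManager is ever built.
    ctx.manager = make_manager()

def rados_task(ctx, config):
    manager = ctx.manager                # reuse the shared instance
    return manager.create_pool_with_unique_name()

Because the ceph task finishes before the parallel workloads start (see the tasks list above), this variant does not even need a lock.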

#10

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 80 to 100