Bug #18014

upgrade:hammer-x fails with exception from ceph-objectstore-tool: exp list-pgs failure with status 127

Added by Tamilarasi muthamizhan over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 11/23/2016
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: jewel
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/hammer-x
Pull request ID:

Description

upgrade:hammer-x:stress-split fails with ceph-objectstore-tool: exp list-pgs failure with status 127

Logs are available at tamil@teuthology:/a/tamil-2016-11-22_22:13:43-upgrade:hammer-x:stress-split-jewel-distro-basic-smithi/570248

config.yaml:

archive_path: /home/teuthworker/archive/tamil-2016-11-22_22:13:43-upgrade:hammer-x:stress-split-jewel-distro-basic-smithi/570248
branch: jewel
description: upgrade:hammer-x:stress-split/{0-tz-eastern.yaml 0-cluster/{openstack.yaml
  start.yaml} 1-hammer-install/hammer.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml
  4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml}
  6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-finish-upgrade/last-osds-and-monc.yaml
  9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
email: ceph-qa@ceph.com
job_id: '570248'
kernel:
  kdb: true
  sha1: distro
last_in_suite: false
machine_type: smithi
name: tamil-2016-11-22_22:13:43-upgrade:hammer-x:stress-split-jewel-distro-basic-smithi
nuke-on-error: true
openstack:
- machine:
    disk: 40
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: jewel
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 25
    log-whitelist:
    - slow request
    - wrongly marked me down
    - objects unfound and apparently lost
    - log bound mismatch
    - soft lockup
    - detected stalls on CPUs
    - failed to encode map e
    sha1: 427f357f0eed32c9ce17590ae9303a94e8b710e7
  ceph-deploy:
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 427f357f0eed32c9ce17590ae9303a94e8b710e7
  workunit:
    sha1: 427f357f0eed32c9ce17590ae9303a94e8b710e7
owner: scheduled_tamil@teuthology
priority: 1000
roles:
- - mon.a
  - mon.b
  - mon.c
  - mds.a
  - osd.0
  - osd.1
  - osd.2
- - osd.3
  - osd.4
  - osd.5
- - client.0
sha1: 427f357f0eed32c9ce17590ae9303a94e8b710e7
suite: upgrade:hammer-x:stress-split
suite_branch: wip-whitelist-crc
suite_path: /home/teuthworker/src/ceph-qa-suite_wip-whitelist-crc
suite_sha1: 4a0f87a62f4e6d778019aa86bab8f427688d6256
tasks:
- exec:
    all:
    - echo America/New_York | sudo tee /etc/timezone
- install:
    branch: hammer
- print: '**** done install hammer'
- ceph:
    fs: xfs
- print: '**** done ceph'
- install.upgrade:
    osd.0: null
- print: '**** done install.upgrade osd.0'
- ceph.restart:
    daemons:
    - osd.0
    - osd.1
    - osd.2
- print: '**** done ceph.restart 1st half'
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    sighup_delay: 0
    timeout: 1200
- print: '**** done thrashosds 3-thrash'
- ceph.restart:
    daemons:
    - mon.a
    wait-for-healthy: false
    wait-for-osds-up: true
- print: '**** done ceph.restart mon.a'
- workunit:
    branch: hammer
    clients:
      client.0:
      - cls/test_cls_rbd.sh
- print: '**** done cls/test_cls_rbd.sh 5-workload'
- workunit:
    branch: hammer
    clients:
      client.0:
      - rbd/import_export.sh
    env:
      RBD_CREATE_ARGS: --new-format
- print: '**** done rbd/import_export.sh 5-workload'
- full_sequential:
  - rados:
      clients:
      - client.0
      objects: 500
      op_weights:
        delete: 10
        read: 45
        write: 45
      ops: 4000
      write_append_excl: false
- print: '**** done rados/readwrite 5-workload'
- full_sequential:
  - rados:
      clients:
      - client.0
      objects: 50
      op_weights:
        delete: 50
        read: 100
        rollback: 50
        snap_create: 50
        snap_remove: 50
        write: 100
      ops: 4000
      write_append_excl: false
- print: '**** done rados/snaps-few-objects 5-workload'
- ceph.restart:
    daemons:
    - mon.b
    wait-for-healthy: false
    wait-for-osds-up: true
- print: '**** done ceph.restart mon.b 6-next-mon'
- full_sequential:
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
  - radosbench:
      clients:
      - client.0
      time: 150
- print: '**** done radosbench 7-workload'
- workunit:
    branch: hammer
    clients:
      client.0:
      - rbd/test_librbd.sh
- print: '**** done rbd/test_librbd.sh 7-workload'
- install.upgrade:
    osd.3: null
- ceph.restart:
    daemons:
    - osd.3
    - osd.4
    - osd.5
    wait-for-healthy: false
    wait-for-osds-up: true
- sleep:
    duration: 10
- ceph.restart:
    daemons:
    - mon.c
    wait-for-healthy: false
    wait-for-osds-up: true
- print: '**** done ceph.restart mon.c 8-next-mon'
- ceph.wait_for_mon_quorum:
  - a
  - b
  - c
- exec:
    osd.0:
    - sleep 300
    - ceph osd set require_jewel_osds
- print: '**** done wait_for_mon_quorum 8-next-mon'
- workunit:
    branch: hammer
    clients:
      client.0:
      - rbd/test_librbd_python.sh
- print: '**** done rbd/test_librbd_python.sh 9-workload'
- rgw:
    client.0: null
    default_idle_timeout: 300
- print: '**** done rgw 9-workload'
- swift:
    client.0:
      rgw_server: client.0
- print: '**** done swift 9-workload'
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
    write_append_excl: false
teuthology_branch: master
tube: smithi
verbose: true
worker_log: /home/teuthworker/archive/worker_logs/worker.smithi.8782

Related issues

Copied to ceph-qa-suite - Backport #18099: jewel: upgrade:hammer-x fails with exception from ceph-objectstore-tool: exp list-pgs failure with status 127 Resolved

History

#1 Updated by Yuri Weinstein over 2 years ago

2016-11-23T04:25:34.641 INFO:teuthology.orchestra.run.smithi038.stdout:Unpacking ceph-base (10.2.3-358-g427f357-1trusty) ...
2016-11-23T04:25:35.756 INFO:tasks.ceph.osd.4:Stopped
2016-11-23T04:25:35.756 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd
2016-11-23T04:25:35.757 INFO:teuthology.orchestra.run.smithi038:Running: 'sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --log-file=/var/log/ceph/objectstore_tool.\\$pid.log --op list-pgs'
2016-11-23T04:25:35.770 INFO:teuthology.orchestra.run.smithi038.stderr:/usr/bin/adjust-ulimits: 16: exec: ceph-objectstore-tool: not found
2016-11-23T04:25:35.772 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/ceph-qa-suite_wip-whitelist-crc/tasks/ceph_manager.py", line 660, in wrapper
    return func(self)
  File "/home/teuthworker/src/ceph-qa-suite_wip-whitelist-crc/tasks/ceph_manager.py", line 714, in do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/ceph-qa-suite_wip-whitelist-crc/tasks/ceph_manager.py", line 217, in kill_osd
    format(ret=proc.exitstatus))
Exception: ceph-objectstore-tool: exp list-pgs failure with status 127

2016-11-23T04:25:35.773 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6397)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 594, in print_exception
    traceback.print_exception(type_, value, tb, file=errstream)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-11-23T04:25:35.773 CRITICAL:root:IOError

#2 Updated by Samuel Just over 2 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Nathan Cutler over 2 years ago

exec: ceph-objectstore-tool: not found

#4 Updated by Nathan Cutler over 2 years ago

  • Status changed from New to Need More Info

I see what is happening here: thrashosds is racing with install.upgrade in an interesting way.

0.94.9, which contains the ceph-objectstore-tool executable, is installed on all test nodes at the beginning of the test. However, install.upgrade works by first removing the old version ("apt-get remove") and then installing ("apt-get install") the new one. It does not do "apt-get upgrade".

osd.4, where the traceback occurs, is running on smithi038. osd.3 is also on this node.

While thrashosds is still running, the test enters "install.upgrade: osd.3: null", which removes the hammer packages on this node. The "apt-get install" command then starts to run as part of the upgrade to jewel. In the window just after ceph-base is installed but before ceph-osd (which, in jewel, contains the ceph-objectstore-tool executable) is installed, the thrashosds task tries to run ceph-objectstore-tool on osd.4 and fails with:

exec: ceph-objectstore-tool: not found
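
For illustration only, here is a minimal Python sketch of one way a task could tolerate that window: skip the ceph-objectstore-tool exercise when the binary is temporarily absent instead of failing with status 127. This is not the actual teuthology/ceph_manager.py code nor the eventual fix; the function name and the skip-on-missing-binary behavior are assumptions, the command-line flags are the ones from the log above, and Python 3 is used here for brevity although the task code in the traceback is Python 2.

import shutil
import subprocess

def list_pgs_or_skip(osd_id):
    """Run 'ceph-objectstore-tool --op list-pgs' for a stopped OSD, but
    skip quietly if the binary is not on PATH (e.g. between the removal
    of the hammer packages and the install of jewel's ceph-osd)."""
    if shutil.which('ceph-objectstore-tool') is None:
        # Binary temporarily missing mid-upgrade: skip this iteration
        # instead of failing with exit status 127 as in the run above.
        return None
    cmd = ['sudo', 'ceph-objectstore-tool',
           '--data-path', '/var/lib/ceph/osd/ceph-%d' % osd_id,
           '--journal-path', '/var/lib/ceph/osd/ceph-%d/journal' % osd_id,
           '--op', 'list-pgs']
    return subprocess.check_output(cmd)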

@Tamilarasi, @Yuri: does this failure happen every time you run the test? Since the failure appears to be caused by a race condition, I would think not.

#5 Updated by Tamilarasi muthamizhan over 2 years ago

  • Status changed from Need More Info to New
  • Priority changed from Urgent to Normal

Latest tests passed: teuthology:/a/tamil-2016-11-23_19:13:38-upgrade:hammer-x:stress-split-jewel-distro-basic-smithi

#6 Updated by Tamilarasi muthamizhan over 2 years ago

@nathan, this doesn't happen every time; it seems to be a race condition (i.e., it is inconsistent).
The latest runs passed.

#7 Updated by Tamilarasi muthamizhan over 2 years ago

  • Priority changed from Normal to Urgent

#8 Updated by Nathan Cutler over 2 years ago

  • Assignee set to Nathan Cutler

#9 Updated by Nathan Cutler over 2 years ago

  • Backport set to hammer,jewel

#10 Updated by Nathan Cutler over 2 years ago

  • Status changed from New to Need Review

#11 Updated by Sage Weil over 2 years ago

  • Status changed from Need Review to Pending Backport
  • Backport changed from hammer,jewel to jewel

#13 Updated by Nathan Cutler over 2 years ago

  • Project changed from Ceph to ceph-qa-suite

#14 Updated by Loic Dachary over 2 years ago

  • Copied to Backport #18099: jewel: upgrade:hammer-x fails with exception from ceph-objectstore-tool: exp list-pgs failure with status 127 added

#15 Updated by Nathan Cutler over 2 years ago

  • Status changed from Pending Backport to Resolved
