Bug #14927: "[ FAILED ] ClsLock.Test*" (4 tests) failed in upgrade:hammer-x-jewel-distro-basic-openstack - Ceph - Ceph

Actions

Copy link

Bug #14927

closed

"[ FAILED ] ClsLock.Test*" (4 tests) failed in upgrade:hammer-x-jewel-distro-basic-openstack

Added by Yuri Weinstein about 8 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Category:

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

upgrade/hammer-x

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/
Jobs: 7199, 7207
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/7207/teuthology.log

2016-02-29T10:40:29.431 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestCookie
2016-02-29T10:40:30.473 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:220: Failure
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-3) failed with error -22" 
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie (1042 ms)
2016-02-29T10:40:30.475 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestMultipleLocks
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:256: Failure
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-4) failed with error -22" 
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks (1052 ms)
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestLockDuration
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:278: Failure
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-5) failed with error -22" 
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration (1189 ms)
2016-02-29T10:40:32.715 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:305: Failure
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-6) failed with error -22" 
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked (1015 ms)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] 6 tests from ClsLock (10597 ms total)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] Global test environment tear-down
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[==========] 6 tests from 1 test case ran. (10597 ms total)
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  PASSED  ] 2 tests.
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] 4 tests, listed below:
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.732 INFO:tasks.workunit.client.0.target067194.stdout: 4 FAILED TESTS

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

Also in run:
http://pulpito.ceph.com/teuthology-2016-02-28_02:10:01-upgrade:hammer-x-jewel-distro-basic-vps/31527/
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-02-28_02:10:01-upgrade:hammer-x-jewel-distro-basic-vps/31527/teuthology.log

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

Related to Bug #14935: "[ FAILED ] ClsHello.Filter" in upgrade:infernalis-x-jewel-distro-basic-vps added

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

From IRC:

yuriw: actually, perhaps it's just a race condition on package install jdillaman
yuriw: it failed creating a pool, so maybe the crushmap tool was partially installed jdillaman @ 9:31    
yuriw - we started seeing  those after changes in packages, maybe be yuriw @ 9:31    
jdillaman ... so most likely the same issue

Actions

Copy link

Updated by Nathan Cutler about 8 years ago

The ceph SHA1 for this job is "d6031c2 Merge pull request #7700 from SUSE/wip-14811"

As of that SHA1 the rados-classes libraries were not packaged correctly. This issue was later fixed by https://github.com/ceph/ceph/commit/5083980f4619bf282dfecde5078f1a73a2ed562a

I don't know for sure whether that commit fixed this bug, but might be worth rerunning these tests on latest jewel.

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

On pre-packadges split code tests passed http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-01_23:53:26-upgrade:hammer-x-wip-jewel-no-package-split-distro-basic-openstack/

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

Last run on jewel: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-03-02_02:52:36-upgrade:hammer-x-jewel-distro-basic-openstack/8041/teuthology.log

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

side-by-side runs before and after the split code, same test

http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-02_19:34:51-upgrade:hammer-x-wip-jewel-no-package-split-distro-basic-openstack/

http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-02_19:33:38-upgrade:hammer-x-jewel-distro-basic-openstack/

Actions

Copy link

Updated by Yuri Weinstein about 8 years ago

run on wip-jewel-no-package-split
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-02_19:34:51-upgrade:hammer-x-wip-jewel-no-package-split-distro-basic-openstack/

Actions

Copy link

Updated by Nathan Cutler about 8 years ago

@Yuri, so the test passes on latest jewel, but fails on wip-jewel-no-package-split? Weird.

Actions

Copy link

#10

Updated by Yuri Weinstein about 8 years ago

@Nathan Weinberg I created a little noise in this ticket, my apologies.

Now on plan of attack.

Dan and I ran several simplified tests rados:clsonly@ and they all passed:
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_18:11:19-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_17:57:03-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ceph.com/teuthology-2016-03-07_10:11:02-rados:clsonly-jewel-distro-basic-smithi/
http://pulpito.ceph.com/teuthology-2016-03-07_10:12:13-rados:clsonly-jewel-distro-basic-vps/

So I think this indicates to us that the problem is really in upgrade part.
I think it'd be useful to simplify the faulty upgrade job, leave only cls tests, shorten run time and reproduce and then maybe add more logging for troubleshooting.

Actions

Copy link

#11

Updated by Yuri Weinstein about 8 years ago

Latest test with errors failed with error -22
http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-07_02:10:10-upgrade:hammer-x-jewel-distro-basic-vps/44586/teuthology.log

Actions

Copy link

#12

Updated by Yuri Weinstein about 8 years ago

Here is a simplified upgrade test that can be used to reproduce the issue (can run on teuthology box):

ubuntu@teuthology:~$ cat ~/yuriw/_run_tests.yaml 
interactive-on-error: true

kernel:
  kdb: true
  sha1: distro

nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: jewel
  ceph:
    conf:
      global:
        mon lease: 15
        mon lease ack timeout: 25
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    - failed to encode map
    - wrongly marked me down
    - soft lockup
    - detected stalls on CPUs
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  ceph-deploy:
    branch:
      dev-commit: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  ceph-fuse:
    client.0:
      mount_timeout: 120
      mount_wait: 60
  install:
    ceph:
      sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
  workunit:
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
priority: 100
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
suite: upgrade:hammer-x:parallel
suite_branch: wip-testing-14927-jewel
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_wip-testing-14927-jewel
suite_sha1: a400d00a73d2bf5eac67304b475f46a99a6bd41f
tasks:
- exec:
    all:
    - echo America/New_York | sudo tee /etc/timezone
- install:
    branch: hammer
- print: '**** done installing hammer'
- ceph:
    fs: xfs
- print: '**** done ceph'
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- exec:
    osd.0:
    - ceph osd set sortbitwise
    - for p in `ceph osd pool ls` ; do ceph osd pool set $p use_gmt_hitset true ;
      done
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- print: '**** done cls/test_cls_rbd.sh 4-final-workload'
teuthology_branch: master
tube: smithi
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - print: '**** done install.upgrade mon.a and mon.b'
  - ceph.restart:
    - mon.a
    - mon.b
    - mon.c
    - mds.a
    - osd.0
    - osd.1
    - osd.2
    - osd.3
  - print: '**** done ceph.restart all'
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.smithi.21695
workload:
  sequential:
  - workunit:
      branch: hammer
      clients:
        client.0:
        - cls
  - print: '**** done cls 2-workload'

Per IRC discussion with Josh and Dan:

the next thing I'd do is add logging to the monitor showing all the pool create parameters and before all the returns in OSDMonitor::prepare_new_pool() so we can figure out at least where the EINVAL is coming from joshd1 @ 2:53
existing debug statements don't have enough detail for that
......
we need to change the ceph code

Actions

Copy link

#13

Updated by Samuel Just about 8 years ago

Priority changed from Normal to Urgent

Actions

Copy link

#14

Updated by Sage Weil about 8 years ago

Is duplicate of Bug #14849: (infernalis) [ FAILED ] LibRadosAioEC.ReturnValuePP in rados-infernalis-distro-basic-openstack added

Actions

Copy link

#15

Updated by Sage Weil about 8 years ago

Status changed from New to Duplicate

Actions

Copy link

#16

Updated by Nathan Cutler about 8 years ago

Status changed from Duplicate to 12

The bug seems to be fixed in infernalis now, based on running rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/ext4.yaml msgr/simple.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/rados_api_tests.yaml} on the infernalis-pr-7987 branch in https://github.com/ceph/ceph/pull/7987

However, based on testing done in https://github.com/ceph/ceph/pull/8052 the bug still manifests in upgrade/hammer-x against a ceph-qa-suite jewel branch modified to install the patched hammer packages.

I'm in over my head so I'll stop talking now.

Actions

Copy link

#17

Updated by Sage Weil about 8 years ago

Status changed from 12 to Resolved

fixed by ceph-qa-suite.git fcde1f6b9f1357e799fe65ce9baaf4b4bb517fbf

Actions

Copy link

#18

Updated by Yuri Weinstein about 8 years ago

Related to Bug #15423: "[ FAILED ] ClsLock.TestLockDuration" in upgrade:firefly-x-hammer-distro-basic-openstack added

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph

Custom queries

Bug #14927

"[ FAILED ] ClsLock.Test*" (4 tests) failed in upgrade:hammer-x-jewel-distro-basic-openstack

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Nathan Cutler about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Nathan Cutler about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Yuri Weinstein about 8 years ago

Updated by Samuel Just about 8 years ago

Updated by Sage Weil about 8 years ago

Updated by Sage Weil about 8 years ago

Updated by Nathan Cutler about 8 years ago

Updated by Sage Weil about 8 years ago

Updated by Yuri Weinstein about 8 years ago