Project

General

Profile

Actions

Bug #14927

closed

"[ FAILED ] ClsLock.Test*" (4 tests) failed in upgrade:hammer-x-jewel-distro-basic-openstack

Added by Yuri Weinstein about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/hammer-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/
Jobs: 7199, 7207
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/7207/teuthology.log

2016-02-29T10:40:29.431 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestCookie
2016-02-29T10:40:30.473 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:220: Failure
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-3) failed with error -22" 
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie (1042 ms)
2016-02-29T10:40:30.475 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestMultipleLocks
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:256: Failure
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-4) failed with error -22" 
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks (1052 ms)
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestLockDuration
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:278: Failure
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-5) failed with error -22" 
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration (1189 ms)
2016-02-29T10:40:32.715 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:305: Failure
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-6) failed with error -22" 
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:Expected: "" 
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked (1015 ms)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] 6 tests from ClsLock (10597 ms total)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] Global test environment tear-down
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[==========] 6 tests from 1 test case ran. (10597 ms total)
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  PASSED  ] 2 tests.
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] 4 tests, listed below:
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.732 INFO:tasks.workunit.client.0.target067194.stdout: 4 FAILED TESTS


Related issues 3 (0 open3 closed)

Related to Ceph - Bug #14935: "[ FAILED ] ClsHello.Filter" in upgrade:infernalis-x-jewel-distro-basic-vpsDuplicate03/01/2016

Actions
Related to Ceph - Bug #15423: "[ FAILED ] ClsLock.TestLockDuration" in upgrade:firefly-x-hammer-distro-basic-openstackWon't Fix

Actions
Is duplicate of Ceph - Bug #14849: (infernalis) [ FAILED ] LibRadosAioEC.ReturnValuePP in rados-infernalis-distro-basic-openstackDuplicate02/23/2016

Actions
Actions #2

Updated by Yuri Weinstein about 8 years ago

  • Related to Bug #14935: "[ FAILED ] ClsHello.Filter" in upgrade:infernalis-x-jewel-distro-basic-vps added
Actions #3

Updated by Yuri Weinstein about 8 years ago

From IRC:

yuriw: actually, perhaps it's just a race condition on package install jdillaman
yuriw: it failed creating a pool, so maybe the crushmap tool was partially installed jdillaman @ 9:31    
yuriw - we started seeing  those after changes in packages, maybe be yuriw @ 9:31    
jdillaman ... so most likely the same issue
Actions #4

Updated by Nathan Cutler about 8 years ago

The ceph SHA1 for this job is "d6031c2 Merge pull request #7700 from SUSE/wip-14811"

As of that SHA1 the rados-classes libraries were not packaged correctly. This issue was later fixed by https://github.com/ceph/ceph/commit/5083980f4619bf282dfecde5078f1a73a2ed562a

I don't know for sure whether that commit fixed this bug, but might be worth rerunning these tests on latest jewel.

Actions #9

Updated by Nathan Cutler about 8 years ago

@Yuri, so the test passes on latest jewel, but fails on wip-jewel-no-package-split? Weird.

Actions #10

Updated by Yuri Weinstein about 8 years ago

@Nathan Weinberg I created a little noise in this ticket, my apologies.

Now on plan of attack.

Dan and I ran several simplified tests rados:clsonly@ and they all passed:
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_18:11:19-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_17:57:03-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ceph.com/teuthology-2016-03-07_10:11:02-rados:clsonly-jewel-distro-basic-smithi/
http://pulpito.ceph.com/teuthology-2016-03-07_10:12:13-rados:clsonly-jewel-distro-basic-vps/

So I think this indicates to us that the problem is really in upgrade part.
I think it'd be useful to simplify the faulty upgrade job, leave only cls tests, shorten run time and reproduce and then maybe add more logging for troubleshooting.

Actions #12

Updated by Yuri Weinstein about 8 years ago

Here is a simplified upgrade test that can be used to reproduce the issue (can run on teuthology box):

ubuntu@teuthology:~$ cat ~/yuriw/_run_tests.yaml 
interactive-on-error: true

kernel:
  kdb: true
  sha1: distro

nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: jewel
  ceph:
    conf:
      global:
        mon lease: 15
        mon lease ack timeout: 25
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    - failed to encode map
    - wrongly marked me down
    - soft lockup
    - detected stalls on CPUs
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  ceph-deploy:
    branch:
      dev-commit: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  ceph-fuse:
    client.0:
      mount_timeout: 120
      mount_wait: 60
  install:
    ceph:
      sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
  workunit:
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
priority: 100
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
suite: upgrade:hammer-x:parallel
suite_branch: wip-testing-14927-jewel
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_wip-testing-14927-jewel
suite_sha1: a400d00a73d2bf5eac67304b475f46a99a6bd41f
tasks:
- exec:
    all:
    - echo America/New_York | sudo tee /etc/timezone
- install:
    branch: hammer
- print: '**** done installing hammer'
- ceph:
    fs: xfs
- print: '**** done ceph'
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- exec:
    osd.0:
    - ceph osd set sortbitwise
    - for p in `ceph osd pool ls` ; do ceph osd pool set $p use_gmt_hitset true ;
      done
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- print: '**** done cls/test_cls_rbd.sh 4-final-workload'
teuthology_branch: master
tube: smithi
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - print: '**** done install.upgrade mon.a and mon.b'
  - ceph.restart:
    - mon.a
    - mon.b
    - mon.c
    - mds.a
    - osd.0
    - osd.1
    - osd.2
    - osd.3
  - print: '**** done ceph.restart all'
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.smithi.21695
workload:
  sequential:
  - workunit:
      branch: hammer
      clients:
        client.0:
        - cls
  - print: '**** done cls 2-workload'

Per IRC discussion with Josh and Dan:

the next thing I'd do is add logging to the monitor showing all the pool create parameters and before all the returns in OSDMonitor::prepare_new_pool() so we can figure out at least where the EINVAL is coming from joshd1 @ 2:53
existing debug statements don't have enough detail for that
......
we need to change the ceph code
Actions #13

Updated by Samuel Just about 8 years ago

  • Priority changed from Normal to Urgent
Actions #14

Updated by Sage Weil about 8 years ago

  • Is duplicate of Bug #14849: (infernalis) [ FAILED ] LibRadosAioEC.ReturnValuePP in rados-infernalis-distro-basic-openstack added
Actions #15

Updated by Sage Weil about 8 years ago

  • Status changed from New to Duplicate
Actions #16

Updated by Nathan Cutler about 8 years ago

  • Status changed from Duplicate to 12

The bug seems to be fixed in infernalis now, based on running rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/ext4.yaml msgr/simple.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/rados_api_tests.yaml} on the infernalis-pr-7987 branch in https://github.com/ceph/ceph/pull/7987

However, based on testing done in https://github.com/ceph/ceph/pull/8052 the bug still manifests in upgrade/hammer-x against a ceph-qa-suite jewel branch modified to install the patched hammer packages.

I'm in over my head so I'll stop talking now.

Actions #17

Updated by Sage Weil about 8 years ago

  • Status changed from 12 to Resolved

fixed by ceph-qa-suite.git fcde1f6b9f1357e799fe65ce9baaf4b4bb517fbf

Actions #18

Updated by Yuri Weinstein about 8 years ago

  • Related to Bug #15423: "[ FAILED ] ClsLock.TestLockDuration" in upgrade:firefly-x-hammer-distro-basic-openstack added
Actions

Also available in: Atom PDF