Bug #14927
closed"[ FAILED ] ClsLock.Test*" (4 tests) failed in upgrade:hammer-x-jewel-distro-basic-openstack
Description
Run: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/
Jobs: 7199, 7207
Logs: http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2016-02-29_10:02:04-upgrade:hammer-x-jewel-distro-basic-openstack/7207/teuthology.log
2016-02-29T10:40:29.431 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestCookie
2016-02-29T10:40:30.473 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:220: Failure
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-3) failed with error -22"
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:Expected: ""
2016-02-29T10:40:30.474 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie (1042 ms)
2016-02-29T10:40:30.475 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestMultipleLocks
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:256: Failure
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:31.526 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-4) failed with error -22"
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:Expected: ""
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks (1052 ms)
2016-02-29T10:40:31.527 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestLockDuration
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:278: Failure
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-5) failed with error -22"
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:Expected: ""
2016-02-29T10:40:32.714 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration (1189 ms)
2016-02-29T10:40:32.715 INFO:tasks.workunit.client.0.target067194.stdout:[ RUN      ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:test/cls_lock/test_cls_lock.cc:305: Failure
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:Value of: create_one_pool_pp(pool_name, cluster)
2016-02-29T10:40:33.729 INFO:tasks.workunit.client.0.target067194.stdout:  Actual: "cluster.pool_create(test-rados-api-target067194.ovh.sepia.ceph.com-12324-6) failed with error -22"
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:Expected: ""
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked (1015 ms)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] 6 tests from ClsLock (10597 ms total)
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[----------] Global test environment tear-down
2016-02-29T10:40:33.730 INFO:tasks.workunit.client.0.target067194.stdout:[==========] 6 tests from 1 test case ran. (10597 ms total)
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  PASSED  ] 2 tests.
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] 4 tests, listed below:
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestCookie
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestMultipleLocks
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestLockDuration
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:[  FAILED  ] ClsLock.TestAssertLocked
2016-02-29T10:40:33.731 INFO:tasks.workunit.client.0.target067194.stdout:
2016-02-29T10:40:33.732 INFO:tasks.workunit.client.0.target067194.stdout: 4 FAILED TESTS
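Error -22 is -EINVAL, returned by the monitor for the pool-create request; as the output shows, the gtest helper create_one_pool_pp() fails inside Rados::pool_create(). Below is a minimal standalone sketch of that call path for anyone reproducing by hand. The pool name and the compile line are illustrative assumptions, and it presumes a reachable cluster with a readable ceph.conf.

// repro_pool_create.cc -- hedged repro sketch; build along the lines
// of: g++ repro_pool_create.cc -lrados
#include <rados/librados.hpp>
#include <cstring>
#include <iostream>

int main() {
  librados::Rados cluster;
  if (cluster.init(NULL) < 0)      // default client identity
    return 1;
  cluster.conf_read_file(NULL);    // default ceph.conf search path
  if (cluster.connect() < 0)
    return 1;

  // The same operation the failing tests perform via create_one_pool_pp().
  int r = cluster.pool_create("test-repro-pool");  // name is illustrative
  if (r < 0)
    std::cerr << "pool_create failed with error " << r << " ("
              << strerror(-r) << ")" << std::endl;  // -22 -> Invalid argument
  else
    cluster.pool_delete("test-repro-pool");
  cluster.shutdown();
  return r < 0 ? 1 : 0;
}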
Updated by Yuri Weinstein about 8 years ago
- Related to Bug #14935: "[ FAILED ] ClsHello.Filter" in upgrade:infernalis-x-jewel-distro-basic-vps added
Updated by Yuri Weinstein about 8 years ago
From IRC:
yuriw: actually, perhaps it's just a race condition on package install
jdillaman: yuriw: it failed creating a pool, so maybe the crushmap tool was partially installed
jdillaman @ 9:31: yuriw - we started seeing those after changes in packages, may be
yuriw @ 9:31: jdillaman ... so most likely the same issue
Updated by Nathan Cutler about 8 years ago
The ceph SHA1 for this job is "d6031c2 Merge pull request #7700 from SUSE/wip-14811"
As of that SHA1 the rados-classes libraries were not packaged correctly. This issue was later fixed by https://github.com/ceph/ceph/commit/5083980f4619bf282dfecde5078f1a73a2ed562a
I don't know for sure whether that commit fixed this bug, but it might be worth rerunning these tests on the latest jewel.
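For background on why the rados-classes packaging matters to these tests: the ClsLock tests exercise the server-side "lock" object class, which each OSD loads as a plugin (libcls_lock.so) from its rados-classes directory. Below is a hedged sketch of how such a class method is dispatched from a client via the generic librados exec() call; the pool name, object name, and empty payload are illustrative assumptions (a real lock request carries a payload encoded by the cls_lock client wrapper). Note that in the failing runs above the tests never got this far, since pool creation itself failed, so the packaging theory concerns the install environment rather than this call path.

// cls_dispatch_sketch.cc -- illustrative sketch only; reuses a
// connected librados::Rados handle as in a typical test setup.
#include <rados/librados.hpp>
#include <iostream>

int call_lock_class(librados::Rados& cluster) {
  librados::IoCtx ioctx;
  int r = cluster.ioctx_create("rbd", ioctx);  // pool name is an assumption
  if (r < 0)
    return r;

  // A real caller would encode a cls_lock operation into 'in'.
  librados::bufferlist in, out;

  // Dispatch method "lock" of object class "lock" on object "obj".
  // If libcls_lock.so cannot be loaded on the OSD, this returns an
  // error instead of executing the class method.
  r = ioctx.exec("obj", "lock", "lock", in, out);
  std::cerr << "cls_lock call returned " << r << std::endl;
  return r;
}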
Updated by Yuri Weinstein about 8 years ago
Tests passed on the code from before the package split: http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-01_23:53:26-upgrade:hammer-x-wip-jewel-no-package-split-distro-basic-openstack/
Updated by Yuri Weinstein about 8 years ago
Side-by-side runs of the same test, before and after the package-split code.
Updated by Nathan Cutler about 8 years ago
@Yuri, so the test passes on latest jewel, but fails on wip-jewel-no-package-split? Weird.
Updated by Yuri Weinstein about 8 years ago
@Nathan Cutler: I created a little noise in this ticket, my apologies.
Now, on the plan of attack.
Dan and I ran several simplified rados:clsonly tests, and they all passed:
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_18:11:19-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ovh.sepia.ceph.com:8081/teuthology-2016-03-07_17:57:03-rados:clsonly-jewel-distro-basic-openstack/
http://pulpito.ceph.com/teuthology-2016-03-07_10:11:02-rados:clsonly-jewel-distro-basic-smithi/
http://pulpito.ceph.com/teuthology-2016-03-07_10:12:13-rados:clsonly-jewel-distro-basic-vps/
So I think this indicates that the problem really is in the upgrade part.
I think it'd be useful to simplify the faulty upgrade job: leave only the cls tests, shorten the run time, reproduce the failure, and then maybe add more logging for troubleshooting.
Updated by Yuri Weinstein about 8 years ago
The latest test run still failed with error -22:
http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-07_02:10:10-upgrade:hammer-x-jewel-distro-basic-vps/44586/teuthology.log
Updated by Yuri Weinstein about 8 years ago
Here is a simplified upgrade test that can be used to reproduce the issue (it can be run on the teuthology box):
ubuntu@teuthology:~$ cat ~/yuriw/_run_tests.yaml
interactive-on-error: true
kernel:
  kdb: true
  sha1: distro
nuke-on-error: true
os_type: ubuntu
os_version: '14.04'
overrides:
  admin_socket:
    branch: jewel
  ceph:
    conf:
      global:
        mon lease: 15
        mon lease ack timeout: 25
        osd heartbeat grace: 100
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
    - failed to encode map
    - wrongly marked me down
    - soft lockup
    - detected stalls on CPUs
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  ceph-deploy:
    branch:
      dev-commit: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  ceph-fuse:
    client.0:
      mount_timeout: 120
      mount_wait: 60
  install:
    ceph:
      sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
  rgw:
    default_idle_timeout: 1200
  s3tests:
    idle_timeout: 1200
  workunit:
    sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
priority: 100
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
  - client.1
sha1: 2c23b6a9d3da77a8ea423b9e578a4f2cfc1f8d3d
suite: upgrade:hammer-x:parallel
suite_branch: wip-testing-14927-jewel
suite_path: /var/lib/teuthworker/src/ceph-qa-suite_wip-testing-14927-jewel
suite_sha1: a400d00a73d2bf5eac67304b475f46a99a6bd41f
tasks:
- exec:
    all:
    - echo America/New_York | sudo tee /etc/timezone
- install:
    branch: hammer
- print: '**** done installing hammer'
- ceph:
    fs: xfs
- print: '**** done ceph'
- parallel:
  - workload
  - upgrade-sequence
- print: '**** done parallel'
- install.upgrade:
    client.0: null
- exec:
    osd.0:
    - ceph osd set sortbitwise
    - for p in `ceph osd pool ls` ; do ceph osd pool set $p use_gmt_hitset true ; done
- workunit:
    clients:
      client.1:
      - cls/test_cls_rbd.sh
- print: '**** done cls/test_cls_rbd.sh 4-final-workload'
teuthology_branch: master
tube: smithi
upgrade-sequence:
  sequential:
  - install.upgrade:
      mon.a: null
      mon.b: null
  - print: '**** done install.upgrade mon.a and mon.b'
  - ceph.restart:
    - mon.a
    - mon.b
    - mon.c
    - mds.a
    - osd.0
    - osd.1
    - osd.2
    - osd.3
  - print: '**** done ceph.restart all'
verbose: true
worker_log: /var/lib/teuthworker/archive/worker_logs/worker.smithi.21695
workload:
  sequential:
  - workunit:
      branch: hammer
      clients:
        client.0:
        - cls
  - print: '**** done cls 2-workload'
Per IRC discussion with Josh and Dan:
joshd1: the next thing I'd do is add logging to the monitor showing all the pool create parameters, and before all the returns in OSDMonitor::prepare_new_pool(), so we can figure out at least where the EINVAL is coming from
joshd1 @ 2:53: existing debug statements don't have enough detail for that ... we need to change the ceph code
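Below is a self-contained sketch of the kind of instrumentation Josh describes: log the incoming pool-create parameters on entry and tag every early return, so the mon log would show exactly which validation step produced the EINVAL. The parameter list and the checks are illustrative assumptions, not the real OSDMonitor::prepare_new_pool() body, and std::clog stands in for Ceph's dout() macros.

// prepare_new_pool_logging_sketch.cc -- hypothetical stand-in for the
// mon-side change; compiles and runs on its own.
#include <cerrno>
#include <cstdint>
#include <iostream>
#include <string>

static int prepare_new_pool_sketch(const std::string& name,
                                   uint32_t pg_num, uint32_t pgp_num,
                                   int crush_ruleset) {
  // Log all parameters on entry, as suggested.
  std::clog << "prepare_new_pool name=" << name << " pg_num=" << pg_num
            << " pgp_num=" << pgp_num << " ruleset=" << crush_ruleset << "\n";
  // Tag each return so the failing check is identifiable in the log.
  if (name.empty()) {
    std::clog << "prepare_new_pool: empty pool name -> EINVAL\n";
    return -EINVAL;
  }
  if (pg_num == 0 || pgp_num > pg_num) {
    std::clog << "prepare_new_pool: bad pg counts -> EINVAL\n";
    return -EINVAL;
  }
  if (crush_ruleset < 0) {  // e.g. ruleset lookup failed
    std::clog << "prepare_new_pool: no such crush ruleset -> EINVAL\n";
    return -EINVAL;
  }
  return 0;
}

int main() {
  // -EINVAL is -22, matching the "failed with error -22" in the logs.
  std::cout << prepare_new_pool_sketch("foo", 0, 0, 0) << "\n";
}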
Updated by Samuel Just about 8 years ago
- Priority changed from Normal to Urgent
Updated by Sage Weil about 8 years ago
- Is duplicate of Bug #14849: (infernalis) [ FAILED ] LibRadosAioEC.ReturnValuePP in rados-infernalis-distro-basic-openstack added
Updated by Nathan Cutler about 8 years ago
- Status changed from Duplicate to 12
The bug seems to be fixed in infernalis now, based on running rados/thrash/{hobj-sort.yaml 0-size-min-size-overrides/3-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/ext4.yaml msgr/simple.yaml msgr-failures/osd-delay.yaml thrashers/default.yaml workloads/rados_api_tests.yaml}
on the infernalis-pr-7987 branch in https://github.com/ceph/ceph/pull/7987
However, based on testing done in https://github.com/ceph/ceph/pull/8052, the bug still manifests in upgrade/hammer-x against a ceph-qa-suite jewel branch modified to install the patched hammer packages.
I'm in over my head so I'll stop talking now.
Updated by Sage Weil about 8 years ago
- Status changed from 12 to Resolved
Fixed by ceph-qa-suite.git commit fcde1f6b9f1357e799fe65ce9baaf4b4bb517fbf.
Updated by Yuri Weinstein about 8 years ago
- Related to Bug #15423: "[ FAILED ] ClsLock.TestLockDuration" in upgrade:firefly-x-hammer-distro-basic-openstack added