Bug #13279

upgrade suite: pool_create failed with error -4 EINTR

Added by Yuri Weinstein over 8 years ago. Updated over 8 years ago.

Status: Duplicate
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/firefly-hammer-x
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/
Job: ['1076297']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/1076297/teuthology.log

2015-09-29T11:13:07.293 INFO:tasks.ceph.osd.3:Restarting daemon
2015-09-29T11:13:07.293 INFO:tasks.ceph.osd.3:Stopping old one...
2015-09-29T11:13:07.293 DEBUG:tasks.ceph.osd.3:waiting for process to exit
2015-09-29T11:13:07.578 INFO:tasks.ceph.osd.2.vpm055.stdout:starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2015-09-29T11:13:07.601 INFO:tasks.ceph.osd.2.vpm055.stderr:2015-09-29 18:13:07.593229 7fa603c65900 -1 filestore(/var/lib/ceph/osd/ceph-2) FileStore::mount : stale version stamp detected: 3. Proceeding, do_update is set, performing disk format upgrade.
2015-09-29T11:13:07.685 INFO:tasks.ceph.osd.2.vpm055.stderr:2015-09-29 18:13:07.680160 7fa603c65900 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2015-09-29T11:13:07.713 INFO:tasks.ceph.osd.2.vpm055.stderr:2015-09-29 18:13:07.707706 7fa603c65900 -1 osd.2 286 PGs are upgrading
2015-09-29T11:13:07.804 INFO:tasks.ceph.osd.2.vpm055.stderr:2015-09-29 18:13:07.799305 7fa603c65900 -1 osd.2 286 log_to_monitors {default=true}
2015-09-29T11:13:10.928 INFO:tasks.ceph.mon.b.vpm055.stderr:2015-09-29 18:13:10.921432 7f871fe56700 -1 mon.b@0(leader).mds e5 Missing health data for MDS 4112
2015-09-29T11:13:10.980 INFO:tasks.workunit.client.0.vpm026.stdout:test/librados/aio.cc:2817: Failure
2015-09-29T11:13:10.980 INFO:tasks.workunit.client.0.vpm026.stdout:Value of: test_data.init()
2015-09-29T11:13:10.980 INFO:tasks.workunit.client.0.vpm026.stdout:  Actual: "create_one_ec_pool(test-rados-api-vpm026-12090-64) failed: error mon_command osd pool create pool:test-rados-api-vpm026-12090-64 pool_type:erasure failed with error -4" 
2015-09-29T11:13:10.981 INFO:tasks.workunit.client.0.vpm026.stdout:Expected: "" 
2015-09-29T11:13:10.981 INFO:tasks.workunit.client.0.vpm026.stdout:[  FAILED  ] LibRadosAioEC.MultiWritePP (17363 ms)
2015-09-29T11:13:10.981 INFO:tasks.workunit.client.0.vpm026.stdout:[----------] 31 tests from LibRadosAioEC (294972 ms total)
2015-09-29T11:13:10.981 INFO:tasks.workunit.client.0.vpm026.stdout:
2015-09-29T11:13:10.981 INFO:tasks.workunit.client.0.vpm026.stdout:[----------] Global test environment tear-down
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout:[==========] 62 tests from 2 test cases ran. (421478 ms total)
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout:[  PASSED  ] 61 tests.
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout:[  FAILED  ] 1 test, listed below:
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout:[  FAILED  ] LibRadosAioEC.MultiWritePP
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout:
2015-09-29T11:13:10.982 INFO:tasks.workunit.client.0.vpm026.stdout: 1 FAILED TEST
2015-09-29T11:13:10.983 INFO:tasks.workunit:Stopping ['rados/test-upgrade-v9.0.1.sh', 'cls'] on client.0...
2015-09-29T11:13:10.983 INFO:teuthology.orchestra.run.vpm026:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0'

Related issues

Related to Ceph - Fix #12953: mon: cull / scrub unused crush rules when pools are deleted New
Related to Ceph - Bug #13664: tests: testprofile must be removed before it is re-created Resolved 10/31/2015
Duplicates Ceph - Backport #13401: mon: fix crush testing for new pools Resolved 10/07/2015

History

#1 Updated by Yuri Weinstein over 8 years ago

  • ceph-qa-suite upgrade/firefly-hammer-x added

#3 Updated by Yuri Weinstein over 8 years ago

  • Subject changed from "[ FAILED ] LibRadosAioEC.MultiWritePP" in upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps run to "[ FAILED ] LibRadosAioEC.*" tests in upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps run

#4 Updated by Yuri Weinstein over 8 years ago

  • Subject changed from "[ FAILED ] LibRadosAioEC.*" tests in upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps run to "[ FAILED ] LibRadosAioEC.*" tests failed in upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps run

#5 Updated by Yuri Weinstein over 8 years ago

  • Priority changed from Normal to Urgent

#6 Updated by Loïc Dachary over 8 years ago

  • Status changed from New to In Progress
  • Assignee set to Loïc Dachary

#7 Updated by Yuri Weinstein over 8 years ago

Loïc, I see a similar failure in this run; assuming it is a dupe for now:
http://pulpito.ceph.com/teuthology-2015-10-05_17:05:09-upgrade:giant-x-hammer-distro-basic-vps/
['1089918', '1089925', '1089958']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-10-05_17:05:09-upgrade:giant-x-hammer-distro-basic-vps/1089918/teuthology.log

2015-10-05T21:59:44.107 INFO:tasks.mon_thrash.ceph_manager:quorum is size 2
2015-10-05T21:59:44.107 INFO:teuthology.orchestra.run.vpm185:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.130.185:6789 mon_status'
2015-10-05T22:00:00.125 INFO:tasks.workunit.client.1.vpm050.stdout:test/librados/aio.cc:167: Failure
2015-10-05T22:00:00.125 INFO:tasks.workunit.client.1.vpm050.stdout:Value of: test_data.init()
2015-10-05T22:00:00.125 INFO:tasks.workunit.client.1.vpm050.stdout:  Actual: "create_one_pool(test-rados-api-vpm050-15351-1) failed: error rados_pool_create(test-rados-api-vpm050-15351-1) failed with error -4" 
2015-10-05T22:00:00.125 INFO:tasks.workunit.client.1.vpm050.stdout:Expected: "" 
2015-10-05T22:00:00.126 INFO:tasks.workunit.client.1.vpm050.stdout:[  FAILED  ] LibRadosAio.SimpleWrite (56419 ms)
.......

2015-10-06T06:38:26.649 INFO:tasks.workunit.client.1.vpm050.stdout:[  FAILED  ] 1 test, listed below:
2015-10-06T06:38:26.650 INFO:tasks.workunit.client.1.vpm050.stdout:[  FAILED  ] LibRadosAio.SimpleWrite
2015-10-06T06:38:26.650 INFO:tasks.workunit.client.1.vpm050.stdout:
2015-10-06T06:38:26.650 INFO:tasks.workunit.client.1.vpm050.stdout: 1 FAILED TEST

#8 Updated by Yuri Weinstein over 8 years ago

  • Release set to hammer
  • ceph-qa-suite upgrade/giant-x added

#9 Updated by Loïc Dachary over 8 years ago

  • Subject changed from "[ FAILED ] LibRadosAioEC.*" tests failed in upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps run to upgrade suite: pool_create failed with error -4 EINTR
  • Release deleted (hammer)
  • ceph-qa-suite deleted (upgrade/giant-x)

#10 Updated by Loïc Dachary over 8 years ago

The bug was first seen in early September, shortly after the following commits were merged in the hammer branch:

$ git log --merges --since 2015-09-01 --until 2015-09-11 --format='%H' ceph/hammer | while read sha1 ; do echo ; git log --format='** %aD "%s":https://github.com/ceph/ceph/commit/%H' ${sha1}^1..${sha1} ; done | perl -p -e 'print "* \"PR $1\":https://github.com/ceph/ceph/pull/$1\n" if(/Merge pull request #(\d+)/)'

#11 Updated by Loïc Dachary over 8 years ago

On the teuthology machine, in the /a directory,

for run in *2015-{05,06,07,08,09,10}*upgrade* ; do for job in $run/* ; do test -d $job || continue ; config=$job/config.yaml ; test -f $config || continue ; summary=$job/summary.yaml ; test -f $summary || continue ; if shyaml get-value branch < $config | grep -q hammer && shyaml get-value success < $summary | grep -qi false && grep -q 'error -4' $job/teuthology.log  ; then echo $job ; fi ; done ; done
teuthology-2015-09-11_17:18:07-upgrade:firefly-x-hammer-distro-basic-vps/1051109
teuthology-2015-09-14_17:18:07-upgrade:firefly-x-hammer-distro-basic-vps/1056283
teuthology-2015-09-25_17:18:08-upgrade:firefly-x-hammer-distro-basic-vps/1069265
teuthology-2015-10-02_17:05:01-upgrade:giant-x-hammer-distro-basic-vps/1084647
teuthology-2015-10-05_17:05:09-upgrade:giant-x-hammer-distro-basic-vps/1089918
teuthology-2015-10-05_17:05:09-upgrade:giant-x-hammer-distro-basic-vps/1089925

This shows the error first appeared on September 9th, 2015 with upgrade:firefly-x/stress-split/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-mon/mona.yaml 5-workload/{rbd-cls.yaml rbd-import-export.yaml readwrite.yaml snaps-few-objects.yaml} 6-next-mon/monb.yaml 7-workload/{radosbench.yaml rbd_api.yaml} 8-next-mon/monc.yaml 9-workload/{rbd-python.yaml rgw-swift.yaml snaps-many-objects.yaml} distros/ubuntu_14.04.yaml}
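
For readability, the one-liner above can be spread over several lines. This is only a sketch of the same filter, not what was actually run: the function name find_failed_jobs is hypothetical, it globs every *upgrade* run rather than a date range, and it reads the YAML with grep instead of shyaml, which assumes "branch" and "success" are simple top-level keys.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list jobs that ran on a given branch, failed, and
# whose teuthology.log contains a given pattern.
# Usage: find_failed_jobs ARCHIVE_DIR BRANCH PATTERN
find_failed_jobs() {
    local archive=$1 branch=$2 pattern=$3
    local job config summary
    for job in "$archive"/*upgrade*/*; do
        [ -d "$job" ] || continue
        config=$job/config.yaml
        summary=$job/summary.yaml
        [ -f "$config" ] && [ -f "$summary" ] || continue
        # Assumption: "branch" and "success" are top-level scalar keys,
        # so grep is enough and shyaml is not required.
        grep -q "^branch:.*$branch" "$config" || continue
        grep -qi '^success: *false' "$summary" || continue
        grep -q -e "$pattern" "$job/teuthology.log" 2>/dev/null || continue
        # Print the job path relative to the archive directory.
        echo "${job#"$archive"/}"
    done
}
```

On the teuthology machine this would be called as find_failed_jobs /a hammer 'error -4'.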

#12 Updated by Loïc Dachary over 8 years ago

$ git log --merges --since 2015-09-01 --until 2015-09-11 --format='%H' ceph/firefly | while read sha1 ; do echo ; git log --format='** %aD "%s":https://github.com/ceph/ceph/commit/%H' ${sha1}^1..${sha1} ; done | perl -p -e 'print "* \"PR $1\":https://github.com/ceph/ceph/pull/$1\n" if(/Merge pull request #(\d+)/)'

#13 Updated by Loïc Dachary over 8 years ago

The following can't be the cause because it allows pool deletion by default and no ceph-qa-suite branch changes that.

The following, merged into firefly on Sept 4th, would return ENOTSUP, so it is not a candidate.

#14 Updated by Loïc Dachary over 8 years ago

Nothing significant was merged in the hammer ceph-qa-suite branch before the problems started showing up. The giant-x and firefly-x upgrade suites have not been modified since July 2015 in the hammer branch.

#15 Updated by Loïc Dachary over 8 years ago

The first occurrence in tests involving the infernalis branch was on September 13, 2015.

$ for run in *2015-{08,09}*upgrade* ; do for job in $run/* ; do test -d $job || continue ; config=$job/config.yaml ; test -f $config || continue ; summary=$job/summary.yaml ; test -f $summary || continue ; if shyaml get-value branch < $config | grep -q infernalis && shyaml get-value success < $summary | grep -qi false && grep -q 'error -4' $job/teuthology.log  ; then echo $job ; fi ; done ; done
teuthology-2015-09-09_13:18:06-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1047659
teuthology-2015-09-09_13:18:06-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1047660
teuthology-2015-09-09_13:18:06-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1047661
teuthology-2015-09-09_13:18:06-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1047662
teuthology-2015-09-11_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1050696
teuthology-2015-09-11_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1050699
teuthology-2015-09-11_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1050701
teuthology-2015-09-14_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1055798
teuthology-2015-09-14_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1055799
teuthology-2015-09-14_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1055800
teuthology-2015-09-14_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1055801
teuthology-2015-09-16_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1060621
teuthology-2015-09-16_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1060622
teuthology-2015-09-16_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1060623
teuthology-2015-09-16_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1060624
teuthology-2015-09-23_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1066645
teuthology-2015-09-23_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1066646
teuthology-2015-09-23_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1066647
teuthology-2015-09-23_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1066648
teuthology-2015-09-28_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1074178
teuthology-2015-09-28_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1074179
teuthology-2015-09-28_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1074180
teuthology-2015-09-28_13:18:08-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1074181
teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/1076295
teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/1076296
teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/1076297
teuthology-2015-09-29_10:41:00-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-vps/1076298
teuthology-2015-09-30_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1078473
teuthology-2015-09-30_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1078474
teuthology-2015-09-30_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1078475
teuthology-2015-09-30_13:18:07-upgrade:firefly-hammer-x:parallel-infernalis-distro-basic-multi/1078476

#16 Updated by Loïc Dachary over 8 years ago

In the hammer upgrade/giant-x suite, the clients are installed with giant. The same giant clients are used during the whole upgrade test, meaning the tests from giant are actually run all along.

#17 Updated by Loïc Dachary over 8 years ago

Because of the crush ruleset validation introduced by https://github.com/ceph/ceph/pull/5276 in hammer on September 6th, pool creation may fail if

  • the crush rule is invalid
  • the validation takes more than X seconds, X being less than the mon lease

The implicit crush ruleset in firefly had max_size 20, which makes it very expensive to verify. This is the most probable cause of the EINTR: crush validation did not complete in time.
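
To illustrate the failure mode only (this is not the monitor code), here is a minimal shell sketch of a check that runs under a deadline and reports an expired deadline as errno 4 (EINTR on Linux). The function name validate_with_deadline is hypothetical, and the mapping of a timeout to EINTR is assumed from the analysis above. It relies on GNU timeout, which exits with status 124 when the deadline expires.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: run a validation command under a deadline and
# report a timeout as 4 (EINTR on Linux), mirroring the "error -4" in the logs.
EINTR=4

# Usage: validate_with_deadline SECONDS COMMAND [ARGS...]
validate_with_deadline() {
    local deadline=$1
    shift
    timeout "$deadline" "$@"
    local rc=$?
    # GNU timeout exits with 124 when the command ran past the deadline.
    if [ "$rc" -eq 124 ]; then
        return "$EINTR"
    fi
    return "$rc"
}
```

A slow check such as validate_with_deadline 1 sleep 3 returns 4, while a check that finishes in time returns its own exit status.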

#19 Updated by Loïc Dachary over 8 years ago

  • Status changed from In Progress to Duplicate

#20 Updated by Yuri Weinstein over 8 years ago

  • Related to Bug #13664: tests: testprofile must be removed before it is re-created added
