Bug #53294
rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush
Status: Closed
% Done: 100%
Description
2021-11-16T03:22:42.142 INFO:tasks.workunit.client.0.smithi007.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsPP.TierFlushDuringFlush
2021-11-16T03:23:18.724 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:18.724+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T03:23:19.734 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:19.734+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T03:23:20.779 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:20.778+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
... slow ops continue and then
2021-11-16T06:14:37.298 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T06:14:37.298+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 17 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T06:14:37.934 INFO:tasks.workunit.client.0.smithi007.stderr:++ cleanup
2021-11-16T06:14:37.934 INFO:tasks.workunit.client.0.smithi007.stderr:++ pkill -P 37050
2021-11-16T06:14:37.948 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-11-16T06:14:37.957 INFO:tasks.workunit.client.0.smithi007.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 37267 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2021-11-16T06:14:37.958 INFO:tasks.workunit.client.0.smithi007.stderr:++ true
2021-11-16T06:14:37.958 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
/a/yuriw-2021-11-15_19:24:05-rados-wip-yuri8-testing-2021-11-15-0845-distro-basic-smithi/6504577
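The hang is on a [tier-flush] op (CEPH_OSD_OP_TIER_FLUSH) that never receives a reply, so the client blocks until the workunit's "timeout 3h" wrapper kills the test, which is what the exit status 124 above means. Below is a minimal sketch of how such an op is issued through the librados C++ API, in the style of the api_tier_pp tests; the pool name and setup are illustrative, not the exact test body:

// Sketch: issue a tier-flush against "foo", the op shown hanging in the
// logs above. If the OSD never replies (the bug), wait_for_complete()
// blocks forever and the run only ends via `timeout 3h` -> status 124.
// Build with: g++ sketch.cc -lrados
#include <rados/librados.hpp>
#include <cassert>

int main() {
  librados::Rados cluster;
  assert(cluster.init2("client.admin", "ceph", 0) == 0);
  assert(cluster.conf_read_file(nullptr) == 0);  // default ceph.conf
  assert(cluster.connect() == 0);

  librados::IoCtx ioctx;
  assert(cluster.ioctx_create("base-pool", ioctx) == 0);  // illustrative pool name

  librados::ObjectReadOperation op;
  op.tier_flush();  // CEPH_OSD_OP_TIER_FLUSH, shown as [tier-flush] in osd_op

  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  assert(ioctx.aio_operate("foo", c, &op,
                           librados::OPERATION_IGNORE_CACHE, nullptr) == 0);
  c->wait_for_complete();  // <-- blocks indefinitely when no reply comes
  int r = c->get_return_value();
  c->release();
  cluster.shutdown();
  return r;
}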
Updated by Laura Flores over 2 years ago
/a/yuriw-2022-01-04_21:52:15-rados-wip-yuri7-testing-2022-01-04-1159-distro-default-smithi/6595525
Updated by Laura Flores over 2 years ago
Later on in the example Neha originally posted (/a/yuriw-2021-11-15_19:24:05-rados-wip-yuri8-testing-2021-11-15-0845-distro-basic-smithi/6504577):
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6cfddd5e1c7d5ede1367cf3087c21af44eb88356/teuthology/contextutil.py", line 33, in nested
yield vars
File "/home/teuthworker/src/github.com_ceph_ceph-c_ce5aedd82c63c16e3daf1ca23714cd4e95197841/qa/tasks/ceph.py", line 1905, in task
osd_scrub_pgs(ctx, config)
File "/home/teuthworker/src/github.com_ceph_ceph-c_ce5aedd82c63c16e3daf1ca23714cd4e95197841/qa/tasks/ceph.py", line 1299, in osd_scrub_pgs
raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')
RuntimeError: Exiting scrub checking -- not all pgs scrubbed.
Updated by Laura Flores over 2 years ago
- Related to Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount added
Updated by Laura Flores about 2 years ago
/a/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642148
Updated by Laura Flores about 2 years ago
/a/sseshasa-2022-02-24_11:27:07-rados-wip-45118-45121-quincy-testing-distro-default-smithi/6704402
2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700 1 -- 172.21.15.59:0/2561782390 shutdown_connections
2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700 1 -- 172.21.15.59:0/2561782390 wait complete.
2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700 1 librados: shutdown
2022-02-24T16:24:14.971 INFO:tasks.ceph.osd.3.smithi131.stderr:2022-02-24T16:24:14.967+0000 7ff37a224700 -1 osd.3 435 get_health_metrics reporting 12 slow ops, oldest is osd_op(client.5375.0:7678 216.1f 216:f858ebb2:test-rados-api-smithi131-37840-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e434)
2022-02-24T16:24:15.976 INFO:tasks.workunit.client.0.smithi131.stderr:++ cleanup
2022-02-24T16:24:15.977 INFO:tasks.workunit.client.0.smithi131.stderr:++ pkill -P 37668
2022-02-24T16:24:16.021 INFO:tasks.ceph.osd.3.smithi131.stderr:2022-02-24T16:24:16.018+0000 7ff37a224700 -1 osd.3 435 get_health_metrics reporting 12 slow ops, oldest is osd_op(client.5375.0:7678 216.1f 216:f858ebb2:test-rados-api-smithi131-37840-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e434)
2022-02-24T16:24:16.064 INFO:tasks.workunit.client.0.smithi131.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 37837 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-02-24T16:24:16.064 INFO:tasks.workunit.client.0.smithi131.stderr:++ true
2022-02-24T16:24:16.065 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-02-24T16:24:16.066 INFO:tasks.workunit.client.0.smithi131.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsPP.TierFlushDuringFl
2022-02-24T16:24:16.066 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-02-24T16:24:16.067 DEBUG:teuthology.orchestra.run.smithi131:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-02-24T16:24:16.329 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/run_tasks.py", line 91, in run_tasks
manager = run_one_task(taskname, ctx=ctx, config=config)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/run_tasks.py", line 70, in run_one_task
return task(**kwargs)
File "/home/teuthworker/src/git.ceph.com_ceph-c_6e50212646f4f394e696f924e961e76f4719acaa/qa/tasks/workunit.py", line 135, in task
coverage_and_limits=not config.get('no_coverage_and_limits', None))
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
File "/home/teuthworker/src/git.ceph.com_ceph-c_6e50212646f4f394e696f924e961e76f4719acaa/qa/tasks/workunit.py", line 427, in _run_tests
label="workunit test {workunit}".format(workunit=workunit)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/remote.py", line 509, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 183, in _raise_for_status
node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi131 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6e50212646f4f394e696f924e961e76f4719acaa TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
2022-02-24T16:24:16.426 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=bc56e659c18e46069fb53816b130a57f
Updated by Laura Flores about 2 years ago
- Status changed from New to Duplicate
Marking this one as the duplicate because the other Tracker has the PR attached to it.
Updated by Laura Flores about 2 years ago
- Related to deleted (Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount)
Updated by Laura Flores about 2 years ago
- Is duplicate of Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount added
Updated by Laura Flores almost 2 years ago
- Status changed from Duplicate to New
- Backport set to quincy
There are still some occurrences of this type of failure in Quincy, which includes the backport of #53855. So, I am removing the duplicate status of this Tracker and re-opening this since it seems like it requires a different fix.
/a/yuriw-2022-06-16_16:41:04-rados-wip-yuri6-testing-2022-06-16-0651-quincy-distro-default-smithi/6882364
2022-06-17T00:05:17.348 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:17.346+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:18.359 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:18.359+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:19.332 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:19.332+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:19.948 INFO:tasks.workunit.client.0.smithi101.stderr:++ cleanup
2022-06-17T00:05:19.948 INFO:tasks.workunit.client.0.smithi101.stderr:++ pkill -P 107805
2022-06-17T00:05:19.972 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-06-17T00:05:19.992 INFO:tasks.workunit.client.0.smithi101.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 108002 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-06-17T00:05:19.992 INFO:tasks.workunit.client.0.smithi101.stderr:++ true
2022-06-17T00:05:19.993 INFO:tasks.workunit.client.0.smithi101.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsPP.TierFlushDuringFlus
2022-06-17T00:05:19.994 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-06-17T00:05:19.995 DEBUG:teuthology.orchestra.run.smithi101:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-06-17T00:05:20.242 ERROR:teuthology.run_tasks:Saw exception from tasks.
Updated by Laura Flores almost 2 years ago
- Is duplicate of deleted (Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount)
Updated by Neha Ojha almost 2 years ago
- Assignee set to Myoungwon Oh
Hi Myoungwon Oh, can you please help take a look at this issue?
Updated by Myoungwon Oh almost 2 years ago
I think this is the same issue as https://tracker.ceph.com/issues/53855.
Updated by Laura Flores almost 2 years ago
Myoungwon Oh wrote:
I think this is the same issue as https://tracker.ceph.com/issues/53855.
I thought so too, but the instance from https://tracker.ceph.com/issues/53294#note-9 (/a/yuriw-2022-06-16_16:41:04-rados-wip-yuri6-testing-2022-06-16-0651-quincy-distro-default-smithi/6882364) was caught on a Quincy test branch that contained a Quincy backport commit of #53855. That indicates to me that this might have a different root cause (or that the original fix hasn't stopped the issue).
Updated by Myoungwon Oh almost 2 years ago
https://github.com/ceph/ceph/pull/46866
I found that there is no reply when a message is sent with invalid pool information (e.g., 0:0000:foo:head), so I posted the PR above to return an error code early.
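In other words, an op whose target fails validation was being dropped without any reply, so the client waited indefinitely. A hypothetical, self-contained sketch of the shape of that fix follows; every name here (OpRequest, reply_op_error, enqueue) is an illustrative stand-in, not the actual code in PR 46866:

// Hypothetical sketch of "return an error code early": an op with an
// invalid target must get an explicit error reply instead of being
// dropped silently, which is what left the tier-flush client hanging.
#include <cerrno>
#include <iostream>

struct OpRequest { bool target_valid; };  // stand-in for the real op type

void reply_op_error(OpRequest&, int err) {  // stand-in for the reply path
  std::cout << "replying with error " << err << "\n";
}
void enqueue(OpRequest&) { std::cout << "enqueued for processing\n"; }

void handle_op(OpRequest& op) {
  if (!op.target_valid) {
    // Before the fix: the op fell through here with no reply, so the
    // client's tier-flush waited forever and surfaced only as slow ops.
    reply_op_error(op, -EINVAL);  // after the fix: unblock the client
    return;
  }
  enqueue(op);  // normal processing path
}

int main() {
  OpRequest bad{false};
  handle_op(bad);  // prints: replying with error -22
}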
Updated by Neha Ojha almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 46866
Updated by Laura Flores almost 2 years ago
Potential Pacific occurrence? Though this one was caught on LibRadosTwoPoolsPP.CachePin rather than LibRadosTwoPoolsPP.TierFlushDuringFlush.
/a/yuriw-2022-07-05_17:38:43-rados-wip-yuri4-testing-2022-07-05-0719-pacific-distro-default-smithi/6915020
2022-07-06T02:26:57.748 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:57.747+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ cleanup
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ pkill -P 104256
2022-07-06T02:26:58.780 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-06T02:26:58.781 INFO:tasks.workunit.client.0.smithi191.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 104457 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-07-06T02:26:58.782 INFO:tasks.workunit.client.0.smithi191.stderr:++ true
2022-07-06T02:26:58.783 INFO:tasks.workunit.client.0.smithi191.stdout: api_tier_pp: [ OK ] LibRadosTwoPoolsPP.CacheP
2022-07-06T02:26:58.783 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-06T02:26:58.785 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-06T02:26:58.797 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:58.796+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:59.053 ERROR:teuthology.run_tasks:Saw exception from tasks.
Updated by Laura Flores almost 2 years ago
- Backport changed from quincy to quincy,pacific
Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939431
2022-07-20T13:40:02.854 INFO:tasks.ceph.osd.4.smithi172.stderr:2022-07-20T13:40:02.846+0000 7ff835809700 -1 osd.4 402 get_health_metrics reporting 25 slow ops, oldest is osd_op(client.5196.0:7489 214.5 214:a1df339b:test-rados-api-smithi007-108562-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e401)
2022-07-20T13:40:03.824 INFO:tasks.workunit.client.0.smithi007.stderr:++ cleanup
2022-07-20T13:40:03.825 INFO:tasks.workunit.client.0.smithi007.stderr:++ pkill -P 108390
2022-07-20T13:40:03.841 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 108558 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stderr:++ true
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stdout: api_tier_pp: [ OK ] LibRadosTwoPoolsPP.Cache
2022-07-20T13:40:03.844 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-20T13:40:03.845 DEBUG:teuthology.orchestra.run.smithi007:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-20T13:40:03.903 INFO:tasks.ceph.osd.4.smithi172.stderr:2022-07-20T13:40:03.895+0000 7ff835809700 -1 osd.4 402 get_health_metrics reporting 25 slow ops, oldest is osd_op(client.5196.0:7489 214.5 214:a1df339b:test-rados-api-smithi007-108562-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e401)
2022-07-20T13:40:04.108 ERROR:teuthology.run_tasks:Saw exception from tasks.
Updated by Neha Ojha over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #56655: quincy: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush added
Updated by Backport Bot over 1 year ago
- Copied to Backport #56656: pacific: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush added
Updated by Konstantin Shalygin 24 days ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100