Bug #53294

rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush

Added by Neha Ojha over 2 years ago. Updated over 1 year ago.

Status:
Pending Backport
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-11-16T03:22:42.142 INFO:tasks.workunit.client.0.smithi007.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsPP.TierFlushDuringFlush
2021-11-16T03:23:18.724 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:18.724+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T03:23:19.734 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:19.734+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T03:23:20.779 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T03:23:20.778+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 1 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
... slow ops continue

and then

2021-11-16T06:14:37.298 INFO:tasks.ceph.osd.7.smithi008.stderr:2021-11-16T06:14:37.298+0000 7f9ab23b6700 -1 osd.7 418 get_health_metrics reporting 17 slow ops, oldest is osd_op(client.5221.0:7666 216.13 216:c97e4bba:test-rados-api-smithi007-37272-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e417)
2021-11-16T06:14:37.934 INFO:tasks.workunit.client.0.smithi007.stderr:++ cleanup
2021-11-16T06:14:37.934 INFO:tasks.workunit.client.0.smithi007.stderr:++ pkill -P 37050
2021-11-16T06:14:37.948 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-11-16T06:14:37.957 INFO:tasks.workunit.client.0.smithi007.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 37267 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2021-11-16T06:14:37.958 INFO:tasks.workunit.client.0.smithi007.stderr:++ true
2021-11-16T06:14:37.958 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...

/a/yuriw-2021-11-15_19:24:05-rados-wip-yuri8-testing-2021-11-15-0845-distro-basic-smithi/6504577
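(Context note: the `got remote process result: 124` above is the exit status GNU coreutils `timeout` returns when the wrapped command exceeds its time limit — here the `timeout 3h` cap that teuthology puts around rados/test.sh, so 124 means "hang, killed by the harness" rather than a test assertion failure. A minimal illustration, assuming a Linux host with coreutils installed:)

```python
import subprocess

# GNU timeout kills the wrapped command when it exceeds its limit and
# exits with status 124 -- the same status teuthology logs when the
# 3-hour cap on rados/test.sh is hit.
result = subprocess.run(["timeout", "1", "sleep", "5"])
print(result.returncode)  # 124: the command was killed for overrunning
```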


Related issues

Copied to RADOS - Backport #56655: quincy: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush Resolved
Copied to RADOS - Backport #56656: pacific: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush New

History

#1 Updated by Laura Flores about 2 years ago

/a/yuriw-2022-01-04_21:52:15-rados-wip-yuri7-testing-2022-01-04-1159-distro-default-smithi/6595525

#2 Updated by Laura Flores about 2 years ago

Later on in the example Neha originally posted (/a/yuriw-2021-11-15_19:24:05-rados-wip-yuri8-testing-2021-11-15-0845-distro-basic-smithi/6504577):

Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6cfddd5e1c7d5ede1367cf3087c21af44eb88356/teuthology/contextutil.py", line 33, in nested
    yield vars
  File "/home/teuthworker/src/github.com_ceph_ceph-c_ce5aedd82c63c16e3daf1ca23714cd4e95197841/qa/tasks/ceph.py", line 1905, in task
    osd_scrub_pgs(ctx, config)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_ce5aedd82c63c16e3daf1ca23714cd4e95197841/qa/tasks/ceph.py", line 1299, in osd_scrub_pgs
    raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')
RuntimeError: Exiting scrub checking -- not all pgs scrubbed.
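(The traceback above is teuthology's shutdown path giving up: `osd_scrub_pgs` polls the cluster for fresh scrubs on every PG and raises once its retries are exhausted. A rough sketch of that poll-and-give-up pattern — the callable, retry count, and delay here are illustrative stand-ins, not teuthology's actual query or tunables:)

```python
import time

def wait_for_all_pgs_scrubbed(all_scrubbed, retries=5, delay=0.01):
    """Poll until every PG reports a fresh scrub, or give up.

    `all_scrubbed` is a hypothetical callable standing in for the
    cluster query that qa/tasks/ceph.py:osd_scrub_pgs performs.
    """
    for _ in range(retries):
        if all_scrubbed():
            return
        time.sleep(delay)
    # Mirrors the error raised in osd_scrub_pgs in the traceback above.
    raise RuntimeError('Exiting scrub checking -- not all pgs scrubbed.')
```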

#3 Updated by Laura Flores about 2 years ago

  • Related to Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount added

#4 Updated by Laura Flores about 2 years ago

/a/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642148

#5 Updated by Laura Flores about 2 years ago

/a/sseshasa-2022-02-24_11:27:07-rados-wip-45118-45121-quincy-testing-distro-default-smithi/6704402

2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700  1 -- 172.21.15.59:0/2561782390 shutdown_connections
2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700  1 -- 172.21.15.59:0/2561782390 wait complete.
2022-02-24T16:24:14.406 INFO:teuthology.orchestra.run.smithi059.stderr:2022-02-24T16:24:14.398+0000 7fce5cd9e700  1 librados: shutdown
2022-02-24T16:24:14.971 INFO:tasks.ceph.osd.3.smithi131.stderr:2022-02-24T16:24:14.967+0000 7ff37a224700 -1 osd.3 435 get_health_metrics reporting 12 slow ops, oldest is osd_op(client.5375.0:7678 216.1f 216:f858ebb2:test-rados-api-smithi131-37840-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e434)
2022-02-24T16:24:15.976 INFO:tasks.workunit.client.0.smithi131.stderr:++ cleanup
2022-02-24T16:24:15.977 INFO:tasks.workunit.client.0.smithi131.stderr:++ pkill -P 37668
2022-02-24T16:24:16.021 INFO:tasks.ceph.osd.3.smithi131.stderr:2022-02-24T16:24:16.018+0000 7ff37a224700 -1 osd.3 435 get_health_metrics reporting 12 slow ops, oldest is osd_op(client.5375.0:7678 216.1f 216:f858ebb2:test-rados-api-smithi131-37840-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e434)
2022-02-24T16:24:16.064 INFO:tasks.workunit.client.0.smithi131.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 37837 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2022-02-24T16:24:16.064 INFO:tasks.workunit.client.0.smithi131.stderr:++ true
2022-02-24T16:24:16.065 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-02-24T16:24:16.066 INFO:tasks.workunit.client.0.smithi131.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsPP.TierFlushDuringFl
2022-02-24T16:24:16.066 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-02-24T16:24:16.067 DEBUG:teuthology.orchestra.run.smithi131:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-02-24T16:24:16.329 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6e50212646f4f394e696f924e961e76f4719acaa/qa/tasks/workunit.py", line 135, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_6e50212646f4f394e696f924e961e76f4719acaa/qa/tasks/workunit.py", line 427, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_6b607b46862a6b148fb1608e09b31d367f094555/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi131 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=6e50212646f4f394e696f924e961e76f4719acaa TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
2022-02-24T16:24:16.426 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/organizations/ceph/?query=bc56e659c18e46069fb53816b130a57f

#6 Updated by Laura Flores almost 2 years ago

  • Status changed from New to Duplicate

Marking this one as the duplicate because the other Tracker has the PR attached to it.

#7 Updated by Laura Flores almost 2 years ago

  • Related to deleted (Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount)

#8 Updated by Laura Flores almost 2 years ago

  • Duplicates Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount added

#9 Updated by Laura Flores over 1 year ago

  • Status changed from Duplicate to New
  • Backport set to quincy

There are still occurrences of this type of failure in Quincy, which already includes the backport of #53855. So I am removing the duplicate status and re-opening this Tracker, since it seems to require a different fix.

/a/yuriw-2022-06-16_16:41:04-rados-wip-yuri6-testing-2022-06-16-0651-quincy-distro-default-smithi/6882364

2022-06-17T00:05:17.348 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:17.346+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:18.359 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:18.359+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:19.332 INFO:tasks.ceph.osd.1.smithi101.stderr:2022-06-17T00:05:19.332+0000 7fc37cb1e700 -1 osd.1 420 get_health_metrics reporting 27 slow ops, oldest is osd_op(client.5283.0:7687 218.12 218:4e3eb20c:test-rados-api-smithi101-108011-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected+supports_pool_eio e419)
2022-06-17T00:05:19.948 INFO:tasks.workunit.client.0.smithi101.stderr:++ cleanup
2022-06-17T00:05:19.948 INFO:tasks.workunit.client.0.smithi101.stderr:++ pkill -P 107805
2022-06-17T00:05:19.972 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-06-17T00:05:19.992 INFO:tasks.workunit.client.0.smithi101.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 108002 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2022-06-17T00:05:19.992 INFO:tasks.workunit.client.0.smithi101.stderr:++ true
2022-06-17T00:05:19.993 INFO:tasks.workunit.client.0.smithi101.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsPP.TierFlushDuringFlus
2022-06-17T00:05:19.994 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-06-17T00:05:19.995 DEBUG:teuthology.orchestra.run.smithi101:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-06-17T00:05:20.242 ERROR:teuthology.run_tasks:Saw exception from tasks.

#10 Updated by Laura Flores over 1 year ago

  • Duplicates deleted (Bug #53855: rados/test.sh hangs while running LibRadosTwoPoolsPP.ManifestFlushDupCount)

#11 Updated by Neha Ojha over 1 year ago

  • Assignee set to Myoungwon Oh

Hi Myoungwon Oh, can you please help take a look at this issue?

#12 Updated by Myoungwon Oh over 1 year ago

OK, I'll take a look.

#13 Updated by Myoungwon Oh over 1 year ago

I think this is the same issue as https://tracker.ceph.com/issues/53855.

#14 Updated by Laura Flores over 1 year ago

Myoungwon Oh wrote:

I think this is the same issue as https://tracker.ceph.com/issues/53855.

I thought so too, but the instance from https://tracker.ceph.com/issues/53294#note-9 (/a/yuriw-2022-06-16_16:41:04-rados-wip-yuri6-testing-2022-06-16-0651-quincy-distro-default-smithi/6882364) was caught on a Quincy test branch that contained a Quincy backport commit from #53855. That indicates to me that this might have a different root cause (or that the original fix hasn't fully stopped the issue).

#15 Updated by Myoungwon Oh over 1 year ago

I'll take a closer look.

#16 Updated by Myoungwon Oh over 1 year ago

https://github.com/ceph/ceph/pull/46866

I found that there is no reply when a message is sent with invalid pool information (e.g., 0:0000:foo:head), so I posted the PR above to return an error code early.
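(The fix follows a common request-handling pattern: validate the op's target up front and reply with an error, instead of silently dropping the request — a silent drop is exactly what leaves the client blocked and shows up as the endless `[tier-flush]` slow ops above. A minimal sketch of that pattern; the function and field names are illustrative, not Ceph's actual C++ OSD code:)

```python
ENOENT = 2  # "no such entity" errno, reused here for a missing pool

def do_tier_flush(op):
    """Stand-in for the normal tier-flush processing path."""
    return 0

def handle_op(op, known_pools):
    """Reject ops targeting a nonexistent pool before queueing them.

    Without an early check, an op with an invalid locator (e.g.
    0:0000:foo:head) is dropped with no reply, and the client blocks
    until the harness's 3h timeout kills the test with status 124.
    """
    if op.get("pool") not in known_pools:
        return -ENOENT  # send an error reply instead of going silent
    return do_tier_flush(op)
```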

#17 Updated by Neha Ojha over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46866

#18 Updated by Laura Flores over 1 year ago

Thank you Myoungwon!

#19 Updated by Laura Flores over 1 year ago

Potential Pacific occurrence? Although this one was caught around LibRadosTwoPoolsPP.CachePin rather than LibRadosTwoPoolsPP.TierFlushDuringFlush.

/a/yuriw-2022-07-05_17:38:43-rados-wip-yuri4-testing-2022-07-05-0719-pacific-distro-default-smithi/6915020

2022-07-06T02:26:57.748 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:57.747+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ cleanup
2022-07-06T02:26:58.757 INFO:tasks.workunit.client.0.smithi191.stderr:++ pkill -P 104256
2022-07-06T02:26:58.780 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-06T02:26:58.781 INFO:tasks.workunit.client.0.smithi191.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 104457 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2022-07-06T02:26:58.782 INFO:tasks.workunit.client.0.smithi191.stderr:++ true
2022-07-06T02:26:58.783 INFO:tasks.workunit.client.0.smithi191.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.CacheP
2022-07-06T02:26:58.783 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-06T02:26:58.785 DEBUG:teuthology.orchestra.run.smithi191:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-06T02:26:58.797 INFO:tasks.ceph.osd.5.smithi191.stderr:2022-07-06T02:26:58.796+0000 7f83aa443700 -1 osd.5 399 get_health_metrics reporting 15 slow ops, oldest is osd_op(client.4785.0:7419 214.f 214:f5e1fadd:test-rados-api-smithi191-104463-84::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e398)
2022-07-06T02:26:59.053 ERROR:teuthology.run_tasks:Saw exception from tasks.

#20 Updated by Laura Flores over 1 year ago

  • Backport changed from quincy to quincy,pacific

#21 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939431

2022-07-20T13:40:02.854 INFO:tasks.ceph.osd.4.smithi172.stderr:2022-07-20T13:40:02.846+0000 7ff835809700 -1 osd.4 402 get_health_metrics reporting 25 slow ops, oldest is osd_op(client.5196.0:7489 214.5 214:a1df339b:test-rados-api-smithi007-108562-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e401)
2022-07-20T13:40:03.824 INFO:tasks.workunit.client.0.smithi007.stderr:++ cleanup
2022-07-20T13:40:03.825 INFO:tasks.workunit.client.0.smithi007.stderr:++ pkill -P 108390
2022-07-20T13:40:03.841 DEBUG:teuthology.orchestra.run:got remote process result: 124
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 108558 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stderr:++ true
2022-07-20T13:40:03.843 INFO:tasks.workunit.client.0.smithi007.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsPP.Cache
2022-07-20T13:40:03.844 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-07-20T13:40:03.845 DEBUG:teuthology.orchestra.run.smithi007:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-07-20T13:40:03.903 INFO:tasks.ceph.osd.4.smithi172.stderr:2022-07-20T13:40:03.895+0000 7ff835809700 -1 osd.4 402 get_health_metrics reporting 25 slow ops, oldest is osd_op(client.5196.0:7489 214.5 214:a1df339b:test-rados-api-smithi007-108562-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e401)
2022-07-20T13:40:04.108 ERROR:teuthology.run_tasks:Saw exception from tasks.

#22 Updated by Neha Ojha over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#23 Updated by Backport Bot over 1 year ago

  • Copied to Backport #56655: quincy: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush added

#24 Updated by Backport Bot over 1 year ago

  • Copied to Backport #56656: pacific: rados/test.sh hangs while running LibRadosTwoPoolsPP.TierFlushDuringFlush added

#25 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
