Bug #40119
api_tier_pp hung causing a dead job
Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2019-06-01T01:25:37.022 INFO:tasks.workunit.client.0.smithi019.stdout: api_tier_pp: [ PASSED ] 57 tests.
2019-06-01T04:15:36.987 INFO:tasks.workunit.client.0.smithi019.stderr:++ cleanup
2019-06-01T04:15:37.119 INFO:tasks.workunit.client.0.smithi019.stderr:++ pkill -P 12481
2019-06-01T04:15:37.120 INFO:tasks.workunit.client.0.smithi019.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 12494 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2019-06-01T04:15:37.120 INFO:tasks.workunit.client.0.smithi019.stderr:++ true
2019-06-01T04:15:37.121 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-06-01T04:15:37.121 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
2019-06-01T04:15:37.121 INFO:teuthology.orchestra.run.smithi019:Running:
2019-06-01T04:15:37.121 INFO:teuthology.orchestra.run.smithi019:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-06-01T04:15:37.291 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi019 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d133d66dbb783f9527c36fef9fb02590476825db TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
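For reference, the "status 124" above is not an error from the test itself: the workunit is wrapped in `timeout 3h`, and GNU coreutils `timeout` exits with status 124 when it kills a command for exceeding its time limit. A quick demonstration:

```shell
# GNU coreutils `timeout` kills its command when the limit expires and
# itself exits with status 124 -- the same "status 124" teuthology
# reports when test.sh overruns its 3h budget.
status=0
timeout 1s sleep 5 || status=$?
echo "exit status: $status"   # prints "exit status: 124"
```

So the job "hung" in the sense that api_tier_pp never finished and the 3-hour wrapper fired, rather than the test failing an assertion.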
Related issues
History
#1 Updated by Deepika Upadhyay almost 3 years ago
2021-04-06T17:50:38.226 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:38.222+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:39.191 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:39.189+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:40.183 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:40.182+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:40.808 INFO:tasks.workunit.client.0.smithi049.stderr:++ cleanup
2021-04-06T17:50:40.809 INFO:tasks.workunit.client.0.smithi049.stderr:++ pkill -P 38193
2021-04-06T17:50:40.838 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-04-06T17:50:40.840 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 38361 Terminated bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2021-04-06T17:50:40.840 INFO:tasks.workunit.client.0.smithi049.stderr:++ true
2021-04-06T17:50:40.841 INFO:tasks.workunit.client.0.smithi049.stdout: api_tier_pp: [ OK ] LibR
2021-04-06T17:50:40.841 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
2021-04-06T17:50:40.843 DEBUG:teuthology.orchestra.run.smithi049:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-04-06T17:50:41.103 ERROR:teuthology.run_tasks:Saw exception from tasks.
/ceph/teuthology-archive/ideepika-2021-04-06_14:26:42-rados-wip-deepika-testing-2021-04-05-0643-distro-basic-smithi/6024540/teuthology.log
/ceph/teuthology-archive/ideepika-2021-04-06_14:26:42-rados-wip-deepika-testing-2021-04-05-0643-distro-basic-smithi/6024585/teuthology.log
#2 Updated by Brad Hubbard almost 3 years ago
- Project changed from Messengers to RADOS
/a/bhubbard-2021-04-26_22:38:21-rados-master-distro-basic-smithi/6075940
In this instance the slow requests are on osd.2, and the first slow request is here:
2021-04-27T02:15:21.984+0000 14963700 20 slow request osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e629) initiated 2021-04-27T02:14:51.765087+0000 currently started
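For anyone triaging similar logs, the key fields of that line can be pulled apart mechanically. This is an illustrative sketch only: the regex is an assumption based on the single sample line above, not a general grammar for Ceph `osd_op` log output.

```python
import re

# The "slow request" line quoted above, as one string.
line = ("2021-04-27T02:15:21.984+0000 14963700 20 slow request "
        "osd_op(client.14804.0:7937 215.1c "
        "215:3b953426:test-rados-api-smithi050-37476-88::foo:head "
        "[tier-flush] snapc 0=[] "
        "ondisk+read+ignore_cache+known_if_redirected e629) "
        "initiated 2021-04-27T02:14:51.765087+0000 currently started")

# Assumed layout: osd_op(<client>.0:<tid> <pg> <object> [<op>] ...)
m = re.search(r"osd_op\((client\.\d+)\.0:(\d+) (\S+) (\S+) \[([^\]]+)\]", line)
client, tid, pg, obj, op = m.groups()
print(client, tid, pg, op)  # client.14804 7937 215.1c tier-flush
```

So the stuck op is a `tier-flush` from client.14804 against the object `foo` in pg 215.1c, which matches the `start_dedup` activity below.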
Just before that we can see we are in 'start_dedup' (do_osd_ops -> start_flush -> start_dedup), and I believe these calls succeed for each of the chunks.
2021-04-27T02:14:52.114+0000 351f0700 10 osd.2 pg_epoch: 629 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] start_dedup oid: 215:3b953426:test-rados-api-smithi050-37476-88::foo:head tid: 47 target: 0:00000000:test-rados-api-smithi050-37476-88::377bcb682b88ae663bbaa748bcd3f97607e12e28:0 offset: 6855 length: 1337
From then on we see this repeating sequence and the slow op never clears.
2021-04-27T02:39:39.932+0000 311e8700 20 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] do_op: op osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e630) v8
2021-04-27T02:39:39.932+0000 311e8700 10 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] do_op osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e630) v8 may_cache -> write-ordered flags ondisk+retry+read+ignore_cache+known_if_redirected
2021-04-27T02:39:39.933+0000 311e8700 10 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] wait_for_blocked_object 215:3b953426:test-rados-api-smithi050-37476-88::foo:head 0x10215180
Before each of these repetitions I'm seeing the following messenger sequence.
2021-04-27T02:39:39.927+0000 d706700 1 -- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 legacy=0x94b0150 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 52
2021-04-27T02:39:39.927+0000 d706700 1 -- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 legacy=0x94b0150 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-04-27T02:39:39.927+0000 d706700 1 --1- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 0x94b0150 :6800 s=OPENED pgs=22 cs=1 l=1).handle_message read tag failed
2021-04-27T02:39:39.927+0000 d706700 1 --1- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 0x94b0150 :6800 s=OPENED pgs=22 cs=1 l=1).fault on lossy channel, failing
The target here is client.14804, the client waiting for the tier-flush to complete.
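The timestamps above also give a sense of scale: the op was initiated at 02:14:51 and was still being retried at 02:39:39, so by that point it had been blocked for roughly 25 minutes. A quick check of the arithmetic (timestamps copied from the log, offsets rewritten in ISO form for `fromisoformat`):

```python
from datetime import datetime

# "initiated" time from the slow-request line, and the timestamp of the
# later retry/wait_for_blocked_object sequence.
initiated = datetime.fromisoformat("2021-04-27T02:14:51.765087+00:00")
retried = datetime.fromisoformat("2021-04-27T02:39:39.932000+00:00")

stuck = retried - initiated
print(stuck)  # 0:24:48.166913 -- blocked for nearly 25 minutes
```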
Still looking at this.
#3 Updated by Brad Hubbard almost 3 years ago
- Related to Bug #50042: rados/test.sh: api_watch_notify failures added
#4 Updated by Brad Hubbard almost 3 years ago
- Related to Bug #45423: api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite added