Bug #40119

api_tier_pp hung causing a dead job

Added by David Zafman almost 5 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://pulpito.ceph.com/dzafman-2019-05-31_07:47:29-rados-wip-zafman-testing-distro-basic-smithi/3992631

2019-06-01T01:25:37.022 INFO:tasks.workunit.client.0.smithi019.stdout:              api_tier_pp: [  PASSED  ] 57 tests.
2019-06-01T04:15:36.987 INFO:tasks.workunit.client.0.smithi019.stderr:++ cleanup
2019-06-01T04:15:37.119 INFO:tasks.workunit.client.0.smithi019.stderr:++ pkill -P 12481
2019-06-01T04:15:37.120 INFO:tasks.workunit.client.0.smithi019.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 12494 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\""
2019-06-01T04:15:37.120 INFO:tasks.workunit.client.0.smithi019.stderr:++ true
2019-06-01T04:15:37.121 DEBUG:teuthology.orchestra.run:got remote process result: 124
2019-06-01T04:15:37.121 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
2019-06-01T04:15:37.121 INFO:teuthology.orchestra.run.smithi019:Running:
2019-06-01T04:15:37.121 INFO:teuthology.orchestra.run.smithi019:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-06-01T04:15:37.291 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-zafman-testing/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 205, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 437, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test rados/test.sh) on smithi019 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=d133d66dbb783f9527c36fef9fb02590476825db TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
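
The exit status 124 comes from the 'timeout 3h' wrapper visible in the command line above: GNU timeout exits with 124 when it has to kill the command, i.e. the workunit ran past the 3-hour limit because one of the parallel test binaries never finished. A minimal, hypothetical sketch of the runner structure that the log shows (simplified names, not the real qa/workunits/rados/test.sh):

#!/usr/bin/env bash
# Hypothetical reduction of the workunit runner seen in the log above,
# not the real qa/workunits/rados/test.sh.
set -ex

cleanup() {
    # Matches the "++ cleanup" / "++ pkill -P <pid>" lines in the log:
    # kill any test binaries still running under this shell.
    pkill -P $$ || true
}
trap cleanup EXIT

pids=()
for f in api_tier_pp api_watch_notify; do
    # Each test runs in its own bash, mirroring the
    # 'bash -o pipefail -exc "ceph_test_rados_$f ... | tee ..."' line in the log.
    bash -o pipefail -c "ceph_test_rados_$f 2>&1 | tee ceph_test_rados_$f.log" &
    pids+=($!)
done

# If any test binary hangs, this wait never returns; the outer
# 'timeout 3h .../test.sh' from teuthology kills the script after three hours
# and reports status 124, which surfaces as the CommandFailedError above.
for p in "${pids[@]}"; do
    wait "$p"
done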

Related issues

Related to RADOS - Bug #50042: rados/test.sh: api_watch_notify failures Resolved
Related to RADOS - Bug #45423: api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite Pending Backport

History

#1 Updated by Deepika Upadhyay almost 3 years ago

2021-04-06T17:50:38.226 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:38.222+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:39.191 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:39.189+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:40.183 INFO:tasks.ceph.osd.5.smithi052.stderr:2021-04-06T17:50:40.182+0000 7fc367c08700 -1 osd.5 1478 get_health_metrics reporting 10 slow ops, oldest is osd_op(client.4721.0:7282 212.19 212:9ec63101:test-rados-api-smithi049-38364-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e1477)
2021-04-06T17:50:40.808 INFO:tasks.workunit.client.0.smithi049.stderr:++ cleanup
2021-04-06T17:50:40.809 INFO:tasks.workunit.client.0.smithi049.stderr:++ pkill -P 38193
2021-04-06T17:50:40.838 DEBUG:teuthology.orchestra.run:got remote process result: 124
2021-04-06T17:50:40.840 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh: line 10: 38361 Terminated              bash -o pipefail -exc "ceph_test_rados_$f $color 2>&1 | tee ceph_test_rados_$ff.log | sed \"s/^/$r: /\"" 
2021-04-06T17:50:40.840 INFO:tasks.workunit.client.0.smithi049.stderr:++ true
2021-04-06T17:50:40.841 INFO:tasks.workunit.client.0.smithi049.stdout:              api_tier_pp: [       OK ] LibR
2021-04-06T17:50:40.841 INFO:tasks.workunit:Stopping ['rados/test.sh', 'rados/test_pool_quota.sh'] on client.0...
2021-04-06T17:50:40.843 DEBUG:teuthology.orchestra.run.smithi049:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-04-06T17:50:41.103 ERROR:teuthology.run_tasks:Saw exception from tasks.

/ceph/teuthology-archive/ideepika-2021-04-06_14:26:42-rados-wip-deepika-testing-2021-04-05-0643-distro-basic-smithi/6024540/teuthology.log
/ceph/teuthology-archive/ideepika-2021-04-06_14:26:42-rados-wip-deepika-testing-2021-04-05-0643-distro-basic-smithi/6024585/teuthology.log

#2 Updated by Brad Hubbard almost 3 years ago

  • Project changed from Messengers to RADOS

/a/bhubbard-2021-04-26_22:38:21-rados-master-distro-basic-smithi/6075940

In this instance the slow requests are on osd.2, and the first slow request is here.

2021-04-27T02:15:21.984+0000 14963700 20 slow request osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e629) initiated 2021-04-27T02:14:51.765087+0000 currently started
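
One way to follow that op through the rest of osd.2's log is to grep for its reqid; the log file name below is illustrative (use the osd.2 log from the job's archive):

# Trace the stuck tier-flush op by its reqid through osd.2's log
# (log name illustrative; adjust to the actual archive layout).
zgrep 'client.14804.0:7937' ceph-osd.2.log.gz | less
# The object itself can also be followed by name:
zgrep 'test-rados-api-smithi050-37476-88::foo' ceph-osd.2.log.gz | less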

Just prior to that slow request warning, we see that we are in 'start_dedup' (do_osd_ops->start_flush->start_dedup), and I believe these calls succeed for each of the chunks.

2021-04-27T02:14:52.114+0000 351f0700 10 osd.2 pg_epoch: 629 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] start_dedup oid: 215:3b953426:test-rados-api-smithi050-37476-88::foo:head tid: 47 target: 0:00000000:test-rados-api-smithi050-37476-88::377bcb682b88ae663bbaa748bcd3f97607e12e28:0 offset: 6855 length: 1337
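
If the per-chunk writes do complete, the chunk object, which is named by the fingerprint shown in the start_dedup line above, should also appear in the logs of the OSDs serving the chunk pool. A quick way to check, assuming the usual per-OSD log files (names illustrative):

# Look for the chunk object (named by its fingerprint) across all OSD logs to
# see whether the dedup writes were issued and acknowledged.
zgrep '377bcb682b88ae663bbaa748bcd3f97607e12e28' ceph-osd.*.log.gz | less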

From the start_dedup onwards we see the following sequence repeating, and the slow op never clears.

2021-04-27T02:39:39.932+0000 311e8700 20 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] do_op: op osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e630) v8
2021-04-27T02:39:39.932+0000 311e8700 10 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] do_op osd_op(client.14804.0:7937 215.1c 215:3b953426:test-rados-api-smithi050-37476-88::foo:head [tier-flush] snapc 0=[] RETRY=1 ondisk+retry+read+ignore_cache+known_if_redirected e630) v8 may_cache -> write-ordered flags ondisk+retry+read+ignore_cache+known_if_redirected
2021-04-27T02:39:39.933+0000 311e8700 10 osd.2 pg_epoch: 630 pg[215.1c( v 628'2 (0'0,628'2] local-lis/les=622/623 n=1 ec=622/622 lis/c=622/622 les/c/f=623/623/0 sis=622) [2,6] r=0 lpr=622 crt=628'2 lcod 626'1 mlcod 626'1 active+clean] wait_for_blocked_object 215:3b953426:test-rados-api-smithi050-37476-88::foo:head 0x10215180

Before each of these sequences, I see the following messenger events.

2021-04-27T02:39:39.927+0000 d706700  1 -- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 legacy=0x94b0150 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 52
2021-04-27T02:39:39.927+0000 d706700  1 -- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 legacy=0x94b0150 unknown :6800 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2021-04-27T02:39:39.927+0000 d706700  1 --1- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 0x94b0150 :6800 s=OPENED pgs=22 cs=1 l=1).handle_message read tag failed
2021-04-27T02:39:39.927+0000 d706700  1 --1- v1:172.21.15.50:6800/32413 >> v1:172.21.15.50:0/1070970930 conn(0xa1a73d0 0x94b0150 :6800 s=OPENED pgs=22 cs=1 l=1).fault on lossy channel, failing

The target here is client.14804, the client waiting for the tier-flush to complete.
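
To confirm that every retry really is preceded by one of these teardowns, the messenger events for that client can be interleaved with the op's retries; the address and reqid are taken from the lines above, and the log name is again illustrative:

# Interleave client.14804's connection faults with the retried tier-flush op,
# to check that each RETRY follows a 'fault on lossy channel' teardown.
zgrep -E 'v1:172.21.15.50:0/1070970930|client.14804.0:7937' ceph-osd.2.log.gz | less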

Still looking at this.

#3 Updated by Brad Hubbard almost 3 years ago

  • Related to Bug #50042: rados/test.sh: api_watch_notify failures added

#4 Updated by Brad Hubbard almost 3 years ago

  • Related to Bug #45423: api_tier_pp: [ FAILED ] LibRadosTwoPoolsPP.HitSetWrite added
