Bug #58130


LibRadosAio.SimpleWrite hang and pkill

Added by Laura Flores over 1 year ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A rados API test failed with exit status 1 even though the last global tests had run successfully.

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/teuthology.log

2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TrySetDedupTier
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (6148 ms)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (299040 ms total)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp:
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [==========] 72 tests from 4 test suites ran. (1828588 ms total)
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [  PASSED  ] 72 tests.
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ exit 1
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ cleanup
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ pkill -P 120721
2022-11-30T02:58:42.317 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-30T02:58:42.317 INFO:tasks.workunit.client.0.smithi018.stderr:+ true
2022-11-30T02:58:42.318 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

...
2022-11-30T02:58:42.841 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.18:3301/0
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700  1 -- 172.21.15.18:0/1567830916 --> [v2:172.21.15.18:3301/0,v1:172.21.15.18:6790/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fb4b40af4c0 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.498+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 4 ==== mgrmap(e 4) v1 ==== 75039+0+0 (secure 0 0 0) 0x7fb4a002c070 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 5 ==== osd_map(1390..1390 src has 691..1390) v4 ==== 6026+0+0 (secure 0 0 0) 0x7fb4a007b610 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_dispatch 0x7fb4b4001550 osd_map(1390..1390 src has 691..1390) v4
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map got epochs [1390,1390] > 0
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map decoding full epoch 1390
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 20 client.23765.objecter dump_active .. 0 homeless
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 10 monclient: get_auth_request con 0x7fb49c060380 auth_method 0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_handle_connect 0x7fb49c060380
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 secure :-1 s=READY pgs=1646 cs=0 l=1 rev1=1 rx=0x7fb4b4069ac0 tx=0x7fb4a4009060).ready entity=mgr.4103 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-11-30T02:58:42.845 DEBUG:teuthology.orchestra.run.smithi018:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+162146 (secure 0 0 0) 0x7fb4a004cdb0 con 0x7fb4b4068c10
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-11-30T02:58:42.847 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: _finish_command 1 = system:0
2022-11-30T02:58:42.848 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 426, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi018 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bebb9581d1d6dc759a2591366d38060cec2b4275 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'

There were no memory leaks detected by valgrind, but I did find this in the mon log around the time of the Traceback:

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/remote/smithi018/log/ceph-mon.a.log.gz

2022-11-30T02:58:40.999+0000 1d45e700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 71.3 is active+undersized+degraded, acting [5,2147483647,3]" 
            }
        ]
    },

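For context on the acting set in that health detail: 2147483647 is 2^31 - 1 (INT_MAX), which Ceph's CRUSH code uses as CRUSH_ITEM_NONE to mark an acting-set slot with no OSD mapped; that empty slot is why the PG is undersized+degraded. A quick illustrative check (the variable names are mine, not a Ceph API):

```shell
#!/usr/bin/env bash
# CRUSH_ITEM_NONE (INT_MAX) marks "no OSD mapped here" in an acting set.
CRUSH_ITEM_NONE=2147483647

# The acting set from the PG_DEGRADED detail above: pg 71.3, acting [5,2147483647,3]
acting=(5 2147483647 3)

for i in "${!acting[@]}"; do
    if [ "${acting[$i]}" -eq "$CRUSH_ITEM_NONE" ]; then
        echo "slot $i has no OSD mapped"
    fi
done
# prints: slot 1 has no OSD mapped
```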

Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #57751: LibRadosAio.SimpleWritePP hang and pkill (Resolved, Nitzan Mordechai)
Related to RADOS - Bug #64504: aio ops queued but never executed (New)
Related to RADOS - Bug #64670: LibRadosAioEC.RoundTrip2 hang and pkill (New)