Bug #58130


LibRadosAio.SimpleWrite hang and pkill

Added by Laura Flores over 1 year ago. Updated 3 months ago.

Status: In Progress
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

A rados API test failed with exit status 1 even though the last global tests had run successfully.

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/teuthology.log

2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [ RUN      ] LibRadosTwoPoolsECPP.TrySetDedupTier
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [       OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (6148 ms)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (299040 ms total)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp:
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [==========] 72 tests from 4 test suites ran. (1828588 ms total)
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout:              api_tier_pp: [  PASSED  ] 72 tests.
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ exit 1
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ cleanup
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ pkill -P 120721
2022-11-30T02:58:42.317 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-30T02:58:42.317 INFO:tasks.workunit.client.0.smithi018.stderr:+ true
2022-11-30T02:58:42.318 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

...
2022-11-30T02:58:42.841 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.18:3301/0
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700  1 -- 172.21.15.18:0/1567830916 --> [v2:172.21.15.18:3301/0,v1:172.21.15.18:6790/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fb4b40af4c0 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.498+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 4 ==== mgrmap(e 4) v1 ==== 75039+0+0 (secure 0 0 0) 0x7fb4a002c070 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 5 ==== osd_map(1390..1390 src has 691..1390) v4 ==== 6026+0+0 (secure 0 0 0) 0x7fb4a007b610 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_dispatch 0x7fb4b4001550 osd_map(1390..1390 src has 691..1390) v4
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map got epochs [1390,1390] > 0
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700  3 client.23765.objecter handle_osd_map decoding full epoch 1390
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 20 client.23765.objecter dump_active .. 0 homeless
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 10 monclient: get_auth_request con 0x7fb49c060380 auth_method 0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_handle_connect 0x7fb49c060380
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700  1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 secure :-1 s=READY pgs=1646 cs=0 l=1 rev1=1 rx=0x7fb4b4069ac0 tx=0x7fb4a4009060).ready entity=mgr.4103 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-11-30T02:58:42.845 DEBUG:teuthology.orchestra.run.smithi018:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700  1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+162146 (secure 0 0 0) 0x7fb4a004cdb0 con 0x7fb4b4068c10
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-11-30T02:58:42.847 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: _finish_command 1 = system:0
2022-11-30T02:58:42.848 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):

  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 103, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 82, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 134, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 426, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/remote.py", line 525, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi018 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bebb9581d1d6dc759a2591366d38060cec2b4275 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'

There were no memory leaks detected by valgrind, but I did find this in the mon log around the time of the Traceback:

/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/remote/smithi018/log/ceph-mon.a.log.gz

2022-11-30T02:58:40.999+0000 1d45e700 20 mon.a@0(leader).mgrstat health checks:
{
    "PG_DEGRADED": {
        "severity": "HEALTH_WARN",
        "summary": {
            "message": "Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded",
            "count": 1
        },
        "detail": [
            {
                "message": "pg 71.3 is active+undersized+degraded, acting [5,2147483647,3]" 
            }
        ]
    },

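For context on the acting set in that health detail: 2147483647 is 2^31 - 1 (INT_MAX), which Ceph's CRUSH code uses as CRUSH_ITEM_NONE to mark an acting-set slot with no OSD mapped; that empty slot is why the PG is undersized+degraded. A quick illustrative check (the variable names are mine, not a Ceph API):

```shell
#!/usr/bin/env bash
# CRUSH_ITEM_NONE (INT_MAX) marks "no OSD mapped here" in an acting set.
CRUSH_ITEM_NONE=2147483647

# The acting set from the PG_DEGRADED detail above: pg 71.3, acting [5,2147483647,3]
acting=(5 2147483647 3)

for i in "${!acting[@]}"; do
    if [ "${acting[$i]}" -eq "$CRUSH_ITEM_NONE" ]; then
        echo "slot $i has no OSD mapped"
    fi
done
# prints: slot 1 has no OSD mapped
```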

Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #57751: LibRadosAio.SimpleWritePP hang and pkill (Resolved, Nitzan Mordechai)
Related to RADOS - Bug #64504: aio ops queued but never executed (New)
Related to RADOS - Bug #64670: LibRadosAioEC.RoundTrip2 hang and pkill (New)