Bug #58130 (open): LibRadosAio.SimpleWrite hang and pkill
% Done: 0%
Source:
Tags:
Backport: pacific,quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
A rados API test run failed during cleanup even though every test in the final suites had passed:
/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/teuthology.log
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ RUN ] LibRadosTwoPoolsECPP.TrySetDedupTier
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ OK ] LibRadosTwoPoolsECPP.TrySetDedupTier (6148 ms)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [----------] 22 tests from LibRadosTwoPoolsECPP (299040 ms total)
2022-11-30T02:58:42.308 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp:
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [----------] Global test environment tear-down
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [==========] 72 tests from 4 test suites ran. (1828588 ms total)
2022-11-30T02:58:42.309 INFO:tasks.workunit.client.0.smithi018.stdout: api_tier_pp: [ PASSED ] 72 tests.
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ exit 1
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ cleanup
2022-11-30T02:58:42.310 INFO:tasks.workunit.client.0.smithi018.stderr:+ pkill -P 120721
2022-11-30T02:58:42.317 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-11-30T02:58:42.317 INFO:tasks.workunit.client.0.smithi018.stderr:+ true
2022-11-30T02:58:42.318 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
...
2022-11-30T02:58:42.841 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.18:3301/0
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.438+0000 7fb4ba7a0700 1 -- 172.21.15.18:0/1567830916 --> [v2:172.21.15.18:3301/0,v1:172.21.15.18:6790/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7fb4b40af4c0 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.498+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 4 ==== mgrmap(e 4) v1 ==== 75039+0+0 (secure 0 0 0) 0x7fb4a002c070 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=NONE pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0).connect
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 5 ==== osd_map(1390..1390 src has 691..1390) v4 ==== 6026+0+0 (secure 0 0 0) 0x7fb4a007b610 con 0x7fb4b4068c10
2022-11-30T02:58:42.842 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_dispatch 0x7fb4b4001550 osd_map(1390..1390 src has 691..1390) v4
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 3 client.23765.objecter handle_osd_map got epochs [1390,1390] > 0
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 3 client.23765.objecter handle_osd_map decoding full epoch 1390
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 20 client.23765.objecter dump_active .. 0 homeless
2022-11-30T02:58:42.843 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 10 monclient: get_auth_request con 0x7fb49c060380 auth_method 0
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb497fff700 10 client.23765.objecter ms_handle_connect 0x7fb49c060380
2022-11-30T02:58:42.844 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.499+0000 7fb4b37fe700 1 --2- 172.21.15.18:0/1567830916 >> v2:172.21.15.37:6800/108913 conn(0x7fb49c060380 0x7fb49c062840 secure :-1 s=READY pgs=1646 cs=0 l=1 rev1=1 rx=0x7fb4b4069ac0 tx=0x7fb4a4009060).ready entity=mgr.4103 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-11-30T02:58:42.845 DEBUG:teuthology.orchestra.run.smithi018:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph -- tell 'osd.*' injectargs --osd_enable_op_tracker=false
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 1 -- 172.21.15.18:0/1567830916 <== mon.2 v2:172.21.15.18:3301/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0 v0) v1 ==== 72+0+162146 (secure 0 0 0) 0x7fb4a004cdb0 con 0x7fb4b4068c10
2022-11-30T02:58:42.846 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-11-30T02:58:42.847 INFO:teuthology.orchestra.run.smithi018.stderr:2022-11-30T02:58:42.773+0000 7fb497fff700 10 monclient: _finish_command 1 = system:0
2022-11-30T02:58:42.848 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 103, in run_tasks
manager = run_one_task(taskname, ctx=ctx, config=config)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/run_tasks.py", line 82, in run_one_task
return task(**kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 134, in task
coverage_and_limits=not config.get('no_coverage_and_limits', None))
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 84, in __exit__
for result in self:
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 98, in __next__
resurrect_traceback(result)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 30, in resurrect_traceback
raise exc.exc_info[1]
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/parallel.py", line 23, in capture_traceback
return func(*args, **kwargs)
File "/home/teuthworker/src/github.com_ceph_ceph-c_bebb9581d1d6dc759a2591366d38060cec2b4275/qa/tasks/workunit.py", line 426, in _run_tests
label="workunit test {workunit}".format(workunit=workunit)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/remote.py", line 525, in run
r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 455, in run
r.wait()
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 161, in wait
self._raise_for_status()
File "/home/teuthworker/src/git.ceph.com_teuthology_4da97cf64e542f347ec47b7bdbe5eca99759f9b7/teuthology/orchestra/run.py", line 183, in _raise_for_status
node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test rados/test.sh) on smithi018 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bebb9581d1d6dc759a2591366d38060cec2b4275 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 ALLOW_TIMEOUTS=1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/rados/test.sh'
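The `+ exit 1` / `+ cleanup` / `+ pkill -P 120721` / `+ true` sequence in the stderr trace above matches the usual workunit trap pattern: a hung test gets pkilled, its nonzero status propagates through the script's exit, and the EXIT trap then reaps any stray children. A minimal sketch of that pattern (illustrative only, not the literal qa/workunits/rados/test.sh):

```shell
#!/bin/sh
# Sketch of the cleanup/trap pattern seen in the stderr trace (names are
# illustrative assumptions, not copied from rados/test.sh).
cleanup() {
    # Kill any test processes still parented to this script. pkill exits 1
    # when nothing matches, so '|| true' keeps that from clobbering the
    # script's real exit status -- hence the '+ true' line in the log.
    pkill -P $$ || true
}
trap cleanup EXIT

# Simulate the failure mode: one pkilled/hung test returns nonzero, so the
# script ends with status 1 even though every gtest suite that ran to
# completion printed "[  PASSED  ]".
sh -c 'exit 1' || script_status=$?
echo "script_status=${script_status}"
```

The net effect is exactly what the log shows: gtest reports 72 passed tests, yet teuthology sees exit status 1 from the wrapper.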
Valgrind detected no memory leaks, but the mon log contains this around the time of the traceback:
/a/yuriw-2022-11-29_22:29:58-rados-wip-yuri10-testing-2022-11-29-1005-pacific-distro-default-smithi/7097464/remote/smithi018/log/ceph-mon.a.log.gz
2022-11-30T02:58:40.999+0000 1d45e700 20 mon.a@0(leader).mgrstat health checks:
{
"PG_DEGRADED": {
"severity": "HEALTH_WARN",
"summary": {
"message": "Degraded data redundancy: 1/5 objects degraded (20.000%), 1 pg degraded",
"count": 1
},
"detail": [
{
"message": "pg 71.3 is active+undersized+degraded, acting [5,2147483647,3]"
}
]
},
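For reference, the 2147483647 in the acting set above is not a real OSD id: it is CRUSH_ITEM_NONE (0x7fffffff), the placeholder Ceph prints when no OSD is mapped to a slot. A small sketch of how to read it (the variable names are illustrative):

```python
# 0x7fffffff is CRUSH_ITEM_NONE in Ceph (src/crush/crush.h): the value
# printed for a pg slot that has no OSD mapped to it.
CRUSH_ITEM_NONE = 0x7fffffff

acting = [5, 2147483647, 3]  # from "pg 71.3 is active+undersized+degraded"
unmapped = [slot for slot, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]
print(unmapped)  # [1] -> shard 1 of the EC pg has no OSD, hence undersized
```

With one of three shards unmapped, 1/5 objects degraded and the PG_DEGRADED warning are consistent with the acting set shown.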