Bug #53767 (closed)
qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout
Description
Description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-comp-zstd rados tasks/rados_cls_all validater/valgrind}
Failure reason:
Command failed (workunit test cls/test_cls_2pc_queue.sh) on smithi023 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=923a78b748f3bb78722c7300318f17cf5730a2ce TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/cls/test_cls_2pc_queue.sh'
/a/yuriw-2021-12-23_16:50:03-rados-wip-yuri6-testing-2021-12-22-1410-distro-default-smithi/6582533
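Status 124 in the failure reason comes from coreutils timeout(1), which wraps the workunit with a 3-hour cap ("timeout 3h … test_cls_2pc_queue.sh" in the command line above); 124 is the exit code timeout uses when it has to kill the command for exceeding its limit. A minimal demonstration, using sleep as a stand-in for the workunit:

```python
import subprocess

# timeout(1) exits with 124 when the wrapped command is killed for
# exceeding its time limit -- the same status seen in the failure above,
# where the 3h cap on test_cls_2pc_queue.sh expired.
result = subprocess.run(["timeout", "0.2", "sleep", "5"])
print(result.returncode)  # -> 124
```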
2021-12-23T20:26:35.417 DEBUG:teuthology.orchestra.run.smithi023:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2021-12-23T20:26:35.418 INFO:teuthology.orchestra.run.smithi023.stderr:nodeep-scrub is unset
2021-12-23T20:26:35.419 INFO:tasks.ceph.osd.6.smithi125.stderr:2021-12-23T20:26:35.419+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 6 (PID: 35347) UID: 0
2021-12-23T20:26:35.518 INFO:tasks.ceph.osd.6.smithi125.stderr:2021-12-23T20:26:35.520+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 6 (PID: 35347) UID: 0
2021-12-23T20:26:35.590 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 347, in kill_osd
    self.ceph_manager.kill_osd(osd)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 2977, in kill_osd
    self.ctx.daemons.get_daemon('osd', osd, self.cluster).stop()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
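The MaxWhileTries at the bottom comes from teuthology's bounded retry loop: kill_osd signals the daemon and then polls for the process to exit, giving up after a fixed budget (50 tries over 300 seconds here). Under valgrind an OSD can take far longer than that to shut down, which is what turns a routine kill into this failure. A simplified sketch of that pattern (hypothetical names; not the actual teuthology implementation):

```python
import time

class MaxWhileTries(Exception):
    """Raised when the bounded wait loop runs out of tries."""

class safe_while:
    # Simplified stand-in for teuthology.contextutil.safe_while: each
    # call sleeps, counts one try, and raises MaxWhileTries once the
    # budget is exhausted.
    def __init__(self, sleep=0.01, tries=50):
        self.sleep, self.tries, self.counter = sleep, tries, 0

    def __call__(self):
        self.counter += 1
        if self.counter > self.tries:
            raise MaxWhileTries(
                f"reached maximum tries ({self.tries}) after waiting "
                f"for {self.tries * self.sleep} seconds")
        time.sleep(self.sleep)
        return True

def wait_for_exit(proc_finished, proceed):
    # Poll until the daemon process exits, or the try budget runs out.
    while not proc_finished():
        proceed()

# A daemon that never exits (e.g. an OSD stuck in a slow valgrind
# shutdown) exhausts the loop and surfaces as MaxWhileTries, exactly
# as in the traceback above.
try:
    wait_for_exit(lambda: False, safe_while(sleep=0.001, tries=5))
except MaxWhileTries as e:
    print(e)
```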
Updated by Laura Flores over 2 years ago
Same failed test and same traceback as reported above. Pasted below is another relevant part of the log that the first example also shares:
/a/yuriw-2022-01-12_21:37:22-rados-wip-yuri6-testing-2022-01-12-1131-distro-default-smithi/6611690
2022-01-13T02:56:15.155 INFO:tasks.ceph.osd.2.smithi016.stderr:2022-01-13T02:56:15.155+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.2.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 2 (PID: 411866) UID: 0
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout:{
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout: "ops": [],
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout: "complaint_time": 30,
2022-01-13T02:56:15.175 INFO:teuthology.orchestra.run.smithi016.stdout: "num_blocked_ops": 0
2022-01-13T02:56:15.175 INFO:teuthology.orchestra.run.smithi016.stdout:}
2022-01-13T02:56:15.184 INFO:tasks.workunit.client.0.smithi016.stderr:2022-01-13T02:56:15.184+0000 7f32b248e6c0 1 < completed part key=_multipart_50GM8UG090B8.2~K03MI40GCRY8.2 size=134415
2022-01-13T02:56:15.185 DEBUG:teuthology.orchestra.run.smithi016:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_historic_ops
2022-01-13T02:56:15.225 INFO:teuthology.orchestra.run.smithi016.stdout:osd.7: {}
2022-01-13T02:56:15.226 INFO:teuthology.orchestra.run.smithi016.stderr:Error ENXIO: problem getting command descriptions from osd.0
2022-01-13T02:56:15.226 INFO:teuthology.orchestra.run.smithi016.stderr:osd.1: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.2: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.3: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.4: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.5: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.228 INFO:teuthology.orchestra.run.smithi016.stderr:osd.6: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.228 INFO:teuthology.orchestra.run.smithi016.stderr:osd.7: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.229 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-01-13T02:56:15.230 INFO:tasks.thrashosds.thrasher:Failed to tell all osds, ignoring
Updated by Laura Flores over 2 years ago
- Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-01-27_15:09:25-rados-wip-yuri6-testing-2022-01-26-1547-distro-default-smithi/6644093
2022-01-28T06:29:30.972 INFO:teuthology.orchestra.run.smithi070.stderr:osd.0: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.972 INFO:teuthology.orchestra.run.smithi070.stderr:osd.1: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:Error ENXIO: problem getting command descriptions from osd.2
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.3: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.4: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.5: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:teuthology.orchestra.run.smithi070.stderr:osd.6: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:teuthology.orchestra.run.smithi070.stderr:osd.7: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:tasks.thrashosds.thrasher:Failed to tell all osds, ignoring
Updated by Laura Flores about 2 years ago
/a/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642122
Updated by Kamoltat (Junior) Sirivadhna about 2 years ago
/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6718855
Updated by Radoslaw Zarzynski about 2 years ago
Hmm, the last recurrence (/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6718855) actually seems to be a dead job. Is there something I'm missing here?
Updated by Laura Flores over 1 year ago
- Status changed from New to Duplicate
Same failure on test_cls_2pc_queue.sh, but this one came with remote logs. I suspect this is a duplicate of #55809.
/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939513
<error>
<unique>0x105450d</unique>
<tid>1</tid>
<threadname>ceph-mon</threadname>
<kind>Leak_IndirectlyLost</kind>
<xwhat>
<text>24 bytes in 1 blocks are indirectly lost in loss record 6 of 33</text>
<leakedbytes>24</leakedbytes>
<leakedblocks>1</leakedblocks>
</xwhat>
<stack>
<frame>
<ip>0x4C38B6F</ip>
<obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
<fn>operator new[](unsigned long)</fn>
<dir>/builddir/build/BUILD/valgrind-3.19.0/coregrind/m_replacemalloc</dir>
<file>vg_replace_malloc.c</file>
<line>640</line>
</frame>
<frame>
<ip>0x543382B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ceph::buffer::v15_2_0::ptr_node::create_hypercombined(ceph::unique_leakable_ptr<ceph::buffer::v15_2_0::raw>)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/common</dir>
<file>buffer.cc</file>
<line>2222</line>
</frame>
<frame>
<ip>0x53DD153</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>create</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/include</dir>
<file>buffer.h</file>
<line>376</line>
</frame>
<frame>
<ip>0x53DD153</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::read_frame()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>1063</line>
</frame>
<frame>
<ip>0x53E153B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::run_continuation(Ct<ProtocolV2>&)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>47</line>
</frame>
<frame>
<ip>0x53E171E</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::read_event()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>479</line>
</frame>
<frame>
<ip>0x53A975B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>AsyncConnection::process()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>AsyncConnection.cc</file>
<line>464</line>
</frame>
<frame>
<ip>0x54043E5</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>Event.cc</file>
<line>449</line>
</frame>
<frame>
<ip>0x540A06B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>operator()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>Stack.cc</file>
<line>53</line>
</frame>
<frame>
<ip>0x540A06B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&)</fn>
<dir>/usr/include/c++/8/bits</dir>
<file>std_function.h</file>
<line>297</line>
</frame>
<frame>
<ip>0x10527BA2</ip>
<obj>/usr/lib64/libstdc++.so.6.0.25</obj>
</frame>
<frame>
<ip>0xFB7B1C9</ip>
<obj>/usr/lib64/libpthread-2.28.so</obj>
<fn>start_thread</fn>
</frame>
<frame>
<ip>0x10DCDDD2</ip>
<obj>/usr/lib64/libc-2.28.so</obj>
<fn>clone</fn>
</frame>
</stack>
</error>
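Reports like the one above can be pulled out of the valgrind XML logs mechanically with nothing beyond the standard library. A quick sketch (in a full log the <error> elements sit under a <valgrindoutput> root; here we parse just a trimmed slice of the fragment pasted above):

```python
import xml.etree.ElementTree as ET

# Minimal slice of the valgrind XML report pasted above, keeping only
# the fields this sketch reads.
report = """
<error>
  <kind>Leak_IndirectlyLost</kind>
  <xwhat>
    <text>24 bytes in 1 blocks are indirectly lost in loss record 6 of 33</text>
    <leakedbytes>24</leakedbytes>
    <leakedblocks>1</leakedblocks>
  </xwhat>
</error>
"""

error = ET.fromstring(report)
kind = error.findtext("kind")
leaked = int(error.findtext("xwhat/leakedbytes"))
print(kind, leaked)  # -> Leak_IndirectlyLost 24
```

The same findtext() calls work per-<error> when iterating a whole log with root.iter("error"), which is handy for deciding whether a run tripped on the mon.c leak tracked in #55809 or on something new.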
Updated by Laura Flores over 1 year ago
- Related to deleted (Bug #52562: Thrashosds read error injection failed with error ENXIO)
Updated by Laura Flores over 1 year ago
- Is duplicate of Bug #55809: "Leak_IndirectlyLost" valgrind report on mon.c added
Updated by Nitzan Mordechai 6 months ago
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441321
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441167