Bug #53767 (closed)
qa/workunits/cls/test_cls_2pc_queue.sh: killing an osd during thrashing causes timeout
Description
Description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async objectstore/bluestore-comp-zstd rados tasks/rados_cls_all validater/valgrind}
Failure reason:
Command failed (workunit test cls/test_cls_2pc_queue.sh) on smithi023 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=923a78b748f3bb78722c7300318f17cf5730a2ce TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/cls/test_cls_2pc_queue.sh'
/a/yuriw-2021-12-23_16:50:03-rados-wip-yuri6-testing-2021-12-22-1410-distro-default-smithi/6582533
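Status 124 in the failure reason comes from coreutils timeout(1), which wraps the workunit with a 3-hour cap ("timeout 3h … test_cls_2pc_queue.sh" in the command line above); 124 is the exit code timeout uses when it has to kill the command for exceeding its limit. A minimal demonstration, using sleep as a stand-in for the workunit:

```python
import subprocess

# timeout(1) exits with 124 when the wrapped command is killed for
# exceeding its time limit -- the same status seen in the failure above,
# where the 3h cap on test_cls_2pc_queue.sh expired.
result = subprocess.run(["timeout", "0.2", "sleep", "5"])
print(result.returncode)  # -> 124
```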
2021-12-23T20:26:35.417 DEBUG:teuthology.orchestra.run.smithi023:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_ops_in_flight
2021-12-23T20:26:35.418 INFO:teuthology.orchestra.run.smithi023.stderr:nodeep-scrub is unset
2021-12-23T20:26:35.419 INFO:tasks.ceph.osd.6.smithi125.stderr:2021-12-23T20:26:35.419+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 6 (PID: 35347) UID: 0
2021-12-23T20:26:35.518 INFO:tasks.ceph.osd.6.smithi125.stderr:2021-12-23T20:26:35.520+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.6.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 6 (PID: 35347) UID: 0
2021-12-23T20:26:35.590 INFO:tasks.thrashosds.thrasher:Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 189, in wrapper
    return func(self)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 1412, in _do_thrash
    self.choose_action()()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 347, in kill_osd
    self.ceph_manager.kill_osd(osd)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_923a78b748f3bb78722c7300318f17cf5730a2ce/qa/tasks/ceph_manager.py", line 2977, in kill_osd
    self.ctx.daemons.get_daemon('osd', osd, self.cluster).stop()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/daemon/state.py", line 139, in stop
    run.wait([self.proc], timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_95a7d4799b562f3bbb5ec66107094963abd62fa1/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (50) after waiting for 300 seconds
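The MaxWhileTries at the bottom comes from teuthology's bounded retry loop: kill_osd signals the daemon and then polls for the process to exit, giving up after a fixed budget (50 tries over 300 seconds here). Under valgrind an OSD can take far longer than that to shut down, which is what turns a routine kill into this failure. A simplified sketch of that pattern (hypothetical names; not the actual teuthology implementation):

```python
import time

class MaxWhileTries(Exception):
    """Raised when the bounded wait loop runs out of tries."""

class safe_while:
    # Simplified stand-in for teuthology.contextutil.safe_while: each
    # call sleeps, counts one try, and raises MaxWhileTries once the
    # budget is exhausted.
    def __init__(self, sleep=0.01, tries=50):
        self.sleep, self.tries, self.counter = sleep, tries, 0

    def __call__(self):
        self.counter += 1
        if self.counter > self.tries:
            raise MaxWhileTries(
                f"reached maximum tries ({self.tries}) after waiting "
                f"for {self.tries * self.sleep} seconds")
        time.sleep(self.sleep)
        return True

def wait_for_exit(proc_finished, proceed):
    # Poll until the daemon process exits, or the try budget runs out.
    while not proc_finished():
        proceed()

# A daemon that never exits (e.g. an OSD stuck in a slow valgrind
# shutdown) exhausts the loop and surfaces as MaxWhileTries, exactly
# as in the traceback above.
try:
    wait_for_exit(lambda: False, safe_while(sleep=0.001, tries=5))
except MaxWhileTries as e:
    print(e)
```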
Updated by Laura Flores over 2 years ago
Same failed test and same traceback as reported above. Pasted below is another relevant part of the log that the first example also shares:
/a/yuriw-2022-01-12_21:37:22-rados-wip-yuri6-testing-2022-01-12-1131-distro-default-smithi/6611690
2022-01-13T02:56:15.155 INFO:tasks.ceph.osd.2.smithi016.stderr:2022-01-13T02:56:15.155+0000 ee6a700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.2.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 2 (PID: 411866) UID: 0
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout:{
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout: "ops": [],
2022-01-13T02:56:15.174 INFO:teuthology.orchestra.run.smithi016.stdout: "complaint_time": 30,
2022-01-13T02:56:15.175 INFO:teuthology.orchestra.run.smithi016.stdout: "num_blocked_ops": 0
2022-01-13T02:56:15.175 INFO:teuthology.orchestra.run.smithi016.stdout:}
2022-01-13T02:56:15.184 INFO:tasks.workunit.client.0.smithi016.stderr:2022-01-13T02:56:15.184+0000 7f32b248e6c0 1 < completed part key=_multipart_50GM8UG090B8.2~K03MI40GCRY8.2 size=134415
2022-01-13T02:56:15.185 DEBUG:teuthology.orchestra.run.smithi016:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok dump_historic_ops
2022-01-13T02:56:15.225 INFO:teuthology.orchestra.run.smithi016.stdout:osd.7: {}
2022-01-13T02:56:15.226 INFO:teuthology.orchestra.run.smithi016.stderr:Error ENXIO: problem getting command descriptions from osd.0
2022-01-13T02:56:15.226 INFO:teuthology.orchestra.run.smithi016.stderr:osd.1: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.2: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.3: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.4: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.227 INFO:teuthology.orchestra.run.smithi016.stderr:osd.5: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.228 INFO:teuthology.orchestra.run.smithi016.stderr:osd.6: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.228 INFO:teuthology.orchestra.run.smithi016.stderr:osd.7: osd_enable_op_tracker = 'true'
2022-01-13T02:56:15.229 DEBUG:teuthology.orchestra.run:got remote process result: 6
2022-01-13T02:56:15.230 INFO:tasks.thrashosds.thrasher:Failed to tell all osds, ignoring
Updated by Laura Flores over 2 years ago
- Related to Bug #52562: Thrashosds read error injection failed with error ENXIO added
Updated by Laura Flores about 2 years ago
/a/yuriw-2022-01-27_15:09:25-rados-wip-yuri6-testing-2022-01-26-1547-distro-default-smithi/6644093
2022-01-28T06:29:30.972 INFO:teuthology.orchestra.run.smithi070.stderr:osd.0: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.972 INFO:teuthology.orchestra.run.smithi070.stderr:osd.1: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:Error ENXIO: problem getting command descriptions from osd.2
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.3: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.4: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.973 INFO:teuthology.orchestra.run.smithi070.stderr:osd.5: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:teuthology.orchestra.run.smithi070.stderr:osd.6: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:teuthology.orchestra.run.smithi070.stderr:osd.7: osd_enable_op_tracker = 'false'
2022-01-28T06:29:30.974 INFO:tasks.thrashosds.thrasher:Failed to tell all osds, ignoring
Updated by Laura Flores about 2 years ago
/a/benhanokh-2022-01-26_21:12:05-rados-WIP_GBH_NCB_new_alloc_map_A6-distro-basic-smithi/6642122
Updated by Kamoltat (Junior) Sirivadhna about 2 years ago
/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6718855
Updated by Radoslaw Zarzynski about 2 years ago
Hmm, the last recurrence (/a/yuriw-2022-03-04_00:56:58-rados-wip-yuri4-testing-2022-03-03-1448-distro-default-smithi/6718855) actually seems to be a dead job. Is there something I'm missing here?
Updated by Laura Flores over 1 year ago
- Status changed from New to Duplicate
Same failure on test_cls_2pc_queue.sh, but this one came with remote logs. I suspect this is a duplicate of #55809.
/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939513
<error>
<unique>0x105450d</unique>
<tid>1</tid>
<threadname>ceph-mon</threadname>
<kind>Leak_IndirectlyLost</kind>
<xwhat>
<text>24 bytes in 1 blocks are indirectly lost in loss record 6 of 33</text>
<leakedbytes>24</leakedbytes>
<leakedblocks>1</leakedblocks>
</xwhat>
<stack>
<frame>
<ip>0x4C38B6F</ip>
<obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
<fn>operator new[](unsigned long)</fn>
<dir>/builddir/build/BUILD/valgrind-3.19.0/coregrind/m_replacemalloc</dir>
<file>vg_replace_malloc.c</file>
<line>640</line>
</frame>
<frame>
<ip>0x543382B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ceph::buffer::v15_2_0::ptr_node::create_hypercombined(ceph::unique_leakable_ptr<ceph::buffer::v15_2_0::raw>)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/common</dir>
<file>buffer.cc</file>
<line>2222</line>
</frame>
<frame>
<ip>0x53DD153</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>create</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/include</dir>
<file>buffer.h</file>
<line>376</line>
</frame>
<frame>
<ip>0x53DD153</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::read_frame()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>1063</line>
</frame>
<frame>
<ip>0x53E153B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::run_continuation(Ct<ProtocolV2>&)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>47</line>
</frame>
<frame>
<ip>0x53E171E</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>ProtocolV2::read_event()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>ProtocolV2.cc</file>
<line>479</line>
</frame>
<frame>
<ip>0x53A975B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>AsyncConnection::process()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>AsyncConnection.cc</file>
<line>464</line>
</frame>
<frame>
<ip>0x54043E5</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>Event.cc</file>
<line>449</line>
</frame>
<frame>
<ip>0x540A06B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>operator()</fn>
<dir>/usr/src/debug/ceph-16.2.9-508.gaf36a7f8.el8.x86_64/src/msg/async</dir>
<file>Stack.cc</file>
<line>53</line>
</frame>
<frame>
<ip>0x540A06B</ip>
<obj>/usr/lib64/ceph/libceph-common.so.2</obj>
<fn>std::_Function_handler<void (), NetworkStack::add_thread(unsigned int)::{lambda()#1}>::_M_invoke(std::_Any_data const&)</fn>
<dir>/usr/include/c++/8/bits</dir>
<file>std_function.h</file>
<line>297</line>
</frame>
<frame>
<ip>0x10527BA2</ip>
<obj>/usr/lib64/libstdc++.so.6.0.25</obj>
</frame>
<frame>
<ip>0xFB7B1C9</ip>
<obj>/usr/lib64/libpthread-2.28.so</obj>
<fn>start_thread</fn>
</frame>
<frame>
<ip>0x10DCDDD2</ip>
<obj>/usr/lib64/libc-2.28.so</obj>
<fn>clone</fn>
</frame>
</stack>
</error>
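Reports like the one above can be pulled out of the valgrind XML logs mechanically with nothing beyond the standard library. A quick sketch (in a full log the <error> elements sit under a <valgrindoutput> root; here we parse just a trimmed slice of the fragment pasted above):

```python
import xml.etree.ElementTree as ET

# Minimal slice of the valgrind XML report pasted above, keeping only
# the fields this sketch reads.
report = """
<error>
  <kind>Leak_IndirectlyLost</kind>
  <xwhat>
    <text>24 bytes in 1 blocks are indirectly lost in loss record 6 of 33</text>
    <leakedbytes>24</leakedbytes>
    <leakedblocks>1</leakedblocks>
  </xwhat>
</error>
"""

error = ET.fromstring(report)
kind = error.findtext("kind")
leaked = int(error.findtext("xwhat/leakedbytes"))
print(kind, leaked)  # -> Leak_IndirectlyLost 24
```

The same findtext() calls work per-<error> when iterating a whole log with root.iter("error"), which is handy for deciding whether a run tripped on the mon.c leak tracked in #55809 or on something new.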
Updated by Laura Flores over 1 year ago
- Related to deleted (Bug #52562: Thrashosds read error injection failed with error ENXIO)
Updated by Laura Flores over 1 year ago
- Is duplicate of Bug #55809: "Leak_IndirectlyLost" valgrind report on mon.c added
Updated by Nitzan Mordechai 6 months ago
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441321
/a/yuriw-2023-10-30_15:34:36-rados-wip-yuri10-testing-2023-10-27-0804-distro-default-smithi/7441167