Project

General

Profile

Bug #55001

rados/test.sh: Early exit right after LibRados global tests complete

Added by Laura Flores almost 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This failure was previously tracked in issue #50042, but it has come up enough that it warrants its own Tracker. See issue #50042 for earlier instances.

/a/yuriw-2022-03-19_14:37:23-rados-wip-yuri6-testing-2022-03-18-1104-distro-default-smithi/6747001

2022-03-19T19:56:53.724 INFO:tasks.workunit.client.0.smithi053.stdout:              api_tier_pp: [----------] Global test environment tear-down
2022-03-19T19:56:53.725 INFO:tasks.workunit.client.0.smithi053.stdout:              api_tier_pp: [==========] 77 tests from 4 test suites ran. (3861916 ms total)
2022-03-19T19:56:53.725 INFO:tasks.workunit.client.0.smithi053.stdout:              api_tier_pp: [  PASSED  ] 77 tests.
2022-03-19T19:56:53.726 INFO:tasks.workunit.client.0.smithi053.stderr:+ exit 1
2022-03-19T19:56:53.726 INFO:tasks.workunit.client.0.smithi053.stderr:+ cleanup
2022-03-19T19:56:53.726 INFO:tasks.workunit.client.0.smithi053.stderr:+ pkill -P 48582
2022-03-19T19:56:53.732 INFO:tasks.workunit.client.0.smithi053.stderr:+ true
2022-03-19T19:56:53.733 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-03-19T19:56:53.734 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...
2022-03-19T19:56:53.734 DEBUG:teuthology.orchestra.run.smithi053:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-03-19T19:56:53.776 INFO:tasks.ceph.osd.1.smithi053.stderr:2022-03-19T19:56:53.773+0000 ee74700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.1.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 1  (PID: 268905) UID: 0
2022-03-19T19:56:53.778 INFO:teuthology.orchestra.run.smithi053.stdout:{
2022-03-19T19:56:53.778 INFO:teuthology.orchestra.run.smithi053.stdout:    "success": "" 
2022-03-19T19:56:53.779 INFO:teuthology.orchestra.run.smithi053.stdout:}
2022-03-19T19:56:53.843 INFO:tasks.ceph.osd.5.smithi125.stderr:2022-03-19T19:56:53.840+0000 ee74700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.5.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 5  (PID: 107842) UID: 0
2022-03-19T19:56:53.861 INFO:teuthology.orchestra.run.smithi053.stderr:2022-03-19T19:56:53.858+0000 7f4aeaffd700  1 -- 172.21.15.53:0/1210551706 <== mon.1 v2:172.21.15.125:3300/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+162503 (secure 0 0 0) 0x7f4aec04fc60 con 0x7f4b040ae240
2022-03-19T19:56:53.862 INFO:teuthology.orchestra.run.smithi053.stderr:2022-03-19T19:56:53.858+0000 7f4aeaffd700 10 monclient: handle_mon_command_ack 1 [{"prefix": "get_command_descriptions"}]
2022-03-19T19:56:53.862 INFO:teuthology.orchestra.run.smithi053.stderr:2022-03-19T19:56:53.858+0000 7f4aeaffd700 10 monclient: _finish_command 1 = system:0
2022-03-19T19:56:53.912 INFO:teuthology.orchestra.run.smithi053.stdout:ERROR: (22) Invalid argument
2022-03-19T19:56:53.912 INFO:teuthology.orchestra.run.smithi053.stdout:op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
2022-03-19T19:56:53.925 DEBUG:teuthology.orchestra.run.smithi125:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_ops_in_flight
2022-03-19T19:56:53.944 INFO:tasks.ceph.osd.0.smithi053.stderr:2022-03-19T19:56:53.940+0000 ee74700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.0.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 0  (PID: 373503) UID: 0
2022-03-19T19:56:54.039 ERROR:teuthology.run_tasks:Saw exception from tasks.


Related issues

Related to RADOS - Bug #50042: rados/test.sh: api_watch_notify failures Resolved
Copied to RADOS - Backport #57029: pacific: rados/test.sh: Early exit right after LibRados global tests complete Resolved
Copied to RADOS - Backport #57030: quincy: rados/test.sh: Early exit right after LibRados global tests complete Resolved

History

#1 Updated by Laura Flores almost 2 years ago

  • Related to Bug #50042: rados/test.sh: api_watch_notify failures added

#2 Updated by Radoslaw Zarzynski almost 2 years ago

I think we can't be sure the timeout is because of api_tier_pp. To be conclusive we need to check who had the PID 48582.

#3 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-06-10_03:10:47-rados-wip-yuri4-testing-2022-06-09-1510-quincy-distro-default-smithi/6871955

Coredump on smithi060, but no evidence of any kind of assertion/segfault failures in the teuthology log.

#4 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-06-29_13:30:16-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6905499

#5 Updated by Matan Breizman over 1 year ago

/a/yuriw-2022-07-15_19:06:53-rados-wip-yuri-testing-2022-07-15-0950-octopus-distro-default-smithi/6932690

#6 Updated by Nitzan Mordechai over 1 year ago

We have coredump and the console_log showing:
smithi042.log:[ 852.382596] ceph_test_rados110223: segfault at 0 ip 00007f26e7643e8e sp 00007ffeef9a9460 error 6 in librados.so.2.0.0[7f26e75f7000+15a000]

checking that

#7 Updated by Nitzan Mordechai over 1 year ago

  • Assignee set to Nitzan Mordechai

#8 Updated by Nitzan Mordechai over 1 year ago

the core dump showing:

#0  librados::IoCtxImpl::put (this=0x0) at /usr/include/c++/8/bits/atomic_base.h:303
#1  _rados_ioctx_destroy (io=0x0) at /usr/src/debug/ceph-17.2.0-434.g718f6c9e.el8.x86_64/src/librados/librados_c.cc:1177
#2  0x000055f5ae16b101 in RadosTestEC::TearDown (this=0x7fcba8075c40) at /usr/src/debug/ceph-17.2.0-434.g718f6c9e.el8.x86_64/src/test/librados/TestCase.cc:188

There are 2 issues:
1. the Tests doesn't check that ioctx was initialized
2. the C api doesn't check that ioctx initilized before calling the ctx->put()

extern "C" void LIBRADOS_C_API_DEFAULT_F(rados_ioctx_destroy)(rados_ioctx_t io)
{
  tracepoint(librados, rados_ioctx_destroy_enter, io);
  librados::IoCtxImpl *ctx = (librados::IoCtxImpl *)io;
  ctx->put();
  tracepoint(librados, rados_ioctx_destroy_exit);
}

unlike the c++ api:

void librados::IoCtx::close()
{
  if (io_ctx_impl)
    io_ctx_impl->put();
  io_ctx_impl = 0;
}

#9 Updated by Nitzan Mordechai over 1 year ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47165

#10 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/yuriw-2022-07-22_03:30:40-rados-wip-yuri3-testing-2022-07-21-1604-distro-default-smithi/6943763/

#12 Updated by Laura Flores over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy,pacific

#13 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57029: pacific: rados/test.sh: Early exit right after LibRados global tests complete added

#14 Updated by Backport Bot over 1 year ago

  • Copied to Backport #57030: quincy: rados/test.sh: Early exit right after LibRados global tests complete added

#15 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#16 Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • Tags deleted (backport_processed)

Also available in: Atom PDF