Bug #64726 (open): LibRadosAioEC.MultiWritePP hang and pkill

Added by Laura Flores about 2 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: reef
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

/a/yuriw-2024-03-04_20:52:58-rados-reef-release-distro-default-smithi/7581519

2024-03-04T22:46:09.484 INFO:tasks.workunit.client.0.smithi081.stdout:               api_aio_pp: [ RUN      ] LibRadosAioEC.MultiWritePP
2024-03-04T22:46:09.485 INFO:tasks.workunit.client.0.smithi081.stderr:bash: line 1: 40556 Alarm clock             ceph_test_rados_api_aio_pp 2>&1
2024-03-04T22:46:09.485 INFO:tasks.workunit.client.0.smithi081.stderr:     40559 Done                    | tee ceph_test_rados_api_aio_pp.log
2024-03-04T22:46:09.485 INFO:tasks.workunit.client.0.smithi081.stderr:     40560 Done                    | sed "s/^/               api_aio_pp: /" 
2024-03-04T23:12:14.647 INFO:tasks.ceph.mon.b.smithi143.stderr:==00:01:19:34.643 34496==
2024-03-04T23:12:14.647 INFO:tasks.ceph.mon.b.smithi143.stderr:==00:01:19:34.643 34496== Exit program on first error (--exit-on-first-error=yes)
2024-03-04T23:12:14.747 INFO:tasks.ceph.mon.b.smithi143.stderr:daemon-helper: command failed with exit status 42
2024-03-05T03:54:47.728 INFO:tasks.workunit.client.0.smithi081.stderr:++ cleanup
2024-03-05T03:54:47.778 INFO:tasks.workunit.client.0.smithi081.stderr:++ pkill -P 40463
2024-03-05T03:54:47.828 DEBUG:teuthology.orchestra.run:got remote process result: 124
2024-03-05T03:54:47.828 INFO:tasks.workunit.client.0.smithi081.stderr:++ true
2024-03-05T03:54:47.829 INFO:tasks.workunit:Stopping ['rados/test.sh'] on client.0...

Actions #1

Updated by Laura Flores about 2 months ago

  • Related to Bug #58130: LibRadosAio.SimpleWrite hang and pkill added
Actions #2

Updated by Nitzan Mordechai about 2 months ago

The last operation LibRadosAioEC.MultiWritePP tries to perform is a write to the oid_MultiWritePP_ object:

ceph-client.admin.40556.log.gz:2024-03-04T22:06:21.095+0000 7f190bfff640 20 client.15906.objecter 1    188.2927bfdf    osd.7    oid_MultiWritePP_    [write 0~128 in=128b]
ceph-client.admin.40556.log.gz:2024-03-04T22:09:14.321+0000 7f190bfff640  1 -- 172.21.15.81:0/4069621609 --> [v2:172.21.15.143:6816/35047,v1:172.21.15.143:6817/35047] -- osd_op(unknown.0.0:1 188.7s2 188:fbfde494:::oid_MultiWritePP_:head [write 0~128 in=128b] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected+supports_pool_eio e412) v8 -- 0x7f1910013bf0 con 0x55a1c5c542c0

At that point, osd.7 was down and never came back up. Since this is an EC pool, the client keeps waiting for it; that is why the alarm popped.
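
For context, the hang pattern is an async write whose completion never fires because the EC PG cannot be served. A minimal sketch of what the test is effectively doing at that point, using the standard librados C++ API (object name and write size taken from the log above; pool setup and error handling omitted):

  #include <rados/librados.hpp>
  #include <string>

  // Issue one async 128-byte write and block on its completion, as the
  // test does. With osd.7 down and the EC PG incomplete, wait_for_complete()
  // never returns, so the workunit's alarm (SIGALRM) kills the process instead.
  void multi_write(librados::IoCtx& ioctx) {
    librados::bufferlist bl;
    bl.append(std::string(128, 'c'));  // matches [write 0~128 in=128b]
    librados::AioCompletion* c = librados::Rados::aio_create_completion();
    ioctx.aio_write("oid_MultiWritePP_", c, bl, bl.length(), 0);
    c->wait_for_complete();            // hangs here in this run
    c->release();
  }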

The teuthology.log shows why it went down:

2024-03-04T22:09:00.833 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.5 is failed for ~298s
2024-03-04T22:09:07.043 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.osd.5 is failed for ~304s
2024-03-04T22:09:07.043 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2024-03-04T22:09:08.252 INFO:tasks.ceph.osd.0:Sent signal 15
2024-03-04T22:09:08.253 INFO:tasks.ceph.osd.1:Sent signal 15
2024-03-04T22:09:08.253 INFO:tasks.ceph.osd.2:Sent signal 15
2024-03-04T22:09:08.253 INFO:tasks.ceph.osd.3:Sent signal 15
2024-03-04T22:09:08.254 INFO:tasks.ceph.osd.4:Sent signal 15
2024-03-04T22:09:08.254 INFO:tasks.ceph.osd.6:Sent signal 15
2024-03-04T22:09:08.254 INFO:tasks.ceph.osd.7:Sent signal 15
2024-03-04T22:09:08.254 INFO:tasks.ceph.mon.a:Sent signal 15
2024-03-04T22:09:08.255 INFO:tasks.ceph.mon.c:Sent signal 15
2024-03-04T22:09:08.255 INFO:tasks.ceph.mon.b:Sent signal 15
2024-03-04T22:09:08.255 INFO:tasks.ceph.mgr.y:Sent signal 15
2024-03-04T22:09:08.255 INFO:tasks.ceph.mgr.x:Sent signal 15

osd.5 was flagged as failed because valgrind detected leaks in it:

<error>
  <unique>0x241c8</unique>
  <tid>1</tid>
  <kind>Leak_PossiblyLost</kind>
  <xwhat>
    <text>32 bytes in 1 blocks are possibly lost in loss record 1,913 of 6,379</text>
    <leakedbytes>32</leakedbytes>
    <leakedblocks>1</leakedblocks>
  </xwhat>
  <stack>
    <frame>
      <ip>0x484982C</ip>
      <obj>/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so</obj>
      <fn>calloc</fn>
      <dir>/builddir/build/BUILD/valgrind-3.21.0/coregrind/m_replacemalloc</dir>
      <file>vg_replace_malloc.c</file>
      <line>1554</line>
    </frame>
    <frame>
      <ip>0x5A6C4E5</ip>
      <obj>/usr/lib64/libnl-3.so.200.26.0</obj>
      <fn>__trans_list_add</fn>
    </frame>
    <frame>
      <ip>0x59E5E6E</ip>
      <obj>/usr/lib64/libnl-route-3.so.200.26.0</obj>
    </frame>
    <frame>

Actions #3

Updated by Nitzan Mordechai about 2 months ago

  • Related to deleted (Bug #58130: LibRadosAio.SimpleWrite hang and pkill)
Actions #4

Updated by Nitzan Mordechai about 2 months ago

  • Assignee set to Nitzan Mordechai

Removed the "Related issues" link.

Actions #5

Updated by Radoslaw Zarzynski about 2 months ago

I think the direct reason behind the test's hang is the death of osd.5:

2024-03-04T22:03:54.749 INFO:tasks.ceph.osd.5.smithi143.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.1-8-gc103bed5/rpm/el9/BUILD/ceph-18.2.1-8-gc103bed5/src/messages/MOSDRepOp.h:127: virtual void MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed.
2024-03-04T22:03:54.802 INFO:tasks.ceph.osd.5.smithi143.stderr:*** Caught signal (Aborted) **
2024-03-04T22:03:54.802 INFO:tasks.ceph.osd.5.smithi143.stderr: in thread 28e96640 thread_name:tp_osd_tp
2024-03-04T22:03:54.870 INFO:tasks.ceph.osd.5.smithi143.stderr: ceph version 18.2.1-8-gc103bed5 (c103bed50be19710125780ba883a92de4e7944f0) reef (stable)
2024-03-04T22:03:54.870 INFO:tasks.ceph.osd.5.smithi143.stderr: 1: /lib64/libc.so.6(+0x54db0) [0x549fdb0]
2024-03-04T22:03:54.870 INFO:tasks.ceph.osd.5.smithi143.stderr: 2: /lib64/libc.so.6(+0xa154c) [0x54ec54c]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 3: raise()
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 4: abort()
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 5: /lib64/libc.so.6(+0x2871b) [0x547371b]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 6: /lib64/libc.so.6(+0x4dca6) [0x5498ca6]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 7: ceph-osd(+0x8a9a78) [0x9b1a78]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 8: (Message::encode(unsigned long, int, bool)+0x2e) [0xcc078e]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 9: (ProtocolV2::send_message(Message*)+0x241) [0xe58111]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 10: (AsyncConnection::send_message(Message*)+0x266) [0xe3bbe6]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x129) [0x68a429]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x70d) [0x9c89ad]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x67a) [0x9c91da]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x36f) [0x7d0faf]
2024-03-04T22:03:54.871 INFO:tasks.ceph.osd.5.smithi143.stderr: 15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x57) [0x7d6d97]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 16: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x519) [0x80ee09]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 17: ceph-osd(+0x554591) [0x65c591]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 18: ceph-osd(+0xe6ff4b) [0xf77f4b]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 19: (PrimaryLogPG::snap_trimmer(unsigned int)+0xc8) [0x7a3a58]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 20: (ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1f) [0x8d069f]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 21: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xd67) [0x6f1517]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 22: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x25b) [0xba604b]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 23: ceph-osd(+0xa9e5b4) [0xba65b4]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 24: /lib64/libc.so.6(+0x9f802) [0x54ea802]
2024-03-04T22:03:54.872 INFO:tasks.ceph.osd.5.smithi143.stderr: 25: clone()

It looks like a duplicate of https://tracker.ceph.com/issues/57845, which in turn has the same underlying root cause as https://tracker.ceph.com/issues/52657.

Actions #6

Updated by Radoslaw Zarzynski about 2 months ago

The asserting code in src/messages/MOSDRepOp.h:

  void encode_payload(uint64_t features) override {
    using ceph::encode;
    encode(map_epoch, payload);
    assert(HAVE_FEATURE(features, SERVER_OCTOPUS));  // the assertion that fired on osd.5
    // ...
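
For reference, HAVE_FEATURE is a plain feature-bit test, so this assertion fires whenever the features value handed to encode_payload() is missing the SERVER_OCTOPUS bit. A standalone sketch of that semantics (the mask value below is illustrative; the real bit assignments live in include/ceph_features.h):

  #include <cassert>
  #include <cstdint>

  // Hypothetical bit position, for illustration only.
  constexpr uint64_t SERVER_OCTOPUS_MASK = 1ULL << 57;

  // Same shape as Ceph's HAVE_FEATURE(x, name): every bit of the mask must be set.
  constexpr bool have_feature(uint64_t features, uint64_t mask) {
    return (features & mask) == mask;
  }

  int main() {
    uint64_t negotiated = SERVER_OCTOPUS_MASK;
    assert(have_feature(negotiated, SERVER_OCTOPUS_MASK));  // healthy peer: passes
    uint64_t stale = 0;                                     // e.g. features never/no longer negotiated
    assert(have_feature(stale, SERVER_OCTOPUS_MASK));       // fails and aborts, as on osd.5
  }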