Bug #52657


MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)'

Added by Ronen Friedman over 2 years ago. Updated 3 days ago.

Status:
Fix Under Review
Priority:
High
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
reef,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Test: rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-recovery} 3-scrub-overrides/{max-simultaneous-scrubs-3} backoff/normal ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/crush-compat mon_election/connectivity msgr-failures/few msgr/async objectstore/filestore-xfs rados supported-random-distro$/{ubuntu_latest} thrashers/pggrow thrashosds-health workloads/admin_socket_objecter_requests}

Log: http://qa-proxy.ceph.com/teuthology/rfriedma-2021-09-18_10:17:46-rados-wip-rf-scrub-locations-distro-basic-smithi/6396045/teuthology.log

(gdb) bt
#0 raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x000055d0a10b34fa in ?? ()
#2 <signal handler called>
#3 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#4 0x00007f38694a7859 in __GI_abort () at abort.c:79
#5 0x00007f38694a7729 in __assert_fail_base (fmt=0x7f386963d588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55d0a1b80ac8 "HAVE_FEATURE(features, SERVER_NAUTILUS)",
file=0x55d0a1b80a88 "/build/ceph-17.0.0-7727-g52c52ec4/src/messages/MOSDPGLog.h", line=104, function=<optimized out>) at assert.c:92
#6 0x00007f38694b8f36 in __GI___assert_fail (assertion=0x55d0a1b80ac8 "HAVE_FEATURE(features, SERVER_NAUTILUS)", file=0x55d0a1b80a88 "/build/ceph-17.0.0-7727-g52c52ec4/src/messages/MOSDPGLog.h", line=104,
function=0x55d0a1b80a50 "virtual void MOSDPGLog::encode_payload(uint64_t)") at assert.c:101
#7 0x000055d0a09b38a1 in MOSDPGLog::encode_payload(unsigned long) ()
#8 0x000055d0a124d45e in Message::encode(unsigned long, int, bool) ()
#9 0x000055d0a14fb0b4 in ProtocolV2::prepare_send_message(unsigned long, Message*) ()
#10 0x000055d0a14fb7ae in ProtocolV2::send_message(Message*) ()
#11 0x000055d0a14cc90e in AsyncConnection::send_message(Message*) ()
#12 0x000055d0a0a1d7f4 in PG::send_cluster_message(int, boost::intrusive_ptr<Message>, unsigned int, bool) ()
#13 0x000055d0a0ca86fc in PeeringState::fulfill_log(pg_shard_t, pg_query_t const&, unsigned int) ()
#14 0x000055d0a0ca9386 in PeeringState::fulfill_query(MQuery const&, PeeringCtxWrapper&) ()
#15 0x000055d0a0ca94b0 in PeeringState::Stray::react(MQuery const&) ()
#16 0x000055d0a0d06a2f in boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*) ()
#17 0x000055d0a0a472c4 in boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&) ()
#18 0x000055d0a0a3315a in PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&) ()
#19 0x000055d0a09a64cd in OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&) ()
#20 0x000055d0a0c77c86 in ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&) ()
#21 0x000055d0a0990eda in OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) ()
#22 0x000055d0a11015f3 in ShardedThreadPool::shardedthreadpool_worker(unsigned int) ()
#23 0x000055d0a1104414 in ShardedThreadPool::WorkThreadSharded::entry() ()
#24 0x00007f3869a37609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#25 0x00007f38695a4293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
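
For readers unfamiliar with the feature-gate pattern, here is a minimal, self-contained sketch of the check that fires above. This is not the actual Ceph source: the mask value is hypothetical and the real HAVE_FEATURE macro lives in src/include/ceph_features.h. The point is that an encoder asserts the peer's negotiated feature bits include SERVER_NAUTILUS, so a features word of 0 aborts immediately.

#include <cassert>
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for CEPH_FEATUREMASK_SERVER_NAUTILUS.
constexpr uint64_t FEATURE_SERVER_NAUTILUS = 1ULL << 10;

// Ceph's HAVE_FEATURE(x, name) reduces to a mask test of this shape.
constexpr bool have_feature(uint64_t features, uint64_t mask) {
  return (features & mask) == mask;
}

void encode_payload(uint64_t features) {
  // As in MOSDPGLog::encode_payload(): a features word of 0 (nothing
  // negotiated yet) trivially fails the check and aborts the process.
  assert(have_feature(features, FEATURE_SERVER_NAUTILUS));
  std::cout << "encoding with features 0x" << std::hex << features << "\n";
}

int main() {
  encode_payload(FEATURE_SERVER_NAUTILUS);  // passes
  encode_payload(0);                        // aborts, as in this crash
  return 0;
}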


Related issues 3 (2 open, 1 closed)

Related to RADOS - Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. (New)

Related to RADOS - Bug #57845: MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed (New)

Has duplicate RADOS - Bug #52694: src/messages/MOSDPGLog.h: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed. (Duplicate)
Actions #1

Updated by Patrick Donnelly over 2 years ago

  • Project changed from CephFS to RADOS
Actions #2

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52694: src/messages/MOSDPGLog.h: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed. added
Actions #3

Updated by Neha Ojha over 2 years ago

  • Priority changed from Low to Urgent
2021-10-30T08:10:22.286 INFO:tasks.ceph.osd.2.smithi023.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-8627-g9f28acf7/rpm/el8/BUILD/ceph-17.0.0-8627-g9f28acf7/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.
2021-10-30T08:10:22.286 INFO:tasks.ceph.osd.2.smithi023.stderr:*** Caught signal (Aborted) **
2021-10-30T08:10:22.286 INFO:tasks.ceph.osd.2.smithi023.stderr: in thread 7f69f082f700 thread_name:tp_osd_tp
2021-10-30T08:10:22.293 INFO:tasks.ceph.osd.2.smithi023.stderr: ceph version 17.0.0-8627-g9f28acf7 (9f28acf795549b8e096fcdbf3f5b63592731a352) quincy (dev)
2021-10-30T08:10:22.294 INFO:tasks.ceph.osd.2.smithi023.stderr: 1: /lib64/libpthread.so.0(+0x12b20) [0x7f6a20254b20]
2021-10-30T08:10:22.294 INFO:tasks.ceph.osd.2.smithi023.stderr: 2: gsignal()
2021-10-30T08:10:22.294 INFO:tasks.ceph.osd.2.smithi023.stderr: 3: abort()
2021-10-30T08:10:22.294 INFO:tasks.ceph.osd.2.smithi023.stderr: 4: /lib64/libc.so.6(+0x21c89) [0x7f6a1ee9ec89]
2021-10-30T08:10:22.295 INFO:tasks.ceph.osd.2.smithi023.stderr: 5: /lib64/libc.so.6(+0x2fa76) [0x7f6a1eeaca76]
2021-10-30T08:10:22.295 INFO:tasks.ceph.osd.2.smithi023.stderr: 6: (MOSDPGLog::encode_payload(unsigned long)+0x389) [0x559ef2f98639]
2021-10-30T08:10:22.295 INFO:tasks.ceph.osd.2.smithi023.stderr: 7: (Message::encode(unsigned long, int, bool)+0x2e) [0x559ef379194e]
2021-10-30T08:10:22.295 INFO:tasks.ceph.osd.2.smithi023.stderr: 8: (ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x559ef3a15f44]
2021-10-30T08:10:22.295 INFO:tasks.ceph.osd.2.smithi023.stderr: 9: (ProtocolV2::send_message(Message*)+0x3ae) [0x559ef3a1659e]
2021-10-30T08:10:22.296 INFO:tasks.ceph.osd.2.smithi023.stderr: 10: (AsyncConnection::send_message(Message*)+0x536) [0x559ef39f2c66]
2021-10-30T08:10:22.296 INFO:tasks.ceph.osd.2.smithi023.stderr: 11: (PG::send_cluster_message(int, boost::intrusive_ptr<Message>, unsigned int, bool)+0xa5) [0x559ef2ff8a15]
2021-10-30T08:10:22.297 INFO:tasks.ceph.osd.2.smithi023.stderr: 12: (PeeringState::fulfill_log(pg_shard_t, pg_query_t const&, unsigned int)+0x2b4) [0x559ef324bd84]
2021-10-30T08:10:22.297 INFO:tasks.ceph.osd.2.smithi023.stderr: 13: (PeeringState::fulfill_query(MQuery const&, PeeringCtxWrapper&)+0x15e) [0x559ef324c48e]
2021-10-30T08:10:22.297 INFO:tasks.ceph.osd.2.smithi023.stderr: 14: (PeeringState::Stray::react(MQuery const&)+0x37) [0x559ef324c567]
2021-10-30T08:10:22.297 INFO:tasks.ceph.osd.2.smithi023.stderr: 15: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x231) [0x559ef329e841]
2021-10-30T08:10:22.298 INFO:tasks.ceph.osd.2.smithi023.stderr: 16: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x559ef301b254]
2021-10-30T08:10:22.298 INFO:tasks.ceph.osd.2.smithi023.stderr: 17: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d6) [0x559ef300f4d6]
2021-10-30T08:10:22.298 INFO:tasks.ceph.osd.2.smithi023.stderr: 18: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x175) [0x559ef2f85955]
2021-10-30T08:10:22.299 INFO:tasks.ceph.osd.2.smithi023.stderr: 19: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x559ef321c086]
2021-10-30T08:10:22.299 INFO:tasks.ceph.osd.2.smithi023.stderr: 20: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xac8) [0x559ef2f779b8]
2021-10-30T08:10:22.299 INFO:tasks.ceph.osd.2.smithi023.stderr: 21: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x559ef3667fb4]
2021-10-30T08:10:22.299 INFO:tasks.ceph.osd.2.smithi023.stderr: 22: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x559ef3668aa4]
2021-10-30T08:10:22.299 INFO:tasks.ceph.osd.2.smithi023.stderr: 23: /lib64/libpthread.so.0(+0x814a) [0x7f6a2024a14a]
2021-10-30T08:10:22.300 INFO:tasks.ceph.osd.2.smithi023.stderr: 24: clone()

/a/yuriw-2021-10-29_23:44:51-rados-wip-yuri-master-10.29.21-distro-basic-smithi/6468209

rados/thrash-erasure-code/{ceph clusters/{fixed-2 openstack} fast/normal mon_election/connectivity msgr-failures/few objectstore/filestore-xfs rados recovery-overrides/{more-active-recovery} supported-random-distro$/{centos_8.stream} thrashers/fastread thrashosds-health workloads/ec-radosbench}

Actions #4

Updated by Neha Ojha over 2 years ago

  • Subject changed from QA: test asserts on HAVE_FEATURE(features, SERVER_NAUTILUS) to MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)'
Actions #5

Updated by Aishwarya Mathuria over 2 years ago

  • Assignee set to Aishwarya Mathuria
Actions #6

Updated by Neha Ojha over 2 years ago

/a/ksirivad-2021-11-19_19:14:07-rados-wip-autoscale-profile-scale-up-default-distro-basic-smithi/6514251

Actions #7

Updated by Josh Durgin over 2 years ago

  • Related to Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. added
Actions #8

Updated by Neha Ojha over 2 years ago

  • Related to Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. added
Actions #9

Updated by Neha Ojha over 2 years ago

  • Related to deleted (Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed.)
Actions #10

Updated by Aishwarya Mathuria about 2 years ago

  • Status changed from New to In Progress
Actions #11

Updated by Neha Ojha about 2 years ago

/ceph/teuthology-archive/yuriw-2022-02-15_22:35:42-rados-wip-yuri8-testing-2022-02-15-1214-distro-default-smithi/6686017/remote/smithi148/coredump - there is a coredump for this

Actions #12

Updated by Laura Flores almost 2 years ago

This one looks somewhat different from the other reported failures. First of all, it failed on a rados/verify test, not a rados/thrash test:

Description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

/a/yuriw-2022-04-25_14:14:44-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6805194

The first assert hit was the "HAVE_FEATURE(features, SERVER_NAUTILUS)" failure:

2022-04-25T14:58:51.945 INFO:tasks.ceph.osd.1.smithi066.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-58-ge2161634/rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.
2022-04-25T14:58:51.945 INFO:tasks.ceph.osd.1.smithi066.stderr:*** Caught signal (Aborted) **
2022-04-25T14:58:51.946 INFO:tasks.ceph.osd.1.smithi066.stderr: in thread 2ed2d700 thread_name:tp_osd_tp
2022-04-25T14:58:51.946 INFO:tasks.ceph.osd.4.smithi117.stderr:2022-04-25T14:58:51.772+0000 ee74700 -1 received  signal: Hangup from /usr/bin/python3 /bin/daemon-helper term env OPENSSL_ia32cap=~0x1000000000000000 valgrind --trace-children=no --child-silent-after-fork=yes --soname-synonyms=somalloc=*tcmalloc* --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.4.log --time-stamp=yes --vgdb=yes --exit-on-first-error=yes --error-exitcode=42 --tool=memcheck ceph-osd -f --cluster ceph -i 4  (PID: 26088) UID: 0
2022-04-25T14:58:51.947 INFO:teuthology.orchestra.run.smithi066.stderr:2022-04-25T14:58:51.761+0000 7f37585c3700 10 start_mon_command cmd=[{"prefix": "osd set", "key": "noscrub"}]
2022-04-25T14:58:51.947 INFO:teuthology.orchestra.run.smithi066.stderr:2022-04-25T14:58:51.761+0000 7f37585c3700 10 monclient: _send_command 2 [{"prefix": "osd set", "key": "noscrub"}]
2022-04-25T14:58:51.947 INFO:teuthology.orchestra.run.smithi066.stderr:2022-04-25T14:58:51.761+0000 7f37585c3700 10 monclient: _send_mon_message to mon.c at v2:172.21.15.66:3301/0
2022-04-25T14:58:51.947 INFO:teuthology.orchestra.run.smithi066.stderr:2022-04-25T14:58:51.761+0000 7f37585c3700  1 -- 172.21.15.66:0/221244016 --> [v2:172.21.15.66:3301/0,v1:172.21.15.66:6790/0] -- mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1 -- 0x7f375013a210 con 0x7f37500a6260
2022-04-25T14:58:51.948 INFO:tasks.ceph.osd.1.smithi066.stderr: ceph version 17.2.0-58-ge2161634 (e21616343a03209745d1ef7ffb6304159addf4fd) quincy (stable)
2022-04-25T14:58:51.948 INFO:tasks.ceph.osd.1.smithi066.stderr: 1: /lib64/libpthread.so.0(+0x12ce0) [0x682dce0]
2022-04-25T14:58:51.948 INFO:tasks.ceph.osd.1.smithi066.stderr: 2: gsignal()
2022-04-25T14:58:51.948 INFO:tasks.ceph.osd.1.smithi066.stderr: 3: abort()
2022-04-25T14:58:51.949 INFO:tasks.ceph.osd.1.smithi066.stderr: 4: /lib64/libc.so.6(+0x21c89) [0x7a5dc89]
2022-04-25T14:58:51.949 INFO:tasks.ceph.osd.1.smithi066.stderr: 5: /lib64/libc.so.6(+0x473a6) [0x7a833a6]
2022-04-25T14:58:51.949 INFO:tasks.ceph.osd.1.smithi066.stderr: 6: (MOSDPGLog::encode_payload(unsigned long)+0x389) [0x82a4e9]
2022-04-25T14:58:51.949 INFO:tasks.ceph.osd.1.smithi066.stderr: 7: (Message::encode(unsigned long, int, bool)+0x2e) [0x103db9e]
2022-04-25T14:58:51.949 INFO:tasks.ceph.osd.1.smithi066.stderr: 8: (ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x12c3a94]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 9: (ProtocolV2::send_message(Message*)+0x3ae) [0x12c40ee]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 10: (AsyncConnection::send_message(Message*)+0x53e) [0x12a079e]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 11: (PG::send_cluster_message(int, boost::intrusive_ptr<Message>, unsigned int, bool)+0xa5) [0x88b5e5]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 12: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x17f0) [0xae0740]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 13: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x336) [0xb02176]
2022-04-25T14:58:51.950 INFO:tasks.ceph.osd.1.smithi066.stderr: 14: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Active, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb6) [0xb32a36]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 15: (boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0xb32eca]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 16: (boost::statechart::simple_state<PeeringState::GetMissing, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb9) [0xb332e9]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 17: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x16a) [0x8adf2a]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 18: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d6) [0x8a2086]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 19: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x175) [0x8175a5]
2022-04-25T14:58:51.951 INFO:tasks.ceph.osd.1.smithi066.stderr: 20: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0xaaeaa6]
2022-04-25T14:58:51.952 INFO:tasks.ceph.osd.1.smithi066.stderr: 21: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x8092c8]
2022-04-25T14:58:51.952 INFO:tasks.ceph.osd.1.smithi066.stderr: 22: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0xf0ef44]
2022-04-25T14:58:51.952 INFO:tasks.ceph.osd.1.smithi066.stderr: 23: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0xf102e4]
2022-04-25T14:58:51.952 INFO:tasks.ceph.osd.1.smithi066.stderr: 24: /lib64/libpthread.so.0(+0x81cf) [0x68231cf]
2022-04-25T14:58:51.952 INFO:tasks.ceph.osd.1.smithi066.stderr: 25: clone()

After the traceback, it hits another assert failure:

rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/mon/Monitor.cc: In function 'virtual Monitor::~Monitor()' thread 4044dc0 time 2022-04-25T15:04:03.611071+0000
2022-04-25T15:04:03.616 INFO:tasks.ceph.mon.a.smithi066.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-58-ge2161634/rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/mon/Monitor.cc: 285: FAILED ceph_assert(session_map.sessions.empty())
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: ceph version 17.2.0-58-ge2161634 (e21616343a03209745d1ef7ffb6304159addf4fd) quincy (stable)
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x50ca024]
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x283245) [0x50ca245]
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 3: (Monitor::~Monitor()+0xb35) [0x424d15]
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 4: (Monitor::~Monitor()+0xd) [0x424d6d]
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 5: main()
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 6: __libc_start_main()
2022-04-25T15:04:03.664 INFO:tasks.ceph.mon.a.smithi066.stderr: 7: _start()

Finally, two LibRadosAioEC.RoundTripPPx tests fail:

2022-04-25T16:49:03.212 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-58-ge2161634/rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/test/librados/aio_cxx.cc:1394: Failure
2022-04-25T16:49:03.212 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: Expected equality of these values:
2022-04-25T16:49:03.212 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   "" 
2022-04-25T16:49:03.212 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   test_data.init()
2022-04-25T16:49:03.212 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:     Which is: "create_one_ec_pool(test-rados-api-smithi066-35670-47) failed: error cluster.connect failed with error -110" 
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: [  FAILED  ] LibRadosAioEC.RoundTripPP2 (300078 ms)
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: [ RUN      ] LibRadosAioEC.RoundTripPP3
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-58-ge2161634/rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/test/librados/aio_cxx.cc:1427: Failure
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: Expected equality of these values:
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   "" 
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   create_one_pool_pp(pool_name, cluster)
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:     Which is: "cluster.connect failed with error -110" 
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: [  FAILED  ] LibRadosAioEC.RoundTripPP3 (3000023 ms)
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: [ RUN      ] LibRadosAioEC.RoundTripSparseReadPP
2022-04-25T16:49:03.213 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-58-ge2161634/rpm/el8/BUILD/ceph-17.2.0-58-ge2161634/src/test/librados/aio_cxx.cc:1529: Failure
2022-04-25T16:49:03.214 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp: Expected equality of these values:
2022-04-25T16:49:03.214 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   "" 
2022-04-25T16:49:03.214 INFO:tasks.workunit.client.0.smithi066.stdout:               api_aio_pp:   test_data.init()

Actions #13

Updated by Radoslaw Zarzynski almost 2 years ago

Let's discuss this on the next RADOS Team Meeting.

Actions #14

Updated by Laura Flores almost 2 years ago

Laura Flores wrote:

This one looks somewhat different from the other reported failures. First of all, it failed on a rados/verify test, not a rados/thrash test:

Description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

/a/yuriw-2022-04-25_14:14:44-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6805194

The first assert hit was the "HAVE_FEATURE(features, SERVER_NAUTILUS)" failure:
[...]

After the Traceback, it hits another assert failure:
[...]

Finally, two LibRadosAioEC.RoundTripPPx tests fail:
[...]

This actually does happen in rados/verify tests: #52694.

Actions #15

Updated by Aishwarya Mathuria almost 2 years ago

Scheduled another run with just the rados/verify test that failed, and I can see this happen frequently:

/a/amathuri-2022-04-29_07:12:13-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6812805
/a/amathuri-2022-04-29_07:12:13-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6812788
/a/amathuri-2022-04-29_07:12:13-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6812827

Also saw a few occurrences of https://tracker.ceph.com/issues/53685:
/a/amathuri-2022-04-29_07:12:13-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6812765
/a/amathuri-2022-04-29_07:12:13-rados-wip-yuri3-testing-2022-04-22-0534-quincy-distro-default-smithi/6812798

None of them hit the second assert failure or the LibRadosAioEC test failures.

Actions #16

Updated by Radoslaw Zarzynski almost 2 years ago

Just for the record: we suspect the issue is related to the error injection in the async msgr. Some runs without it are supposed to be scheduled.

Actions #17

Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-05-19_18:50:25-rados-wip-yuri4-testing-2022-05-19-0831-quincy-distro-default-smithi/6841763
Description: rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

2022-05-19T22:55:48.190 INFO:tasks.ceph.osd.6.smithi089.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-214-gb92cfd94/rpm/el8/BUILD/ceph-17.2.0-214-gb92cfd94/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.
2022-05-19T22:55:48.225 INFO:tasks.ceph.osd.6.smithi089.stderr:ceph-osd: /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0-214-gb92cfd94/rpm/el8/BUILD/ceph-17.2.0-214-gb92cfd94/src/messages/MOSDPGLog.h:104: virtual void MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' failed.

Actions #19

Updated by Radoslaw Zarzynski almost 2 years ago

https://tracker.ceph.com/issues/53685 shows the issue is not restricted just to MOSDPGLog.

Actions #20

Updated by Aishwarya Mathuria almost 2 years ago

I have been able to easily reproduce this by running the following test:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/default/{default thrashosds-health} mon_election/connectivity msgr-failures/few msgr/async-v2only objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

However, when I tried to run this test without valgrind, I no longer encountered the assertion failure.

Actions #21

Updated by Radoslaw Zarzynski almost 2 years ago

  • Priority changed from Urgent to High

Yeah, this clearly looks like a race condition (likely around lifetime management).

Lowering to High as it happens under Valgrind.

Actions #22

Updated by Radoslaw Zarzynski almost 2 years ago

Hello Aishwarya! How about coworking on this? Ping me when you have time.

Actions #23

Updated by Neha Ojha over 1 year ago

/a/yuriw-2022-09-15_17:53:16-rados-quincy-release-distro-default-smithi/7034203/

Actions #24

Updated by Brad Hubbard over 1 year ago

I was able to gather a coredump from this run that Laura started in an attempt to reproduce this issue, and set up a binary-compatible environment to debug it:

https://pulpito.ceph.com/lflores-2022-10-25_18:29:52-upgrade:quincy-p2p:quincy-p2p-stress-split-quincy-distro-default-smithi/7080821/

(gdb) f 8
#8  0x000055e255fc8791 in MOSDPGLog::encode_payload (this=0x55e259c3c800, features=0) at ./src/include/encoding.h:86
86      WRITE_RAW_ENCODER(__u8)
(gdb) p features
$1 = 0
(gdb) up
#9  0x000055e256888f5e in Message::encode (this=this@entry=0x55e259c3c800, features=features@entry=0, crcflags=crcflags@entry=0, skip_header_crc=skip_header_crc@entry=false) at ./src/msg/Message.cc:234
234         encode_payload(features);
(gdb) p features
$2 = 0
(gdb) up
#10 0x000055e256b3ca44 in ProtocolV2::prepare_send_message (this=<optimized out>, features=0, m=0x55e259c3c800) at ./src/msg/async/ProtocolV2.cc:425
425       m->encode(features, 0);
(gdb) p features
$3 = 0
(gdb) up
#11 0x000055e256b3d13e in ProtocolV2::send_message (this=0x55e261ad0b00, m=0x55e259c3c800) at ./src/msg/async/ProtocolV2.cc:435
435         prepare_send_message(f, m);
(gdb) p f
$9 = 0
(gdb) l ProtocolV2::send_message
423
424       // encode and copy out of *m
425       m->encode(features, 0);
426     }
427
428     void ProtocolV2::send_message(Message *m) {
429       uint64_t f = connection->get_features();
430
431       // TODO: Currently not all messages supports reencode like MOSDMap, so here
432       // only let fast dispatch support messages prepare message
(gdb) p connection->features
$10 = 4540138303579357183
(gdb) p connection          
$11 = (AsyncConnection *) 0x55e25f6ca400

It's not clear at all to me how 4540138303579357183 becomes 0 between the call to connection->get_features() and the call to prepare_send_message() a few lines later, but I guess that's the trick, right? I dumped out the in-memory log from the coredump, but it wasn't very informative AFAICT. We might get more insight by repeating this with debug_ms=20, at least for the in-memory log.
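
One toy illustration of such a window (this is not Ceph code; the thread structure and names are assumptions): if the sender's read of the features word can race the store that sets it, the sender observes 0 even though the field holds the nonzero value by the time the coredump is taken.

#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>

std::atomic<uint64_t> features{0};  // 0 until the handshake stores it

void sender() {
  // If this runs before the handshake thread's store, f == 0 reaches
  // encode_payload(), yet a later coredump shows the field as nonzero.
  uint64_t f = features.load();
  std::cout << "send_message saw features = " << f << "\n";
}

void handshake() {
  features.store(4540138303579357183ULL);  // the value seen in the coredump
}

int main() {
  std::thread t1(sender);
  std::thread t2(handshake);
  t1.join();
  t2.join();
  std::cout << "features at 'coredump time' = " << features.load() << "\n";
  return 0;
}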

Actions #25

Updated by Brad Hubbard over 1 year ago

Still trying to run a test with added debugging due to the ongoing infra issues, but I noticed that Coverity CID 1509605 warns that connection_features is not initialised during the ProtocolV2 object's construction, and that at several points, including when we are reusing a connection, we call connection->set_features(connection_features). We may therefore have a race where we are calling set_features() with connection_features == 0 when that is unexpected. A full debug log will hopefully make the exact nature of this more obvious.
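
For reference, a minimal sketch of that Coverity pattern and the usual remedy, a default member initializer. The class and member names here are illustrative, not the actual ProtocolV2 fields.

#include <cstdint>

class ProtocolSketch {
  // Without the "= 0", this member would hold garbage until the handshake
  // assigns it, which is what a CID like this flags. Initializing it makes
  // the "not yet negotiated" state explicit and testable.
  uint64_t connection_features = 0;
public:
  void on_handshake_done(uint64_t f) { connection_features = f; }
  bool negotiated() const { return connection_features != 0; }
  uint64_t features() const { return connection_features; }
};

int main() {
  ProtocolSketch p;
  return p.negotiated() ? 1 : 0;  // 0: features not yet set
}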

Actions #26

Updated by Aishwarya Mathuria over 1 year ago

Hi Brad, thanks for all the pointers on the tracker!
I went through the code with Josh and Radek after looking at your findings. We suspect that a connection might be getting added to the "conns" map (which stores all the accepted connections) before we call set_features() on it. (https://github.com/ceph/ceph/blob/main/src/msg/async/AsyncMessenger.cc#L867)

In ProtocolV2::send_server_ident() (https://github.com/ceph/ceph/blob/main/src/msg/async/ProtocolV2.cc#L2864):

  ssize_t r = messenger->accept_conn(connection);

  connection->inject_delay();

  connection->lock.lock();

  if (r < 0) {
    ldout(cct, 1) << __func__ << " existing race replacing process for addr = " 
                  << connection->peer_addrs->msgr2_addr()
                  << " just fail later one(this)" << dendl;
    connection->inject_delay();
    return _fault();
  }
  if (state != SESSION_ACCEPTING) {
    ldout(cct, 1) << __func__
                  << " state changed while accept_conn, it must be mark_down" 
                  << dendl;
    ceph_assert(state == CLOSED || state == NONE);
    messenger->unregister_conn(connection);
    connection->inject_delay();
    return _fault();
  }

  connection->set_features(connection_features);

That is, accept_conn() is called before connection->set_features(connection_features), so the connection becomes visible in the "conns" map before its features are valid (see the sketch below).
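
Assuming that suspicion is right, here is a minimal sketch of the publication-before-initialization window and the reordering that would close it. The types, names, and locking are illustrative, not the actual AsyncMessenger/ProtocolV2 API, and the real accept_conn() has failure paths that a simple reorder would have to respect.

#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct ConnSketch {
  uint64_t features = 0;
  void set_features(uint64_t f) { features = f; }
};

std::mutex conns_lock;
std::map<std::string, std::shared_ptr<ConnSketch>> conns;  // accepted conns

// Mirrors the ordering in send_server_ident(): the connection becomes
// visible to other threads before its features are set, so a concurrent
// sender can pick it up and encode with features == 0.
void accept_then_set(std::shared_ptr<ConnSketch> c, uint64_t f) {
  {
    std::lock_guard<std::mutex> l(conns_lock);
    conns["peer"] = c;  // published here...
  }
  c->set_features(f);   // ...but features only become valid here
}

// The obvious reordering: finish initializing before publishing.
void set_then_accept(std::shared_ptr<ConnSketch> c, uint64_t f) {
  c->set_features(f);
  std::lock_guard<std::mutex> l(conns_lock);
  conns["peer"] = c;
}

int main() {
  auto c = std::make_shared<ConnSketch>();
  set_then_accept(c, 0x1234abcdULL);  // illustrative feature bits
  return 0;
}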

Actions #27

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #57845: MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed added
Actions #28

Updated by Laura Flores over 1 year ago

/a/yuriw-2022-10-17_17:31:25-rados-wip-yuri7-testing-2022-10-17-0814-distro-default-smithi/7071031

Actions #29

Updated by Aishwarya Mathuria over 1 year ago

We suspect that this assert failure is hit when we try to encode a message before the connection has reached a state where the features have been set.
Trying out a potential fix; will update this tracker with the results.

Actions #30

Updated by Ben Gao over 1 year ago

Aishwarya Mathuria wrote:

We suspect that this assert failure is hit in cases when we try to encode a message before the connection is in a state where the features have been set.
Trying out a potential fix, will update this tracker with the results.

Calling ProtocolV2::connect() puts the object in the START_CONNECT state, and a sequence of handshake frames (including the server identity frame) still has to be exchanged before the connection reaches the ready state. The race condition, then, is that sending a message over the connection should not be allowed yet: only once the connection is ready should it be used for message exchange. A minimal sketch of such a guard follows.
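
This sketch queues outgoing messages until the state machine reaches READY instead of encoding them mid-handshake. The state names loosely follow ProtocolV2, but everything here is illustrative, not the actual implementation.

#include <cstdint>
#include <iostream>
#include <queue>
#include <string>

enum class State { START_CONNECT, SESSION_ACCEPTING, READY };

struct ConnSketch {
  State state = State::START_CONNECT;
  uint64_t features = 0;
  std::queue<std::string> pending;  // messages queued during the handshake

  void send_message(const std::string& m) {
    if (state != State::READY) {
      pending.push(m);  // defer: features are not negotiated yet
      return;
    }
    encode_and_send(m);
  }

  void on_handshake_done(uint64_t negotiated) {
    features = negotiated;
    state = State::READY;
    while (!pending.empty()) {  // flush with valid features
      encode_and_send(pending.front());
      pending.pop();
    }
  }

  void encode_and_send(const std::string& m) {
    std::cout << "encode(" << m << ") with features " << features << "\n";
  }
};

int main() {
  ConnSketch c;
  c.send_message("pg_log");         // queued, not encoded
  c.on_handshake_done(0xdeadbeef);  // flushed with nonzero features
  return 0;
}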

Actions #31

Updated by Nitzan Mordechai over 1 year ago

I'm having the same BT in my tests:
/a/nmordech-2022-12-06_13:26:40-rados:thrash-erasure-code-wip-nitzan-peering-auth-log-shard-change-in-interval-distro-default-smithi/7105300

The core dump in that case points to ftrylockfile.c, but the backtrace that we print in the log is the same.

Actions #32

Updated by Laura Flores 9 months ago

/a/yuriw-2023-07-28_23:11:59-rados-reef-release-distro-default-smithi/7356250

Actions #33

Updated by Laura Flores 9 months ago

  • Backport set to reef
Actions #34

Updated by Radoslaw Zarzynski 8 months ago

Bump up (TODO: review the 2nd PR).

Actions #35

Updated by Laura Flores 4 months ago

/a/yuriw-2024-01-03_01:21:07-rados-wip-yuri3-testing-2024-01-02-1236-quincy-distro-default-smithi/7504535

Actions #36

Updated by Laura Flores 4 months ago

  • Backport changed from reef to reef,quincy
Actions #37

Updated by Radoslaw Zarzynski 3 months ago

I'm pretty sure I was reviewing PRs against the ProtocolV2/protocol code for handling the policy update properly. Aishwarya, could you please correct me if I'm wrong, or link the PRs here?

Actions #39

Updated by Radoslaw Zarzynski 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 49619
Actions #40

Updated by Radoslaw Zarzynski 3 months ago

Fix reviewed, refactoring is needed.

Actions #41

Updated by Radoslaw Zarzynski 2 months ago

Waiting for changes in the PR. Bump up.

Actions #42

Updated by Radoslaw Zarzynski about 2 months ago

Bump up.

Actions #43

Updated by Laura Flores about 2 months ago

/a/yuriw-2024-02-28_22:39:54-rados-wip-yuri8-testing-2024-02-22-0734-reef-distro-default-smithi/7576288

Actions #44

Updated by Radoslaw Zarzynski about 2 months ago

Note from bug scrub: Aishwarya is addressing the review comments.

Actions #45

Updated by Radoslaw Zarzynski about 1 month ago

Bump up.

Actions #46

Updated by Radoslaw Zarzynski about 1 month ago

Bump up.

Actions #47

Updated by Radoslaw Zarzynski about 1 month ago

Review in progress.

Actions #48

Updated by Radoslaw Zarzynski 17 days ago

Comments addressed, re-review requested.

Actions #49

Updated by Radoslaw Zarzynski 17 days ago

Fix approved.

Actions #50

Updated by Radoslaw Zarzynski 10 days ago

Waiting for upstream QA.

Actions #51

Updated by Radoslaw Zarzynski 3 days ago

Yuri provided an update. Still in QA.
