Bug #57845
openMOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed
0%
Description
Oct 18 06:52:00 ceph-osd[22686]: osd.21 pg_epoch: 73436 pg[3.5b1( v 73435'7961629 (73434'7958499,73435'7961629] local-lis/les=73155/73156 n=544640 ec=66174/333 lis/c=73155/73155 les/c/f=73156/73156/0 sis=73436) [38,31] r=-1 lpr=73436 pi=[73155,73436)/1 crt=73435'7961629 lcod 73435'7961628 mlcod 73435'7961628 active+snaptrim_wait mbc={255={}} ps=[144e~1]] start_peering_interval up [21,38,31] -> [38,31], acting [21,38,31] -> [38,31], acting_primary 21 -> 38, up_primary 21 -> 38, role 0 -> -1, features acting 4540138320759226367 upacting 4540138320759226367 Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/messages/MOSDRepOp.h:127: virtual void MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: *** Caught signal (Aborted) ** Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: in thread 7f3a578f6700 thread_name:tp_osd_tp Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable) Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f3a7d9a0ce0] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 2: gsignal() Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 3: abort() Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 4: /lib64/libc.so.6(+0x21c89) [0x7f3a7c5e8c89] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 5: /lib64/libc.so.6(+0x473a6) [0x7f3a7c60e3a6] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x560d676bf720] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 7: (Message::encode(unsigned long, int, bool)+0x2e) [0x560d67b39bde] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 8: (ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x560d67dc1764] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 9: (ProtocolV2::send_message(Message*)+0x3ae) [0x560d67dc1dbe] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 10: (AsyncConnection::send_message(Message*)+0x53e) [0x560d67d9e46e] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0xf5) [0x560d672d49b5] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x557) [0x560d676a8187] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xa94) [0x560d676aaa64] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x560d673f5c10] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x560d673f7e70] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 16: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x666) [0x560d674369d6] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 17: (boost::statechart::simple_state<PrimaryLogPG::AwaitAsyncWork, PrimaryLogPG::Trimming, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x121) [0x560d674bea61] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 18: (boost::statechart::state_machine<PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x560d67487604] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 19: (PrimaryLogPG::snap_trimmer(unsigned int)+0x17a) [0x560d673eba3a] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 20: (ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1f) [0x560d675a800f] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 21: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x560d67302a48] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 22: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x560d67a0ac84] Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: 23: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560d67a0c024]
{ "backtrace": [ "/lib64/libpthread.so.0(+0x12ce0) [0x7f57b6d16ce0]", "gsignal()", "abort()", "/lib64/libc.so.6(+0x21c89) [0x7f57b595ec89]", "/lib64/libc.so.6(+0x473a6) [0x7f57b59843a6]", "(MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x561814cbf720]", "(Message::encode(unsigned long, int, bool)+0x2e) [0x561815139bde]", "(ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x5618153c1764]", "(ProtocolV2::send_message(Message*)+0x3ae) [0x5618153c1dbe]", "(AsyncConnection::send_message(Message*)+0x53e) [0x56181539e46e]", "(OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0xf5) [0x5618148d49b5]", "(ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x557) [0x561814ca8187]", "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xa94) [0x561814caaa64]", "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x5618149f5c10]", "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x5618149f7e70]", "(PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x666) [0x561814a369d6]", "(boost::statechart::simple_state<PrimaryLogPG::AwaitAsyncWork, PrimaryLogPG::Trimming, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x121) [0x561814abea61]", "(boost::statechart::state_machine<PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x561814a87604]", "(PrimaryLogPG::snap_trimmer(unsigned int)+0x17a) [0x5618149eba3a]", "(ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1f) [0x561814ba800f]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x561814902a48]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56181500ac84]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56181500c024]", "/lib64/libpthread.so.0(+0x81ca) [0x7f57b6d0c1ca]", "clone()" ], "ceph_version": "17.2.1", "crash_id": "2022-10-11T09:16:43.148993Z_2069c770-5f77-47ff-919d-30763b19d548", "entity_name": "osd.85", "os_id": "centos", "os_name": "CentOS Stream", "os_version": "8", "os_version_id": "8", "process_name": "ceph-osd", "stack_sig": "21e10e4b6c715f189d6d512a325c05c75d7b42bc66f53addd7fc78be8e33762c", "timestamp": "2022-10-11T09:16:43.148993Z", "utsname_machine": "x86_64", "utsname_release": "5.15.0-48-generic", "utsname_sysname": "Linux", "utsname_version": "#54~20.04.1-Ubuntu SMP Thu Sep 1 16:17:26 UTC 2022" }
Files
Updated by Neha Ojha over 1 year ago
- Status changed from New to Need More Info
- Backport set to quincy
These reports in telemetry look similar: http://telemetry.front.sepia.ceph.com:4000/d/Nvj6XTaMk/spec-search?orgId=1&var-substr_1=PrimaryLogPG::snap_trimmer&var-substr_2=MOSDRepOp::encode_payload&var-substr_3=&var-minors_affected=&var-assert_function=&var-assert_condition=&var-total_results=2&var-sig_v1=&var-sig_v2=&var-daemons=&var-only_new_fingerprints=false&var-status_description=All&var-only_open=false
Do you have the log file from the crash?
Updated by Andreas Teuchert over 1 year ago
- File osd.21.log.gz osd.21.log.gz added
Hi Neha,
the logs from the crash instance that I reported initially are already rotated out on the particular node. However, there have been 14 more of those crashes (affecting different OSDs in different nodes), the most recent ones this morning and the corresponding logs are still there.
I used
journalctl --utc --no-hostname -S 08:30 -U 09:30 -u ceph-e0af78be-9a14-11ec-befe-6350198b95bb@osd.21
to capture the logs around the most recent crash:
"ceph_version": "17.2.1", "crash_id": "2022-10-18T06:52:00.981608Z_f7fc4622-61c1-437e-844a-38f92aa21a93", "entity_name": "osd.21", "os_id": "centos", "os_name": "CentOS Stream", "os_version": "8", "os_version_id": "8", "process_name": "ceph-osd", "stack_sig": "21e10e4b6c715f189d6d512a325c05c75d7b42bc66f53addd7fc78be8e33762c", "timestamp": "2022-10-18T06:52:00.981608Z", "utsname_machine": "x86_64", "utsname_release": "5.15.0-48-generic", "utsname_sysname": "Linux", "utsname_version": "#54~20.04.1-Ubuntu SMP Thu Sep 1 16:17:26 UTC 2022"
The output is attached (note: times in the crash info and log in UTC, times in the journalctl command in UTC+2). Is this the log you need? Please let me know if you need any further logs or information.
Updated by Neha Ojha over 1 year ago
- Subject changed from OSD crash: MOSDRepOp::encode_payload(unsigned long) to MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed
Updated by Radoslaw Zarzynski over 1 year ago
Just before the crash time-outs were seen:
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b8fe700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5c900700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a570f5700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a578f6700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a588f8700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b0fd700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b8fe700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5c900700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: prioritycache tune_memory target: 4294967296 mapped: 4310818816 unmapped: 2048737280 heap: 6359556096 old mem: 2085789416 new mem: 2078613197 Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a570f5700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a578f6700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a588f8700' had timed out after 15.000000954s Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b0fd700' had timed out after 15.000000954s
Anyway, the repop was issued from the SnapTrimmer
when the primary assignation got just change. Likely we miss a check for that in SnapTrimmer
.
Let's talk about this case during the RADOS meeting.
Updated by Neha Ojha over 1 year ago
- Related to Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. added
Updated by Neha Ojha over 1 year ago
- Status changed from Need More Info to New
Notes from rados team meeting:
Seems like the same class of bugs we hit in https://tracker.ceph.com/issues/52657 and https://tracker.ceph.com/issues/53685. The next step is to investigate the coredump captured in the other bugs and continue our investigation to verify what the features being set look like, more details in https://docs.google.com/document/d/1s-Vzv3yLTMSO8Hz_MHMg5ix1v53P4jlN6dX1L06yyls/edit#heading=h.tyvt1h1hbs56
Updated by Radoslaw Zarzynski over 1 year ago
- Related to Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' added
Updated by Radoslaw Zarzynski over 1 year ago
Likely it's even a duplicate of https://tracker.ceph.com/issues/52657.
Updated by Nitzan Mordechai 23 days ago
/a/yuriw-2024-03-25_20:36:27-rados-wip-yuri3-testing-2024-03-25-0744-distro-default-smithi/7621805