Bug #57845

MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed

Added by Andreas Teuchert 4 months ago. Updated 3 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Oct 18 06:52:00 ceph-osd[22686]: osd.21 pg_epoch: 73436 pg[3.5b1( v 73435'7961629 (73434'7958499,73435'7961629] local-lis/les=73155/73156 n=544640 ec=66174/333 lis/c=73155/73155 les/c/f=73156/73156/0 sis=73436) [38,31] r=-1 lpr=73436 pi=[73155,73436)/1 crt=73435'7961629 lcod 73435'7961628 mlcod 73435'7961628 active+snaptrim_wait mbc={255={}} ps=[144e~1]] start_peering_interval up [21,38,31] -> [38,31], acting [21,38,31] -> [38,31], acting_primary 21 -> 38, up_primary 21 -> 38, role 0 -> -1, features acting 4540138320759226367 upacting 4540138320759226367
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: ceph-osd: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/messages/MOSDRepOp.h:127: virtual void MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed.
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]: *** Caught signal (Aborted) **
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  in thread 7f3a578f6700 thread_name:tp_osd_tp
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  1: /lib64/libpthread.so.0(+0x12ce0) [0x7f3a7d9a0ce0]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  2: gsignal()
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  3: abort()
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  4: /lib64/libc.so.6(+0x21c89) [0x7f3a7c5e8c89]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  5: /lib64/libc.so.6(+0x473a6) [0x7f3a7c60e3a6]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x560d676bf720]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  7: (Message::encode(unsigned long, int, bool)+0x2e) [0x560d67b39bde]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  8: (ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x560d67dc1764]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  9: (ProtocolV2::send_message(Message*)+0x3ae) [0x560d67dc1dbe]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  10: (AsyncConnection::send_message(Message*)+0x53e) [0x560d67d9e46e]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  11: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0xf5) [0x560d672d49b5]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  12: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x557) [0x560d676a8187]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  13: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xa94) [0x560d676aaa64]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  14: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x560d673f5c10]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  15: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x560d673f7e70]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  16: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x666) [0x560d674369d6]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  17: (boost::statechart::simple_state<PrimaryLogPG::AwaitAsyncWork, PrimaryLogPG::Trimming, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x121) [0x560d674bea61]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  18: (boost::statechart::state_machine<PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x560d67487604]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  19: (PrimaryLogPG::snap_trimmer(unsigned int)+0x17a) [0x560d673eba3a]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  20: (ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1f) [0x560d675a800f]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  21: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x560d67302a48]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  22: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x560d67a0ac84]
Oct 18 06:52:00 ceph-e0af78be-9a14-11ec-befe-6350198b95bb-osd-21[22642]:  23: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x560d67a0c024]
{
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f57b6d16ce0]",
        "gsignal()",
        "abort()",
        "/lib64/libc.so.6(+0x21c89) [0x7f57b595ec89]",
        "/lib64/libc.so.6(+0x473a6) [0x7f57b59843a6]",
        "(MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x561814cbf720]",
        "(Message::encode(unsigned long, int, bool)+0x2e) [0x561815139bde]",
        "(ProtocolV2::prepare_send_message(unsigned long, Message*)+0x44) [0x5618153c1764]",
        "(ProtocolV2::send_message(Message*)+0x3ae) [0x5618153c1dbe]",
        "(AsyncConnection::send_message(Message*)+0x53e) [0x56181539e46e]",
        "(OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0xf5) [0x5618148d49b5]",
        "(ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x557) [0x561814ca8187]",
        "(ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0xa94) [0x561814caaa64]",
        "(PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xc90) [0x5618149f5c10]",
        "(PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x120) [0x5618149f7e70]",
        "(PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork const&)+0x666) [0x561814a369d6]",
        "(boost::statechart::simple_state<PrimaryLogPG::AwaitAsyncWork, PrimaryLogPG::Trimming, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x121) [0x561814abea61]",
        "(boost::statechart::state_machine<PrimaryLogPG::SnapTrimmer, PrimaryLogPG::NotTrimming, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x74) [0x561814a87604]",
        "(PrimaryLogPG::snap_trimmer(unsigned int)+0x17a) [0x5618149eba3a]",
        "(ceph::osd::scheduler::PGSnapTrim::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1f) [0x561814ba800f]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x561814902a48]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56181500ac84]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56181500c024]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f57b6d0c1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.1",
    "crash_id": "2022-10-11T09:16:43.148993Z_2069c770-5f77-47ff-919d-30763b19d548",
    "entity_name": "osd.85",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "21e10e4b6c715f189d6d512a325c05c75d7b42bc66f53addd7fc78be8e33762c",
    "timestamp": "2022-10-11T09:16:43.148993Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-48-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#54~20.04.1-Ubuntu SMP Thu Sep 1 16:17:26 UTC 2022" 
}
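For context on what the assertion checks: `HAVE_FEATURE` tests whether a feature bit is set in the connection's negotiated feature mask, and `MOSDRepOp::encode_payload()` asserts the peer advertises `SERVER_OCTOPUS` before encoding. The sketch below is a simplified illustration of that pattern, not the actual Ceph source; the bit position and function names are placeholders.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder feature bit; the real value lives in ceph_features.h.
constexpr uint64_t SERVER_OCTOPUS = 1ULL << 57;

// HAVE_FEATURE-style check: is the feature bit set in the peer's mask?
inline bool have_feature(uint64_t features, uint64_t bit) {
  return (features & bit) == bit;
}

void encode_payload(uint64_t features) {
  // If the feature mask passed in is stale or zeroed (e.g. taken from a
  // connection whose peering interval just changed), this assertion
  // aborts the OSD, as seen in the backtrace above.
  assert(have_feature(features, SERVER_OCTOPUS));
  // ... encode message fields in the octopus+ wire format ...
}
```

The crash therefore points at the *features* argument being wrong for the connection the repop is sent over, rather than at the encoding itself.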

osd.21.log.gz (248 KB) Andreas Teuchert, 10/18/2022 09:01 AM


Related issues

Related to RADOS - Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. New
Related to RADOS - Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' In Progress

History

#2 Updated by Andreas Teuchert 4 months ago

Hi Neha,

the logs from the crash instance that I initially reported have already been rotated out on that node. However, there have been 14 more such crashes (affecting different OSDs on different nodes), the most recent ones this morning, and the corresponding logs are still available.

I used

journalctl --utc --no-hostname -S 08:30 -U 09:30 -u ceph-e0af78be-9a14-11ec-befe-6350198b95bb@osd.21

to capture the logs around the most recent crash:

    "ceph_version": "17.2.1",
    "crash_id": "2022-10-18T06:52:00.981608Z_f7fc4622-61c1-437e-844a-38f92aa21a93",
    "entity_name": "osd.21",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "21e10e4b6c715f189d6d512a325c05c75d7b42bc66f53addd7fc78be8e33762c",
    "timestamp": "2022-10-18T06:52:00.981608Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-48-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#54~20.04.1-Ubuntu SMP Thu Sep 1 16:17:26 UTC 2022" 

The output is attached (note: times in the crash info and log in UTC, times in the journalctl command in UTC+2). Is this the log you need? Please let me know if you need any further logs or information.

#3 Updated by Neha Ojha 3 months ago

  • Description updated (diff)

#4 Updated by Neha Ojha 3 months ago

  • Subject changed from OSD crash: MOSDRepOp::encode_payload(unsigned long) to MOSDRepOp::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed

#5 Updated by Neha Ojha 3 months ago

  • Description updated (diff)

#6 Updated by Radoslaw Zarzynski 3 months ago

Just before the crash, timeouts were seen:

Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b8fe700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5c900700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a570f5700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a578f6700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a588f8700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b0fd700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b8fe700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5c900700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: prioritycache tune_memory target: 4294967296 mapped: 4310818816 unmapped: 2048737280 heap: 6359556096 old mem: 2085789416 new mem: 2078613197
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a570f5700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a578f6700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a588f8700' had timed out after 15.000000954s
Oct 18 06:52:01 ceph-osd[22686]: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f3a5b0fd700' had timed out after 15.000000954s

Anyway, the repop was issued from the SnapTrimmer just as the primary assignment changed. We likely miss a check for that in the SnapTrimmer.
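A minimal sketch of the kind of guard this suggests (names are hypothetical, not the actual Ceph code paths): before the snap trimmer submits a repop, re-check that this OSD is still the PG's primary, since the peering interval may have changed underneath it.

```cpp
// Hypothetical illustration only; the real check would live in the
// PrimaryLogPG snap-trim state machine.
struct PGState {
  int whoami;          // this OSD's id
  int acting_primary;  // current acting primary for the PG
  bool is_primary() const { return whoami == acting_primary; }
};

bool maybe_issue_snaptrim_repop(const PGState& pg) {
  if (!pg.is_primary()) {
    // Interval changed (e.g. acting_primary 21 -> 38 in the log above):
    // bail out instead of encoding a repop against stale connection
    // features, which is what trips the HAVE_FEATURE assertion.
    return false;
  }
  // ... issue the repop ...
  return true;
}
```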

Let's talk about this case during the RADOS meeting.

#7 Updated by Neha Ojha 3 months ago

  • Related to Bug #53685: Assertion `HAVE_FEATURE(features, SERVER_OCTOPUS)' failed. added

#8 Updated by Neha Ojha 3 months ago

  • Status changed from Need More Info to New

Notes from rados team meeting:

This seems like the same class of bug we hit in https://tracker.ceph.com/issues/52657 and https://tracker.ceph.com/issues/53685. The next step is to investigate the coredump captured in the other bugs and verify what the features being set look like; more details in https://docs.google.com/document/d/1s-Vzv3yLTMSO8Hz_MHMg5ix1v53P4jlN6dX1L06yyls/edit#heading=h.tyvt1h1hbs56

#9 Updated by Radoslaw Zarzynski 3 months ago

  • Related to Bug #52657: MOSDPGLog::encode_payload(uint64_t): Assertion `HAVE_FEATURE(features, SERVER_NAUTILUS)' added

#10 Updated by Radoslaw Zarzynski 3 months ago

Likely it's even a duplicate of https://tracker.ceph.com/issues/52657.
