Bug #43106
mimic: crash in build_incremental_map_msg
Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Since upgrading from 13.2.6 to 13.2.7, we see this crash roughly once every 10 minutes on a cluster with 500 of its 1500 OSDs upgraded:
2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889
2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889 build_incremental_map_msg unable to load latest map 2758889
2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **
 in thread 7ff3a453a700 thread_name:tp_osd_tp
 ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
 1: (()+0xf5f0) [0x7ff3c620b5f0]
 2: (gsignal()+0x37) [0x7ff3c522b337]
 3: (abort()+0x148) [0x7ff3c522ca28]
 4: (OSDService::build_incremental_map_msg(unsigned int, unsigned int, OSDSuperblock&)+0x767) [0x555d60e8d797]
 5: (OSDService::send_incremental_map(unsigned int, Connection*, std::shared_ptr<OSDMap const>&)+0x39e) [0x555d60e8dbee]
 6: (OSDService::share_map_peer(int, Connection*, std::shared_ptr<OSDMap const>)+0x159) [0x555d60e8eda9]
 7: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x1a5) [0x555d60e8f085]
 8: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ObjectStore::Transaction&)+0x452) [0x555d6116e522]
 9: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x6f5) [0x555d6117ed85]
 10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0xd62) [0x555d60ff5142]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xf12) [0x555d61035902]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x3679) [0x555d610397a9]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x555d6103d869]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x555d60e8e8a7]
 15: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x555d611144c2]
 16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x555d60eb25f2]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7ff3c929f5b3]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7ff3c92a01a0]
 19: (()+0x7e65) [0x7ff3c6203e65]
 20: (clone()+0x6d) [0x7ff3c52f388d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
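To gauge how often a cluster hits this, the two error messages that precede the abort can be counted per OSD and epoch. The following is only a rough sketch: the regular expression is inferred from the four log lines quoted above (it is not an official Ceph log grammar), and the names `LINE_RE`, `scan`, and `SAMPLE` are hypothetical helpers made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this report (an assumption,
# not a documented Ceph format): timestamp, thread id, "-1", osd id,
# current epoch, the message, and the epoch that could not be loaded.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+-1 "
    r"osd\.(?P<osd>\d+) (?P<cur_epoch>\d+) "
    r"build_incremental_map_msg "
    r"(?P<kind>missing incremental map|unable to load latest map) "
    r"(?P<epoch>\d+)$"
)


def scan(lines):
    """Count matching messages, keyed by (osd id, epoch, message kind)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(int(m["osd"]), int(m["epoch"]), m["kind"])] += 1
    return counts


# The log lines from this report, plus the signal line, which must not match.
SAMPLE = [
    "2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889",
    "2019-12-03 15:53:51.817 7ff3a3d39700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889",
    "2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889 build_incremental_map_msg missing incremental map 2758889",
    "2019-12-03 15:53:51.817 7ff3a453a700 -1 osd.1384 2758889 build_incremental_map_msg unable to load latest map 2758889",
    "2019-12-03 15:53:51.822 7ff3a453a700 -1 *** Caught signal (Aborted) **",
]

counts = scan(SAMPLE)
for (osd, epoch, kind), n in sorted(counts.items()):
    print(f"osd.{osd} epoch {epoch}: {kind} x{n}")
```

To run it against a real log, pass an open file object to `scan` instead of `SAMPLE`; the log path depends on the deployment and is not given in this report.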
related: https://tracker.ceph.com/issues/38282 (backported in 13.2.7)
related: https://tracker.ceph.com/issues/38330 (not yet backported to mimic?!)
Updated by Neha Ojha over 4 years ago
I think you are right. We should have backported all three PRs according to https://tracker.ceph.com/issues/38040#note-3, but ended up only backporting one. https://github.com/ceph/ceph/pull/26413 got backported as part of https://tracker.ceph.com/issues/38282, but it looks like https://github.com/ceph/ceph/pull/26448 was missed.
Updated by Neha Ojha over 4 years ago
- Related to Bug #38330: osd/OSD.cc: 1515: abort() in Service::build_incremental_map_msg added
Updated by Nathan Cutler over 4 years ago
The three PRs that need to be backported to mimic are:
- https://github.com/ceph/ceph/pull/26340 - backported to mimic by https://github.com/ceph/ceph/pull/29242 (merged for 13.2.7)
- https://github.com/ceph/ceph/pull/26413 - backported to mimic by https://github.com/ceph/ceph/pull/31236 (merged for 13.2.7)
- https://github.com/ceph/ceph/pull/26448 - in the process of being backported to mimic, see https://github.com/ceph/ceph/pull/32000 (will be included in 13.2.8)
Updated by Neha Ojha over 4 years ago
- Subject changed from crash in build_incremental_map_msg to mimic: crash in build_incremental_map_msg
- Status changed from New to Resolved
Marking this resolved as all the backports are now in place.