Bug #65679

open

osd crashes due to inconsistency between the in-memory cache and on disk data of the snap mapper

Added by Xuehan Xu 19 days ago. Updated 19 days ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Unlike in classic OSDs, operations in crimson can be interrupted. The SnapMapper implementation follows the assumptions of classic OSDs and modifies its in-memory cache before the modifications are persisted to disk. When the PG interval changes and an in-flight operation is interrupted, the in-memory SnapMapper cache can therefore become inconsistent with the on-disk data. This inconsistency can in turn crash the crimson OSD when it tries to trim snaps.

DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd - snaptrim_event(id=1298, detail=SnapTrimEvent(pgid=3.6 snapid=a3 needs_pause=1)): async almost done line 101
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] bluestore - bluestore.OmapIteratorImpl(0x593f960) valid is at 0x00000000000000036000000000000000000004397E
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] bluestore - bluestore(/da1/var/lib/ceph/osd/ceph-2) omap_get_values 3.6_head oid #3:60000000:.internal_pg_local::snapmapper:head# = 0
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] bluestore - maybe_unpin 0x3000009fc800 #3:60000000:.internal_pg_local::snapmapper:head# touched
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd - snaptrim_event(id=1298, detail=SnapTrimEvent(pgid=3.6 snapid=a3 needs_pause=1)): trimming 3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:a5
INFO  2024-04-28 14:58:44,306 [shard 0:main] osd - PerShardState::start_operation_may_interrupt, snaptrimobj_subevent(id=1299, detail=SnapTrimObjSubEvent(coid=3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:a5 snapid=a3))
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd - snaptrim_event(id=1298, detail=SnapTrimEvent(pgid=3.6 snapid=a3 needs_pause=1)): awaiting completion
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd - snaptrimobj_subevent(id=1299, detail=SnapTrimObjSubEvent(coid=3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:a5 snapid=a3)): getting obc for 3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:a5
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd -  pg_epoch 499 pg[3.6( v 486'480 (0'0,486'480] local-lis/les=497/498 n=7 ec=15/15 lis/c=497/497 les/c/f=498/498/0 sis=497) [2,0,1] r=0 lpr=497 crt=486'481 mlcod 0'0 active+clean+snaptrim  ObjectContextLoader::with_head_obc: object 3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:head
DEBUG 2024-04-28 14:58:44,306 [shard 0:main] osd -  pg_epoch 499 pg[3.6( v 486'480 (0'0,486'480] local-lis/les=497/498 n=7 ec=15/15 lis/c=497/497 les/c/f=498/498/0 sis=497) [2,0,1] r=0 lpr=497 crt=486'481 mlcod 0'0 active+clean+snaptrim  ObjectContextLoader::get_or_load_obc: cache hit on 3:6bb7f7ea:::scephqa03.cpp.bjat.qianxin-inc.cn204052-10:head
ceph-osd: /home/xuxuehan/nvme/rpmbuild/BUILD/ceph-19.0.0-3238-g80a8b2cdb5e/src/crimson/osd/object_context_loader.cc:108: crimson::osd::ObjectContextLoader::with_clone_obc_direct<RWState::RWWRITE>(hobject_t, with_obc_func_t&&)::<lambda(auto:134, auto:135)> mutable [with auto:134 = boost::intrusive_ptr<crimson::osd::ObjectContext>; auto:135 = boost::intrusive_ptr<crimson::osd::ObjectContext>; crimson::interruptible::interruptible_errorator<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<2>))>, crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<84>))> > >::future<> = crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<2>))>, crimson::unthrowable_wrapper<const std::error_code&, ((const std::error_code&)(& crimson::ec<84>))> >::_future<crimson::errorated_future_marker<void> > >]: Assertion `cit != std::end(ss.clones)' failed.
Aborting on shard 0.
Backtrace:
Reactor stalled for 157 ms on shard 0. Backtrace: 0x2f0bb0d 0x2ec042d 0x2ec0778 0x2ec0907 0x12cdf 0x1692e3 0x1d0d761 0x1d0e6c7 0x1d0b466 0x1d0be40 0x1d0c2f8 0x12cdf 0x4ea4e 0x21db4 0x21c88 0x473a5 0x166cc6b 0x16577ad 0x16579d3 0x1657bee 0x165f0e3 0x165f530 0x165f73e 0x165fca3 0x1660178 0x1667ae3 0x1669670 0x1777ef7 0x1778d2d 0x1795477 0x179588b 0x17960b8 0x2eb82d7 0x2eb86f3 0x2ef9d25 0x2efaa6c 0x2e48bdc 0x2e49464 0x137722a 0x3aca2 0x13cc04d
kernel callstack:
 0# gsignal in /lib64/libc.so.6
 1# abort in /lib64/libc.so.6
 2# 0x00002B4649DCDC89 in /lib64/libc.so.6
 3# 0x00002B4649DF33A6 in /lib64/libc.so.6
 4# std::_Function_handler<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<2> >, crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<84> > >::_future<crimson::errorated_future_marker<void> > > (boost::intrusive_ptr<crimson::osd::ObjectContext>, boost::intrusive_ptr<crimson::osd::ObjectContext>), crimson::osd::ObjectContextLoader::with_clone_obc_direct<(RWState::State)2>(hobject_t, std::function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<2> >, crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<84> > >::_future<crimson::errorated_future_marker<void> > > (boost::intrusive_ptr<crimson::osd::ObjectContext>, boost::intrusive_ptr<crimson::osd::ObjectContext>)>&&)::{lambda(auto:1, auto:2)#1}>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<crimson::osd::ObjectContext>&&, std::_Any_data const&) in ceph-osd
 5# crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<2> >, crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<84> > >::_future<crimson::errorated_future_marker<void> > > seastar::futurize<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<2> >, crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<84> > >::_future<crimson::errorated_future_marker<void> > > >::invoke<crimson::osd::ObjectContextLoader::with_head_obc<(RWState::State)1>(boost::intrusive_ptr<crimson::osd::ObjectContext>, bool, std::function<crimson::interruptible::interruptible_future_detail<crimson::osd::IOInterruptCondition, crimson::errorator<crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<2> >, crimson::unthrowable_wrapper<std::error_code const&, crimson::ec<84> > >::_future<crimson::errorated_future_marker<void> > > (boost::intrusive_ptr<crimson::osd::ObjectContext>, boost::intrusive_ptr<crimson::osd::ObjectContext>)>&&)::{lambda()#1}::operator()() const::{lambda(auto:1)#1}, boost::intrusive_ptr<crimson::osd::ObjectContext> >({lambda()#1}&&, boost::intrusive_ptr<crimson::osd::ObjectContext>&&) in ceph-osd
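
To make the ordering problem in the description concrete, here is a minimal, self-contained sketch. It is not the SnapMapper code and not the change in the linked pull request: `ToyMapper`, `update_cache_first`, `update_persist_first` and the `interrupted` flag are hypothetical names invented for illustration, and the "disk" is just a second in-memory map. It only shows how updating a cache before the write is durable leaves the two views out of sync once the operation is interrupted, and how one possible alternative ordering avoids that.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins; none of these names come from the Ceph source.
using SnapId = unsigned;
using Store = std::map<std::string, std::set<SnapId>>;  // object -> snaps

struct ToyMapper {
  Store disk;   // state that has actually been persisted
  Store cache;  // in-memory view consulted by later operations

  // Cache-first ordering: the cache changes before the write is durable.
  // If the operation is interrupted, the cache keeps the unpersisted state.
  bool update_cache_first(const std::string& oid, std::set<SnapId> snaps,
                          bool interrupted) {
    cache[oid] = snaps;
    if (interrupted) {
      return false;  // the write never reaches disk
    }
    disk[oid] = std::move(snaps);
    return true;
  }

  // Persist-first ordering (an assumed alternative, for comparison only):
  // the cache is touched only after the write has landed.
  bool update_persist_first(const std::string& oid, std::set<SnapId> snaps,
                            bool interrupted) {
    if (interrupted) {
      return false;  // neither view was modified
    }
    disk[oid] = snaps;
    cache[oid] = std::move(snaps);
    return true;
  }

  // True when the in-memory view matches what is on disk.
  bool consistent() const { return cache == disk; }
};
```

With this toy model, an interrupted `update_cache_first` call leaves `consistent()` false, while an interrupted `update_persist_first` call leaves both maps untouched. A later reader that trusts the cache in the first case would look up state that was never persisted, which is the kind of mismatch the snap trim path trips over in the log above.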
Actions #1

Updated by Xuehan Xu 19 days ago

  • Description updated (diff)
Actions #2

Updated by Xuehan Xu 19 days ago

  • Pull request ID set to 57125
Actions #3

Updated by Xuehan Xu 19 days ago

  • Description updated (diff)