Bug #56661
Quincy: OSD crashing one after another with data loss with ceph_assert_fail
Description
About two weeks after an upgrade from an Octopus setup to Quincy, the SSD pool reported one OSD down in the middle of the night while the weekly cleanup cron jobs of the application on top were running. The OSD was a datacenter NVMe SSD (8x4TB) in a 3-replica writeback cache overlay pool.
Strangely, immediately after manually restarting the affected daemon, another OSD on a different machine went down while the first one stayed up. The service started flapping after a few minutes, which resulted in a third OSD down while the pool tried to resilver. The issue seemed to get worse the more I/O was issued on the pool. After a few hours the writeback pool suffered data loss, as too many OSDs were offline. Luckily only the SSDs were affected; the HDD OSDs on the same machines stayed online.
Deployed with cephadm on Debian/11 with Docker
As this is a production Ceph cluster, I was unable to keep it in that state. At the moment the SSDs are completely removed from the cluster and not in use, as we cannot be sure that the HDD OSDs will not hit this issue as well.
Two logs from two OSDs:
debug -42> 2022-07-16T10:15:48.508+0000 7f39bb59d700 3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732] debug -41> 2022-07-16T10:15:48.512+0000 7f39bb59d700 3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732] debug -40> 2022-07-16T10:15:48.552+0000 7f39bb59d700 3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732] debug -39> 2022-07-16T10:15:48.736+0000 7f39c339f700 5 prioritycache tune_memory target: 4294967296 mapped: 530841600 unmapped: 11681792 heap: 542523392 old mem: 2845415832 new mem: 2845415832 debug -38> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: tick debug -37> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-07-16T10:15:18.775136+0000) debug -36> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 log_client log_queue is 1 last_log 3 sent 2 num 1 unsent 1 sending 1 debug -35> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 log_client will send 2022-07-16T10:15:48.508321+0000 osd.67 (osd.67) 3 : cluster [INF] 11.13 continuing backfill to osd.66 from (29574'128239589,30566'128249589] MIN to 30566'128249589 debug -34> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: _send_mon_message to mon.cle02-18 at v2:172.16.0.18:3300/0 debug -33> 2022-07-16T10:15:49.284+0000 7f39bb59d700 3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732] debug -32> 2022-07-16T10:15:49.296+0000 7f39b0d88700 5 osd.67 pg_epoch: 30732 pg[11.13( v 30566'128249589 lc 29631'128242596 (29574'128239552,30566'128249589] local-lis/les=30731/30732 n=36430 ec=15753/15753 lis/c=30731/28565 les/c/f=30732/28566/0 sis=30731) [65,66,67]/[67,65] backfill=[66] r=0 lpr=30731 pi=[28565,30731)/4 crt=30566'128249589 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped m=424 mbc={255={(1+0)=2021}}] on_activate_complete: bft=66 from MIN debug -31> 2022-07-16T10:15:49.296+0000 7f39b0d88700 5 osd.67 pg_epoch: 30732 pg[11.13( v 30566'128249589 lc 29631'128242596 (29574'128239552,30566'128249589] local-lis/les=30731/30732 n=36430 ec=15753/15753 lis/c=30731/28565 les/c/f=30732/28566/0 sis=30731) [65,66,67]/[67,65] backfill=[66] r=0 lpr=30731 pi=[28565,30731)/4 crt=30566'128249589 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped m=424 mbc={255={(1+0)=2021}}] target shard 66 from MIN debug -30> 2022-07-16T10:15:49.296+0000 7f39d2429700 10 monclient: get_auth_request con 0x559958619400 auth_method 0 debug -29> 2022-07-16T10:15:49.296+0000 7f39d2429700 10 monclient: get_auth_request con 0x559956eecc00 auth_method 0 debug -28> 2022-07-16T10:15:49.300+0000 7f39d2429700 10 monclient: get_auth_request con 0x55994f31d400 auth_method 0 debug -27> 2022-07-16T10:15:49.300+0000 7f39d1c28700 10 monclient: get_auth_request con 0x5599596cc400 auth_method 0 debug -26> 2022-07-16T10:15:49.300+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959cb8000 auth_method 0 debug -25> 2022-07-16T10:15:49.300+0000 7f39d2429700 10 monclient: get_auth_request con 0x5599596cd400 auth_method 0 debug -24> 2022-07-16T10:15:49.300+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959cb9400 auth_method 0 debug -23> 2022-07-16T10:15:49.300+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959cb8c00 auth_method 0 debug -22> 2022-07-16T10:15:49.304+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959e10000 auth_method 0 debug -21> 2022-07-16T10:15:49.304+0000 7f39d1c28700 10 monclient: get_auth_request con 
0x559959e11000 auth_method 0
debug -20> 2022-07-16T10:15:49.304+0000 7f39d1427700 10 monclient: get_auth_request con 0x559958d4a000 auth_method 0
debug -19> 2022-07-16T10:15:49.304+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959cb8800 auth_method 0
debug -18> 2022-07-16T10:15:49.308+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959cb9c00 auth_method 0
debug -17> 2022-07-16T10:15:49.312+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959fc8400 auth_method 0
debug -16> 2022-07-16T10:15:49.316+0000 7f39d2429700 10 monclient: get_auth_request con 0x5599596cc000 auth_method 0
debug -15> 2022-07-16T10:15:49.320+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959e10c00 auth_method 0
debug -14> 2022-07-16T10:15:49.324+0000 7f39d1427700 10 monclient: get_auth_request con 0x559958d4b000 auth_method 0
debug -13> 2022-07-16T10:15:49.344+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959e1dc00 auth_method 0
debug -12> 2022-07-16T10:15:49.344+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959d24000 auth_method 0
debug -11> 2022-07-16T10:15:49.344+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959900400 auth_method 0
debug -10> 2022-07-16T10:15:49.344+0000 7f39d2429700 10 monclient: get_auth_request con 0x559956eed400 auth_method 0
debug -9> 2022-07-16T10:15:49.344+0000 7f39d1c28700 10 monclient: get_auth_request con 0x55994f31d000 auth_method 0
debug -8> 2022-07-16T10:15:49.348+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959860400 auth_method 0
debug -7> 2022-07-16T10:15:49.356+0000 7f39d2429700 10 monclient: get_auth_request con 0x55995993b000 auth_method 0
debug -6> 2022-07-16T10:15:49.356+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559958d4bc00 auth_method 0
debug -5> 2022-07-16T10:15:49.356+0000 7f39d1427700 10 monclient: get_auth_request con 0x55995993b800 auth_method 0
debug -4> 2022-07-16T10:15:49.360+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959861800 auth_method 0
debug -3> 2022-07-16T10:15:49.360+0000 7f39d1c28700 10 monclient: get_auth_request con 0x5599413c4c00 auth_method 0
debug -2> 2022-07-16T10:15:49.368+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959e1cc00 auth_method 0
debug -1> 2022-07-16T10:15:49.376+0000 7f39b4d90700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0/rpm/el8/BUILD/ceph-17.2.0/src/osd/osd_internal_types.h: In function 'void ObjectContext::stop_block()' thread 7f39b4d90700 time 2022-07-16T10:15:49.377956+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0/rpm/el8/BUILD/ceph-17.2.0/src/osd/osd_internal_types.h: 175: FAILED ceph_assert(blocked)
ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x559936dd744e]
2: /usr/bin/ceph-osd(+0x5d766f) [0x559936dd766f]
3: (PrimaryLogPG::kick_object_context_blocked(std::shared_ptr<ObjectContext>)+0x38b) [0x55993700481b]
4: (PrimaryLogPG::process_copy_chunk(hobject_t, unsigned long, int)+0xc58) [0x5599370054c8]
5: (C_Copyfrom::finish(int)+0xbf) [0x5599370b7bbf]
6: (Context::complete(int)+0xd) [0x559936f1fd1d]
7: (Finisher::finisher_thread_entry()+0x19d) [0x5599375c6e6d]
8: /lib64/libpthread.so.0(+0x81ca) [0x7f39d64191ca]
9: clone()
debug 0> 2022-07-16T10:15:49.376+0000 7f39b4d90700 -1
*** Caught signal (Aborted) ** in thread 7f39b4d90700 thread_name:finisher ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable) 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f39d6423ce0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x559936dd74ac] 5: /usr/bin/ceph-osd(+0x5d766f) [0x559936dd766f] 6: (PrimaryLogPG::kick_object_context_blocked(std::shared_ptr<ObjectContext>)+0x38b) [0x55993700481b] 7: (PrimaryLogPG::process_copy_chunk(hobject_t, unsigned long, int)+0xc58) [0x5599370054c8] 8: (C_Copyfrom::finish(int)+0xbf) [0x5599370b7bbf] 9: (Context::complete(int)+0xd) [0x559936f1fd1d] 10: (Finisher::finisher_thread_entry()+0x19d) [0x5599375c6e6d] 11: /lib64/libpthread.so.0(+0x81ca) [0x7f39d64191ca] 12: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 rbd_pwl 0/ 5 journaler 0/ 5 objectcacher 0/ 5 immutable_obj_cache 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 rgw_sync 1/ 5 rgw_datacache 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 fuse 2/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite 0/ 5 seastore 0/ 5 seastore_onode 0/ 5 seastore_odata 0/ 5 seastore_omap 0/ 5 seastore_tm 0/ 5 seastore_cleaner 0/ 5 seastore_lba 0/ 5 seastore_cache 0/ 5 seastore_journal 0/ 5 seastore_device 0/ 5 alienstore 1/ 5 mclock -2/-2 (syslog threshold) 99/99 (stderr threshold) --- pthread ID / name mapping for recent threads --- 7f39aa57b700 / osd_srv_heartbt 7f39aad7c700 / tp_osd_tp 7f39ab57d700 / tp_osd_tp 7f39abd7e700 / tp_osd_tp 7f39ac57f700 / tp_osd_tp 7f39acd80700 / tp_osd_tp 7f39ad581700 / tp_osd_tp 7f39add82700 / tp_osd_tp 7f39ae583700 / tp_osd_tp 7f39aed84700 / tp_osd_tp 7f39af585700 / tp_osd_tp 7f39afd86700 / tp_osd_tp 7f39b0587700 / tp_osd_tp 7f39b0d88700 / tp_osd_tp 7f39b1589700 / tp_osd_tp 7f39b1d8a700 / tp_osd_tp 7f39b258b700 / tp_osd_tp 7f39b4d90700 / finisher 7f39bb59d700 / ms_dispatch 7f39bbd9e700 / ceph-osd 7f39bd393700 / safe_timer 7f39bdb94700 / cfin 7f39be395700 / bstore_kv_sync 7f39bfe1d700 / ms_dispatch 7f39c339f700 / bstore_mempool 7f39c4db0700 / rocksdb:low0 7f39c75b5700 / fn_anonymous 7f39c8db8700 / safe_timer 7f39cd5c1700 / safe_timer 7f39cec22700 / io_context_pool 7f39d0425700 / io_context_pool 7f39d0c26700 / admin_socket 7f39d1427700 / msgr-worker-2 7f39d1c28700 / msgr-worker-1 7f39d2429700 / msgr-worker-0 7f39d821f3c0 / ceph-osd max_recent 10000 max_new 10000 log_file /var/lib/ceph/crash/2022-07-16T10:15:49.383614Z_0a14c5d0-cbb3-4cbd-93dd-ae49c471c78c/log --- end dump of recent events ---
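For readers skimming the backtrace above: it shows a copy-from completion (C_Copyfrom::finish -> PrimaryLogPG::process_copy_chunk -> PrimaryLogPG::kick_object_context_blocked) ending in ObjectContext::stop_block(), whose ceph_assert(blocked) fails. As a rough orientation only, here is a minimal, self-contained sketch of the flag invariant such an assert enforces. This is my own simplification for illustration, not the Ceph source; the names are modelled on the symbols in the trace.

#include <cassert>

// Minimal stand-in for the blocking flag on Ceph's ObjectContext
// (osd/osd_internal_types.h); everything else is omitted.
struct ObjectContext {
  bool blocked = false;    // true while client ops on this object are held back

  void start_block() {
    assert(!blocked);      // may only block an unblocked object
    blocked = true;
  }
  void stop_block() {
    assert(blocked);       // the check that fails as "FAILED ceph_assert(blocked)"
    blocked = false;
  }
};

int main() {
  ObjectContext head_obc;
  head_obc.start_block();  // e.g. while a promotion/copy-from is in flight
  head_obc.stop_block();   // normal path: copy finished, object released
  head_obc.stop_block();   // releasing it a second time aborts, as in the log above
}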
debug -12> 2022-07-16T13:10:36.612+0000 7f051f64f700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/osd/MissingLoc.h: In function 'void MissingLoc::add_active_missing(const pg_missing_t&)' thread 7f051f64f700 time 2022-07-16T13:10:36.606949+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/osd/MissingLoc.h: 242: FAILED ceph_assert(0 == "unexpected need for missing item")
ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55d55ebd88b4]
2: /usr/bin/ceph-osd(+0x5d8ad5) [0x55d55ebd8ad5]
3: (MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1f2) [0x55d55f020cb2]
4: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x166a) [0x55d55efda2ba]
5: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x336) [0x55d55effbf76]
6: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Active, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb6) [0x55d55f02c8c6]
7: (boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55d55f02cd5a]
8: (boost::statechart::simple_state<PeeringState::WaitUpThru, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa9) [0x55d55f02a859]
9: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xb4) [0x55d55f0131a4]
10: (PeeringState::activate_map(PeeringCtx&)+0x122) [0x55d55efcb5a2]
11: (PG::handle_activate_map(PeeringCtx&)+0x127) [0x55d55ed9cd57]
12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x67c) [0x55d55ed0f0bc]
13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x55d55ed10c84]
14: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d55efa87a6]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d55ed02a48]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d55f40ac84]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d55f40c024]
18: /lib64/libpthread.so.0(+0x81ca) [0x7f0543cde1ca]
19: clone()
debug -11> 2022-07-16T13:10:36.620+0000 7f051ee4e700 5 osd.65
scrub-queue::remove_from_osd_queue removing pg[11.1a] failed. State was: not registered w/ OSD debug -10> 2022-07-16T13:10:36.624+0000 7f051a645700 5 osd.65 scrub-queue::remove_from_osd_queue removing pg[4.3] failed. State was: unregistering debug -9> 2022-07-16T13:10:36.624+0000 7f051a645700 5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.019136 3 0.002617 debug -8> 2022-07-16T13:10:36.624+0000 7f051a645700 5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved debug -7> 2022-07-16T13:10:36.624+0000 7f051e64d700 5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000032 1 0.000034 debug -6> 2022-07-16T13:10:36.624+0000 7f051e64d700 5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] enter Started/ReplicaActive/RepRecovering debug -5> 2022-07-16T13:10:36.624+0000 7f051f64f700 -1 *** Caught signal (Aborted) ** in thread 7f051f64f700 thread_name:tp_osd_tp ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable) 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f0543ce8ce0] 2: gsignal() 3: abort() 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x55d55ebd8912] 5: /usr/bin/ceph-osd(+0x5d8ad5) [0x55d55ebd8ad5] 6: (MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1f2) [0x55d55f020cb2] 7: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x166a) [0x55d55efda2ba] 8: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x336) [0x55d55effbf76] 9: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Active, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb6) [0x55d55f02c8c6] 10: (boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55d55f02cd5a] 11: (boost::statechart::simple_state<PeeringState::WaitUpThru, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa9) [0x55d55f02a859] 12: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xb4) [0x55d55f0131a4] 13: (PeeringState::activate_map(PeeringCtx&)+0x122) [0x55d55efcb5a2] 14: (PG::handle_activate_map(PeeringCtx&)+0x127) [0x55d55ed9cd57] 15: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x67c) [0x55d55ed0f0bc] 16: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x55d55ed10c84] 17: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d55efa87a6] 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d55ed02a48] 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d55f40ac84] 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d55f40c024] 21: /lib64/libpthread.so.0(+0x81ca) [0x7f0543cde1ca] 22: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. debug -4> 2022-07-16T13:10:36.624+0000 7f051e64d700 5 osd.65 scrub-queue::remove_from_osd_queue removing pg[11.13] failed. State was: unregistering debug -3> 2022-07-16T13:10:36.628+0000 7f051e64d700 5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.016345 3 0.006503 debug -2> 2022-07-16T13:10:36.628+0000 7f051e64d700 5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved debug -1> 2022-07-16T13:10:36.628+0000 7f051a645700 5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000036 1 0.000034 debug 0> 2022-07-16T13:10:36.628+0000 7f051a645700 5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] enter Started/ReplicaActive/RepRecovering --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 
mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 rbd_pwl 0/ 5 journaler 0/ 5 objectcacher 0/ 5 immutable_obj_cache 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 rgw_sync 1/ 5 rgw_datacache 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 4/ 5 memdb 1/ 5 fuse 2/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite 0/ 5 seastore 0/ 5 seastore_onode 0/ 5 seastore_odata 0/ 5 seastore_omap 0/ 5 seastore_tm 0/ 5 seastore_cleaner 0/ 5 seastore_lba 0/ 5 seastore_cache 0/ 5 seastore_journal 0/ 5 seastore_device 0/ 5 alienstore 1/ 5 mclock -2/-2 (syslog threshold) 99/99 (stderr threshold) --- pthread ID / name mapping for recent threads --- 7f0517e40700 / osd_srv_heartbt 7f0518641700 / tp_osd_tp 7f0518e42700 / tp_osd_tp 7f0519643700 / tp_osd_tp 7f0519e44700 / tp_osd_tp 7f051a645700 / tp_osd_tp 7f051ae46700 / tp_osd_tp 7f051b647700 / tp_osd_tp 7f051be48700 / tp_osd_tp 7f051c649700 / tp_osd_tp 7f051ce4a700 / tp_osd_tp 7f051d64b700 / tp_osd_tp 7f051de4c700 / tp_osd_tp 7f051e64d700 / tp_osd_tp 7f051ee4e700 / tp_osd_tp 7f051f64f700 / tp_osd_tp 7f051fe50700 / tp_osd_tp 7f0528e62700 / ms_dispatch 7f0529663700 / ceph-osd 7f052ac58700 / safe_timer 7f052b459700 / cfin 7f052d6e2700 / ms_dispatch 7f0530c64700 / bstore_mempool 7f0534e7a700 / fn_anonymous 7f053667d700 / safe_timer 7f053ae86700 / safe_timer 7f053c4e7700 / io_context_pool 7f053dcea700 / io_context_pool 7f053e4eb700 / admin_socket 7f053ecec700 / msgr-worker-2 7f053f4ed700 / msgr-worker-1 7f053fcee700 / msgr-worker-0 7f0545ae43c0 / ceph-osd max_recent 10000 max_new 10000 log_file /var/lib/ceph/crash/2022-07-16T13:10:36.631036Z_67936d50-2f11-4995-92f3-5b6d51820b3d/log --- end dump of recent events ---
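The second OSD dies in a different place: MissingLoc::add_active_missing() hits ceph_assert(0 == "unexpected need for missing item") while the PG activates during peering. As a hedged illustration of the kind of consistency check involved (a simplification from memory; the container name needed_missing and the exact control flow are assumptions, not verified 17.2.x source): while merging a missing set into the PG's recovery bookkeeping, an object that is already tracked must be tracked with the same "need" version, and the OSD prefers to abort over continuing with contradictory recovery metadata.

#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily simplified types: the object id is a string and a
// missing item carries only the version ("need") believed to be missing.
using hobject_t = std::string;
struct pg_missing_item { unsigned need = 0; };
using pg_missing_t = std::map<hobject_t, pg_missing_item>;

struct MissingLocSketch {
  std::map<hobject_t, pg_missing_item> needed_missing;  // assumed member name

  void add_active_missing(const pg_missing_t& missing) {
    for (const auto& [oid, item] : missing) {
      auto it = needed_missing.find(oid);
      if (it == needed_missing.end()) {
        needed_missing[oid] = item;                      // first report of this object
      } else if (it->second.need != item.need) {
        // Two sources disagree on which version of the object is missing;
        // the real code logs the item and aborts the daemon at this point.
        assert(!"unexpected need for missing item");
      }
    }
  }
};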
History
#1 Updated by Chris Kul about 1 year ago
BTW, the initial version was 17.2.0; we tried updating to 17.2.1 in the hope that this bug had been fixed, sadly without luck.
#2 Updated by Nitzan Mordechai about 1 year ago
- Assignee set to Nitzan Mordechai
#3 Updated by Nitzan Mordechai about 1 year ago
@Chris Kul, I'm trying to understand the sequence of failing OSDs. Can you please upload the logs of the OSDs that failed?
Is it a cache tier?
#4 Updated by Chris Kul about 1 year ago
Sadly, I don't have any logs anymore, as I had to destroy the Ceph cluster. Getting it back into working order was the top priority, and I was unable to reach anyone via IRC or the tracker (as it took a while to get access) to find out which files I needed to back up.
I can only remember that the OSDs holding the unavailable PGs were affected the most, but in fact all OSDs serving the cache were affected.
Yes, it was a cache tier: the writeback cache for the default pool "ceph".
#5 Updated by Nitzan Mordechai about 1 year ago
- Assignee changed from Nitzan Mordechai to Myoungwon Oh
Myoungwon Oh, can you please take a look?
#6 Updated by Myoungwon Oh about 1 year ago
Hm... at first glance, the OSD calls stop_block() in kick_object_context_blocked() on a head object that is already unblocked. I have no idea for now why this assert occurs, because head_obc in kick_object_context_blocked() cannot be interrupted by other tasks. I guess this might be related to rollback, but I need more logs to figure out the root cause.
#7 Updated by Chris Kul about 1 year ago
As I said, I don't have any more logs, as I had to bring the cluster back into a working state.
As this issue comes with complete data loss, do you see any other way to find a solution for it?
#8 Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to Need More Info
Moving into Need More Info :-( per Myoungwon Oh's comment.
#9 Updated by Josh Durgin about 1 year ago
It looks like this must be some inconsistency between the head_obc->blocked state and its presence in objects_blocked_on_snap_promotion.
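To make that concrete, the sketch below is my own reconstruction for illustration, not PrimaryLogPG code; only the map name objects_blocked_on_snap_promotion and the call names come from this ticket, the rest is assumed. It shows the two places the blocking information lives and how they can drift apart: if the PG-level map still holds an entry whose ObjectContext has already been unblocked, the kick path called from process_copy_chunk() invokes stop_block() on an unblocked context and hits exactly the assert from the first crash.

#include <cassert>
#include <map>
#include <memory>
#include <string>

struct ObjectContext {          // same simplified obc as in the earlier sketch
  bool blocked = false;
  void start_block() { assert(!blocked); blocked = true; }
  void stop_block()  { assert(blocked);  blocked = false; }
};
using ObjectContextRef = std::shared_ptr<ObjectContext>;
using hobject_t = std::string;  // stand-in object id

struct PGSketch {
  // PG-level bookkeeping of objects blocked while a promotion copies them
  // from the base pool into the cache tier (name taken from this comment).
  std::map<hobject_t, ObjectContextRef> objects_blocked_on_snap_promotion;

  void block_for_promotion(const hobject_t& oid, ObjectContextRef obc) {
    obc->start_block();                            // per-object state ...
    objects_blocked_on_snap_promotion[oid] = obc;  // ... and PG-level state
  }

  // Roughly what the copy-from completion does (cf. process_copy_chunk ->
  // kick_object_context_blocked in the backtrace).
  void kick_object_context_blocked(const hobject_t& oid) {
    auto it = objects_blocked_on_snap_promotion.find(oid);
    if (it == objects_blocked_on_snap_promotion.end())
      return;
    it->second->stop_block();   // asserts if the flag was already cleared
    objects_blocked_on_snap_promotion.erase(it);
  }
};

int main() {
  PGSketch pg;
  auto obc = std::make_shared<ObjectContext>();
  pg.block_for_promotion("head-object", obc);
  obc->blocked = false;  // flag cleared without removing the map entry:
                         // the inconsistency described in this comment
  pg.kick_object_context_blocked("head-object");  // -> FAILED ceph_assert(blocked)
}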