Bug #56661

Quincy: OSD crashing one after another with data loss with ceph_assert_fail

Added by Chris Kul almost 2 years ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
Category:
Scrub/Repair
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two weeks after an upgrade from an Octopus setup to Quincy, the SSD pool reported one OSD down in the middle of the night while the weekly cleanup cron jobs of the application on top were running. The OSD was a datacenter NVMe SSD (8x4 TB) with a 3-replica writeback cache overlay pool on it.

Strangely, immediately after manually restarting the affected daemon, another OSD on a different machine went down while the first one stayed up. The service started flapping after a few minutes, which resulted in a third OSD down while the pool tried to recover. The issue seemed to get worse the more IO was issued on the pool. After a few hours the writeback pool had lost data because too many OSDs were offline. Luckily only the SSDs were affected; the HDD OSDs on the same machines were still online.

Deployed with cephadm on Debian 11 with Docker.

As this is a production Ceph cluster, I was unable to keep it in that state. At the moment the SSDs are completely removed from the cluster and not in use, as we cannot be sure that the HDD OSDs will not hit this issue as well.

Two logs from two OSDs:

debug    -42> 2022-07-16T10:15:48.508+0000 7f39bb59d700  3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732]
debug    -41> 2022-07-16T10:15:48.512+0000 7f39bb59d700  3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732]
debug    -40> 2022-07-16T10:15:48.552+0000 7f39bb59d700  3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732]
debug    -39> 2022-07-16T10:15:48.736+0000 7f39c339f700  5 prioritycache tune_memory target: 4294967296 mapped: 530841600 unmapped: 11681792 heap: 542523392 old mem: 2845415832 new mem: 2845415832
debug    -38> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: tick
debug    -37> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2022-07-16T10:15:18.775136+0000)
debug    -36> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 log_client  log_queue is 1 last_log 3 sent 2 num 1 unsent 1 sending 1
debug    -35> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 log_client  will send 2022-07-16T10:15:48.508321+0000 osd.67 (osd.67) 3 : cluster [INF] 11.13 continuing backfill to osd.66 from (29574'128239589,30566'128249589] MIN to 30566'128249589
debug    -34> 2022-07-16T10:15:48.768+0000 7f39bd393700 10 monclient: _send_mon_message to mon.cle02-18 at v2:172.16.0.18:3300/0
debug    -33> 2022-07-16T10:15:49.284+0000 7f39bb59d700  3 osd.67 30732 handle_osd_map epochs [30732,30732], i have 30732, src has [29161,30732]
debug    -32> 2022-07-16T10:15:49.296+0000 7f39b0d88700  5 osd.67 pg_epoch: 30732 pg[11.13( v 30566'128249589 lc 29631'128242596 (29574'128239552,30566'128249589] local-lis/les=30731/30732 n=36430 ec=15753/15753 lis/c=30731/28565 les/c/f=30732/28566/0 sis=30731) [65,66,67]/[67,65] backfill=[66] r=0 lpr=30731 pi=[28565,30731)/4 crt=30566'128249589 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped m=424 mbc={255={(1+0)=2021}}] on_activate_complete: bft=66 from MIN
debug    -31> 2022-07-16T10:15:49.296+0000 7f39b0d88700  5 osd.67 pg_epoch: 30732 pg[11.13( v 30566'128249589 lc 29631'128242596 (29574'128239552,30566'128249589] local-lis/les=30731/30732 n=36430 ec=15753/15753 lis/c=30731/28565 les/c/f=30732/28566/0 sis=30731) [65,66,67]/[67,65] backfill=[66] r=0 lpr=30731 pi=[28565,30731)/4 crt=30566'128249589 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped m=424 mbc={255={(1+0)=2021}}] target shard 66 from MIN
debug    -30> 2022-07-16T10:15:49.296+0000 7f39d2429700 10 monclient: get_auth_request con 0x559958619400 auth_method 0
debug    -29> 2022-07-16T10:15:49.296+0000 7f39d2429700 10 monclient: get_auth_request con 0x559956eecc00 auth_method 0
debug    -28> 2022-07-16T10:15:49.300+0000 7f39d2429700 10 monclient: get_auth_request con 0x55994f31d400 auth_method 0
debug    -27> 2022-07-16T10:15:49.300+0000 7f39d1c28700 10 monclient: get_auth_request con 0x5599596cc400 auth_method 0
debug    -26> 2022-07-16T10:15:49.300+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959cb8000 auth_method 0
debug    -25> 2022-07-16T10:15:49.300+0000 7f39d2429700 10 monclient: get_auth_request con 0x5599596cd400 auth_method 0
debug    -24> 2022-07-16T10:15:49.300+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959cb9400 auth_method 0
debug    -23> 2022-07-16T10:15:49.300+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959cb8c00 auth_method 0
debug    -22> 2022-07-16T10:15:49.304+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959e10000 auth_method 0
debug    -21> 2022-07-16T10:15:49.304+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959e11000 auth_method 0
debug    -20> 2022-07-16T10:15:49.304+0000 7f39d1427700 10 monclient: get_auth_request con 0x559958d4a000 auth_method 0
debug    -19> 2022-07-16T10:15:49.304+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959cb8800 auth_method 0
debug    -18> 2022-07-16T10:15:49.308+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959cb9c00 auth_method 0
debug    -17> 2022-07-16T10:15:49.312+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959fc8400 auth_method 0
debug    -16> 2022-07-16T10:15:49.316+0000 7f39d2429700 10 monclient: get_auth_request con 0x5599596cc000 auth_method 0
debug    -15> 2022-07-16T10:15:49.320+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959e10c00 auth_method 0
debug    -14> 2022-07-16T10:15:49.324+0000 7f39d1427700 10 monclient: get_auth_request con 0x559958d4b000 auth_method 0
debug    -13> 2022-07-16T10:15:49.344+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959e1dc00 auth_method 0
debug    -12> 2022-07-16T10:15:49.344+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559959d24000 auth_method 0
debug    -11> 2022-07-16T10:15:49.344+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959900400 auth_method 0
debug    -10> 2022-07-16T10:15:49.344+0000 7f39d2429700 10 monclient: get_auth_request con 0x559956eed400 auth_method 0
debug     -9> 2022-07-16T10:15:49.344+0000 7f39d1c28700 10 monclient: get_auth_request con 0x55994f31d000 auth_method 0
debug     -8> 2022-07-16T10:15:49.348+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959860400 auth_method 0
debug     -7> 2022-07-16T10:15:49.356+0000 7f39d2429700 10 monclient: get_auth_request con 0x55995993b000 auth_method 0
debug     -6> 2022-07-16T10:15:49.356+0000 7f39d1c28700 10 monclient: get_auth_request con 0x559958d4bc00 auth_method 0
debug     -5> 2022-07-16T10:15:49.356+0000 7f39d1427700 10 monclient: get_auth_request con 0x55995993b800 auth_method 0
debug     -4> 2022-07-16T10:15:49.360+0000 7f39d2429700 10 monclient: get_auth_request con 0x559959861800 auth_method 0
debug     -3> 2022-07-16T10:15:49.360+0000 7f39d1c28700 10 monclient: get_auth_request con 0x5599413c4c00 auth_method 0
debug     -2> 2022-07-16T10:15:49.368+0000 7f39d1427700 10 monclient: get_auth_request con 0x559959e1cc00 auth_method 0
debug     -1> 2022-07-16T10:15:49.376+0000 7f39b4d90700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0/rpm/el8/BUILD/ceph-17.2.0/src/osd/osd_internal_types.h: In function 'void ObjectContext::stop_block()' thread 7f39b4d90700 time 2022-07-16T10:15:49.377956+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.0/rpm/el8/BUILD/ceph-17.2.0/src/osd/osd_internal_types.h: 175: FAILED ceph_assert(blocked)

 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x559936dd744e]
 2: /usr/bin/ceph-osd(+0x5d766f) [0x559936dd766f]
 3: (PrimaryLogPG::kick_object_context_blocked(std::shared_ptr<ObjectContext>)+0x38b) [0x55993700481b]
 4: (PrimaryLogPG::process_copy_chunk(hobject_t, unsigned long, int)+0xc58) [0x5599370054c8]
 5: (C_Copyfrom::finish(int)+0xbf) [0x5599370b7bbf]
 6: (Context::complete(int)+0xd) [0x559936f1fd1d]
 7: (Finisher::finisher_thread_entry()+0x19d) [0x5599375c6e6d]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f39d64191ca]
 9: clone()

debug      0> 2022-07-16T10:15:49.376+0000 7f39b4d90700 -1 *** Caught signal (Aborted) **
 in thread 7f39b4d90700 thread_name:finisher

 ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f39d6423ce0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x559936dd74ac]
 5: /usr/bin/ceph-osd(+0x5d766f) [0x559936dd766f]
 6: (PrimaryLogPG::kick_object_context_blocked(std::shared_ptr<ObjectContext>)+0x38b) [0x55993700481b]
 7: (PrimaryLogPG::process_copy_chunk(hobject_t, unsigned long, int)+0xc58) [0x5599370054c8]
 8: (C_Copyfrom::finish(int)+0xbf) [0x5599370b7bbf]
 9: (Context::complete(int)+0xd) [0x559936f1fd1d]
 10: (Finisher::finisher_thread_entry()+0x19d) [0x5599375c6e6d]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f39d64191ca]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f39aa57b700 / osd_srv_heartbt
  7f39aad7c700 / tp_osd_tp
  7f39ab57d700 / tp_osd_tp
  7f39abd7e700 / tp_osd_tp
  7f39ac57f700 / tp_osd_tp
  7f39acd80700 / tp_osd_tp
  7f39ad581700 / tp_osd_tp
  7f39add82700 / tp_osd_tp
  7f39ae583700 / tp_osd_tp
  7f39aed84700 / tp_osd_tp
  7f39af585700 / tp_osd_tp
  7f39afd86700 / tp_osd_tp
  7f39b0587700 / tp_osd_tp
  7f39b0d88700 / tp_osd_tp
  7f39b1589700 / tp_osd_tp
  7f39b1d8a700 / tp_osd_tp
  7f39b258b700 / tp_osd_tp
  7f39b4d90700 / finisher
  7f39bb59d700 / ms_dispatch
  7f39bbd9e700 / ceph-osd
  7f39bd393700 / safe_timer
  7f39bdb94700 / cfin
  7f39be395700 / bstore_kv_sync
  7f39bfe1d700 / ms_dispatch
  7f39c339f700 / bstore_mempool
  7f39c4db0700 / rocksdb:low0
  7f39c75b5700 / fn_anonymous
  7f39c8db8700 / safe_timer
  7f39cd5c1700 / safe_timer
  7f39cec22700 / io_context_pool
  7f39d0425700 / io_context_pool
  7f39d0c26700 / admin_socket
  7f39d1427700 / msgr-worker-2
  7f39d1c28700 / msgr-worker-1
  7f39d2429700 / msgr-worker-0
  7f39d821f3c0 / ceph-osd
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2022-07-16T10:15:49.383614Z_0a14c5d0-cbb3-4cbd-93dd-ae49c471c78c/log
--- end dump of recent events ---

debug    -12> 2022-07-16T13:10:36.612+0000 7f051f64f700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/osd/MissingLoc.h: In function 'void MissingLoc::add_active_missing(const pg_missing_t&)' thread 7f051f64f700 time 2022-07-16T13:10:36.606949+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.1/rpm/el8/BUILD/ceph-17.2.1/src/osd/MissingLoc.h: 242: FAILED ceph_assert(0 == "unexpected need for missing item")

 ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55d55ebd88b4]
 2: /usr/bin/ceph-osd(+0x5d8ad5) [0x55d55ebd8ad5]
 3: (MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1f2) [0x55d55f020cb2]
 4: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x166a) [0x55d55efda2ba]
 5: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x336) [0x55d55effbf76]
 6: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Active, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb6) [0x55d55f02c8c6]
 7: (boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55d55f02cd5a]
 8: (boost::statechart::simple_state<PeeringState::WaitUpThru, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa9) [0x55d55f02a859]
 9: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xb4) [0x55d55f0131a4]
 10: (PeeringState::activate_map(PeeringCtx&)+0x122) [0x55d55efcb5a2]
 11: (PG::handle_activate_map(PeeringCtx&)+0x127) [0x55d55ed9cd57]
 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x67c) [0x55d55ed0f0bc]
 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x55d55ed10c84]
 14: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d55efa87a6]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d55ed02a48]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d55f40ac84]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d55f40c024]
 18: /lib64/libpthread.so.0(+0x81ca) [0x7f0543cde1ca]
 19: clone()

debug    -11> 2022-07-16T13:10:36.620+0000 7f051ee4e700  5 osd.65 scrub-queue::remove_from_osd_queue removing pg[11.1a] failed. State was: not registered w/ OSD
debug    -10> 2022-07-16T13:10:36.624+0000 7f051a645700  5 osd.65 scrub-queue::remove_from_osd_queue removing pg[4.3] failed. State was: unregistering
debug     -9> 2022-07-16T13:10:36.624+0000 7f051a645700  5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.019136 3 0.002617
debug     -8> 2022-07-16T13:10:36.624+0000 7f051a645700  5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
debug     -7> 2022-07-16T13:10:36.624+0000 7f051e64d700  5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000032 1 0.000034
debug     -6> 2022-07-16T13:10:36.624+0000 7f051e64d700  5 osd.65 pg_epoch: 32363 pg[4.3( v 32344'268028 lc 32344'267973 (29553'257807,32344'268028] local-lis/les=32239/32240 n=8894 ec=81/81 lis/c=32362/32239 les/c/f=32363/32240/0 sis=32362) [65,38,43]/[38,43] r=-1 lpr=32362 pi=[32239,32362)/1 luod=0'0 crt=32344'268028 lcod 0'0 mlcod 0'0 active+remapped m=2 mbc={}] enter Started/ReplicaActive/RepRecovering
debug     -5> 2022-07-16T13:10:36.624+0000 7f051f64f700 -1 *** Caught signal (Aborted) **
 in thread 7f051f64f700 thread_name:tp_osd_tp

 ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12ce0) [0x7f0543ce8ce0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x55d55ebd8912]
 5: /usr/bin/ceph-osd(+0x5d8ad5) [0x55d55ebd8ad5]
 6: (MissingLoc::add_active_missing(pg_missing_set<false> const&)+0x1f2) [0x55d55f020cb2]
 7: (PeeringState::activate(ceph::os::Transaction&, unsigned int, PeeringCtxWrapper&)+0x166a) [0x55d55efda2ba]
 8: (PeeringState::Active::Active(boost::statechart::state<PeeringState::Active, PeeringState::Primary, PeeringState::Activating, (boost::statechart::history_mode)0>::my_context)+0x336) [0x55d55effbf76]
 9: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PeeringState::Active, PeeringState::PeeringMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb6) [0x55d55f02c8c6]
 10: (boost::statechart::simple_state<PeeringState::Peering, PeeringState::Primary, PeeringState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x16a) [0x55d55f02cd5a]
 11: (boost::statechart::simple_state<PeeringState::WaitUpThru, PeeringState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xa9) [0x55d55f02a859]
 12: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xb4) [0x55d55f0131a4]
 13: (PeeringState::activate_map(PeeringCtx&)+0x122) [0x55d55efcb5a2]
 14: (PG::handle_activate_map(PeeringCtx&)+0x127) [0x55d55ed9cd57]
 15: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x67c) [0x55d55ed0f0bc]
 16: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x55d55ed10c84]
 17: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55d55efa87a6]
 18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xaf8) [0x55d55ed02a48]
 19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55d55f40ac84]
 20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55d55f40c024]
 21: /lib64/libpthread.so.0(+0x81ca) [0x7f0543cde1ca]
 22: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

debug     -4> 2022-07-16T13:10:36.624+0000 7f051e64d700  5 osd.65 scrub-queue::remove_from_osd_queue removing pg[11.13] failed. State was: unregistering
debug     -3> 2022-07-16T13:10:36.628+0000 7f051e64d700  5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] exit Started/ReplicaActive/RepNotRecovering 0.016345 3 0.006503
debug     -2> 2022-07-16T13:10:36.628+0000 7f051e64d700  5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] enter Started/ReplicaActive/RepWaitRecoveryReserved
debug     -1> 2022-07-16T13:10:36.628+0000 7f051a645700  5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] exit Started/ReplicaActive/RepWaitRecoveryReserved 0.000036 1 0.000034
debug      0> 2022-07-16T13:10:36.628+0000 7f051a645700  5 osd.65 pg_epoch: 32363 pg[11.13( v 32359'128288072 lc 32056'128286111 (31702'128276029,32359'128288072] local-lis/les=32035/32036 n=52993 ec=15753/15753 lis/c=32362/32035 les/c/f=32363/32037/0 sis=32362) [65,22,67]/[22,67] r=-1 lpr=32362 pi=[32035,32362)/1 luod=0'0 crt=32359'128288072 lcod 0'0 mlcod 0'0 active+remapped m=862 mbc={}] enter Started/ReplicaActive/RepRecovering
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_cleaner
   0/ 5 seastore_lba
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 alienstore
   1/ 5 mclock
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f0517e40700 / osd_srv_heartbt
  7f0518641700 / tp_osd_tp
  7f0518e42700 / tp_osd_tp
  7f0519643700 / tp_osd_tp
  7f0519e44700 / tp_osd_tp
  7f051a645700 / tp_osd_tp
  7f051ae46700 / tp_osd_tp
  7f051b647700 / tp_osd_tp
  7f051be48700 / tp_osd_tp
  7f051c649700 / tp_osd_tp
  7f051ce4a700 / tp_osd_tp
  7f051d64b700 / tp_osd_tp
  7f051de4c700 / tp_osd_tp
  7f051e64d700 / tp_osd_tp
  7f051ee4e700 / tp_osd_tp
  7f051f64f700 / tp_osd_tp
  7f051fe50700 / tp_osd_tp
  7f0528e62700 / ms_dispatch
  7f0529663700 / ceph-osd
  7f052ac58700 / safe_timer
  7f052b459700 / cfin
  7f052d6e2700 / ms_dispatch
  7f0530c64700 / bstore_mempool
  7f0534e7a700 / fn_anonymous
  7f053667d700 / safe_timer
  7f053ae86700 / safe_timer
  7f053c4e7700 / io_context_pool
  7f053dcea700 / io_context_pool
  7f053e4eb700 / admin_socket
  7f053ecec700 / msgr-worker-2
  7f053f4ed700 / msgr-worker-1
  7f053fcee700 / msgr-worker-0
  7f0545ae43c0 / ceph-osd
  max_recent     10000
  max_new        10000
  log_file /var/lib/ceph/crash/2022-07-16T13:10:36.631036Z_67936d50-2f11-4995-92f3-5b6d51820b3d/log
--- end dump of recent events ---
Actions #1

Updated by Chris Kul almost 2 years ago

BTW, the initial version was 17.2.0; we tried updating to 17.2.1 in the hope that this bug had been fixed, sadly without luck.

Actions #2

Updated by Nitzan Mordechai almost 2 years ago

  • Assignee set to Nitzan Mordechai
Actions #3

Updated by Nitzan Mordechai almost 2 years ago

@Chris Kul, I'm trying to understand the sequence of failing OSDs. Can you please upload the logs of the OSDs that failed?
Is it a cache tier?

Actions #4

Updated by Chris Kul almost 2 years ago

Sadly I don't have any logs anymore, as I had to destroy the Ceph cluster. Getting it back into working order was top priority, and I was unable to reach anyone via IRC or the tracker (as it took a while to get access) to find out which files I needed to back up.

I can only remember that the OSDs holding the unavailable PGs were affected the most, but in fact all OSDs that served the cache were affected.

Yes, it was a cache tier: the writeback cache for the default pool "ceph".

Actions #5

Updated by Nitzan Mordechai almost 2 years ago

  • Assignee changed from Nitzan Mordechai to Myoungwon Oh

Myoungwon Oh, can you please take a look?

Actions #6

Updated by Myoungwon Oh over 1 year ago

Hm... at first glance, the OSD calls stop_block() in kick_object_context_blocked() on a head object that is already stopped. I have no idea for now why this assert occurs, because head_obc in kick_object_context_blocked() cannot be interrupted by other tasks. I guess this might be related to rollback, but I need more logs to figure out the root cause.
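
For reference, the first assert (FAILED ceph_assert(blocked) in ObjectContext::stop_block(), osd_internal_types.h:175) guards a simple blocked flag on the object context. Below is a minimal sketch of the pattern the backtrace implies; the method and flag names are taken from the trace and the assert message, everything else is an approximation rather than the actual Ceph source:

  // Approximate reconstruction of the blocked-flag handling implied by the assert.
  // The real code uses ceph_assert(); plain assert() keeps the sketch self-contained.
  #include <cassert>

  struct ObjectContextSketch {
    bool blocked = false;     // set while an in-flight operation (e.g. a copy/promote) blocks the object

    void start_block() {
      assert(!blocked);       // must not already be blocked
      blocked = true;
    }

    void stop_block() {
      assert(blocked);        // the assertion that fired: stop_block() was called on a
      blocked = false;        // head object that was not (or no longer) blocked
    }
  };

Frames 3-4 of the first backtrace show kick_object_context_blocked() being reached from process_copy_chunk() when a copy finishes, so the crash indicates stop_block() was invoked for an object whose blocked flag had already been cleared, as described above.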

Actions #7

Updated by Chris Kul over 1 year ago

As I said, I don't have any more logs, since I had to bring the cluster back into a working state.

As this issue comes with complete data loss, do you see any other way to find a solution for it?

Actions #8

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Need More Info

Moving to Need More Info :-( per Myoungwon Oh's comment.

Actions #9

Updated by Josh Durgin over 1 year ago

It looks like this must be some inconsistency between the head_obc->blocked state and its presence in objects_blocked_on_snap_promotion.
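
In concrete terms, a hypothetical sketch of the invariant this describes follows; the map and flag names mirror the comment and the backtraces above, while the surrounding structure is illustrative only and not the actual PrimaryLogPG code:

  #include <cassert>
  #include <map>
  #include <memory>
  #include <string>

  // Illustrative only: an object can be registered as blocked on snap promotion
  // while the head object context's 'blocked' flag no longer agrees, which is the
  // kind of inconsistency that would trip ceph_assert(blocked) in stop_block().
  struct ObcSketch { bool blocked = false; };

  struct PgSketch {
    // object id -> head object context waiting for a promotion to complete
    std::map<std::string, std::shared_ptr<ObcSketch>> objects_blocked_on_snap_promotion;

    void kick_object_context_blocked(const std::string& oid) {
      auto it = objects_blocked_on_snap_promotion.find(oid);
      if (it == objects_blocked_on_snap_promotion.end())
        return;                     // nothing waiting on this object
      auto head_obc = it->second;
      assert(head_obc->blocked);    // invariant: a map entry implies blocked == true;
      head_obc->blocked = false;    // the reported crash suggests this was violated
      objects_blocked_on_snap_promotion.erase(it);
    }
  };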
