Bug #64373

Updated by Radoslaw Zarzynski 3 months ago

It looks like a watch timer can trigger a PG transaction on an already closed ObjectStore.
 This can happen when fast shutdown is enabled, BlueStore uses non-rotational drives and is being shut down, and some Watch is left deregistered.
 If the Watch timeout fires during the shutdown process, the PG attempts to issue a transaction against the store and the OSD segfaults.

 <pre> 
 2024-02-08T22:14:53.892-0600 7f045fcb3700 -1 osd.200 2472348 *** Got signal Terminated *** 
 2024-02-08T22:14:53.892-0600 7f045fcb3700 -1 osd.200 2472348 *** Immediate shutdown (osd_fast_shutdown=true) *** 
 2024-02-08T22:14:53.892-0600 7f045fcb3700    0 osd.200 2472348 Fast Shutdown: - cct->_conf->osd_fast_shutdown = 1, null-fm = 1 
 2024-02-08T22:14:53.892-0600 7f045fcb3700 -1 osd.200 2472348 *** Immediate shutdown (osd_fast_shutdown=true) *** 
 2024-02-08T22:14:53.892-0600 7f045fcb3700    0 osd.200 2472348 prepare_to_stop telling mon we are shutting down and dead 
 Stopping Ceph object storage daemon osd.200... 
 2024-02-08T22:14:54.132-0600 7f0453bb0700    0 osd.200 2472348 got_stop_ack starting shutdown 
 2024-02-08T22:14:54.132-0600 7f045fcb3700    0 osd.200 2472348 prepare_to_stop starting shutdown 
 2024-02-08T22:15:01.956-0600 7f045fcb3700    1 bdev(0x55dc906de400 /var/lib/ceph/osd/ceph-200/block) close 
 *** Caught signal (Segmentation fault) ** 
 in thread 7f0447e1f700 thread_name:safe_timer 
 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable) 
  1: /lib64/libpthread.so.0(+0x168c0) [0x7f0465a0c8c0] 
  2: (BlueStore::_txc_create(BlueStore::Collection*, BlueStore::OpSequencer*, std::__cxx11::list<Context*, std::allocator<Context*> >*, boost::intrusive_ptr<TrackedOp>)+0x406) [0x55dc8dc10816] 
  3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x21f) [0x55dc8dc9065f] 
  4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x4f) [0x55dc8d87255f] 
  5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x7c4) [0x55dc8daa60d4] 
  6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55dc8d7f043d] 
  7: (PrimaryLogPG::simple_opc_submit(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >)+0x56) [0x55dc8d7f2206] 
  8: (PrimaryLogPG::handle_watch_timeout(std::shared_ptr<Watch>)+0x988) [0x55dc8d7f48b8] 
  9: (HandleWatchTimeout::complete(int)+0x11a) [0x55dc8d75f3ca] 
  10: (CommonSafeTimer<std::mutex>::timer_thread()+0x128) [0x55dc8dddc798] 
  11: (CommonSafeTimerThread<std::mutex>::entry()+0xd) [0x55dc8dddd7dd] 
  12: /lib64/libpthread.so.0(+0xa6ea) [0x7f0465a006ea] 
  13: clone() 
 </pre> 
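
In the trace above, the safe_timer thread runs HandleWatchTimeout after the block device has already been closed. The following is a minimal single-threaded sketch of that ordering problem, not Ceph code and not necessarily the fix adopted upstream. @FakeStore@, @FakeTimer@, @ShutdownResult@ and @simulate_shutdown@ are hypothetical names invented for illustration. It assumes that one way to avoid the late transaction is to cancel pending timer events before closing the store.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an object store; this is NOT BlueStore's interface.
class FakeStore {
public:
  // Rejects the transaction once the store is closed instead of touching
  // state that has already been torn down.
  bool queue_transaction() {
    std::lock_guard<std::mutex> l(lock_);
    if (!open_) return false;
    ++committed_;
    return true;
  }
  void close() {
    std::lock_guard<std::mutex> l(lock_);
    open_ = false;
  }
private:
  std::mutex lock_;
  bool open_ = true;
  int committed_ = 0;
};

// Hypothetical timer that only holds pending callbacks (no real thread).
class FakeTimer {
public:
  void add_event(std::function<void()> cb) {
    std::lock_guard<std::mutex> l(lock_);
    pending_.push_back(std::move(cb));
  }
  // Drops every pending callback without running it.
  void cancel_all() {
    std::lock_guard<std::mutex> l(lock_);
    pending_.clear();
  }
  // Runs every pending callback outside the lock; returns how many ran.
  int fire_all() {
    std::vector<std::function<void()>> to_run;
    {
      std::lock_guard<std::mutex> l(lock_);
      to_run.swap(pending_);
    }
    for (auto& cb : to_run) cb();
    return static_cast<int>(to_run.size());
  }
private:
  std::mutex lock_;
  std::vector<std::function<void()>> pending_;
};

struct ShutdownResult {
  int fired;     // timeout callbacks that ran after the store was closed
  int rejected;  // of those, how many tried to use the closed store
};

// Simulates a shutdown with one pending watch timeout.
ShutdownResult simulate_shutdown(bool cancel_timers_first) {
  FakeStore store;
  FakeTimer timer;
  int rejected = 0;
  // Stand-in for a watch timeout that issues a transaction.
  timer.add_event([&] {
    if (!store.queue_transaction()) ++rejected;
  });
  if (cancel_timers_first) timer.cancel_all();
  store.close();
  int fired = timer.fire_all();  // a late timeout, as in the trace
  return ShutdownResult{fired, rejected};
}
```

Calling @simulate_shutdown(false)@ models the order seen in the log (store closed, then the timeout fires), while @simulate_shutdown(true)@ models cancelling the timer first. In the real OSD the late callback dereferences torn-down state and crashes rather than getting a clean rejection.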
