Bug #21174

OSD crash: 903: FAILED assert(objiter->second->version > last_divergent_update)

Added by Martin Millnert about 2 years ago. Updated 4 months ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version:
Start date: 08/30/2017
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

I've set up a CephFS erasure-coded pool on a small cluster consisting of 5 BlueStore OSDs.
The pools were created as follows:

ceph osd pool create cephfs_metadata 160 160 replicated
ceph osd pool create cephfs_data 160 160 erasure ecpool ec
ceph osd pool set cephfs_data allow_ec_overwrites true
ceph fs new cephfs cephfs_metadata cephfs_data

I started copying files onto the CephFS, and the OSDs have now started crashing in endless loops. The cluster is unavailable (not critical for me, but it would be for a "live cluster").
The log, available at https://martin.millnert.se/files/cephfs_ec/ceph-osd.1.log.all.gz, shows the crash occurring in the vicinity of a lot of output about "omap" operations.

The documentation at http://docs.ceph.com/docs/master/rados/operations/erasure-code/ states that erasure-coded pools do not support omap operations, which is why special care has to be taken with RBD.
For CephFS, it simply states: "For Cephfs, using an erasure coded pool means setting that pool in a file layout." with a link to the section on CephFS file layouts: http://docs.ceph.com/docs/master/cephfs/file-layouts/
The file layouts section does not reciprocate this link, i.e. the logic and context of using erasure-coded pools with CephFS are not explained further there.

So, provided I've made a user error here and there is no other bug causing my OSDs to crash, I think it would be wise to make the documentation on how to use EC pools for CephFS more explicit.

Furthermore, if it is indeed illegal to create a cephfs using the command I did, i.e. "ceph fs new cephfs <replicated_metadata_pool> <erasure_coded_data_pool>", the code should probably detect and reject that combination, to avoid cluster-down states further down the road.


Related issues

Related to RADOS - Bug #39023: osd/PGLog: preserve original_crt to check rollbackability Resolved 03/28/2019
Related to RADOS - Bug #16279: assert(objiter->second->version > last_divergent_update) failed Closed 06/14/2016

History

#1 Updated by John Spray about 2 years ago

  • Subject changed from CephFS erasure coded pool not explained in documentation to OSD crash using EC pool for CephFS
ceph-osd.1.log.1:   -18> 2017-08-30 06:24:40.454664 7ff53aa51700 30 bluestore.OnodeSpace(0x5589cd48a150 in 0x5589c09275e0) lookup
ceph-osd.1.log.1:   -17> 2017-08-30 06:24:40.454667 7ff53aa51700 30 bluestore.OnodeSpace(0x5589cd48a150 in 0x5589c09275e0) lookup 4#9:41000000::::head# hit 0x5589e0bb6400
ceph-osd.1.log.1:   -16> 2017-08-30 06:24:40.454671 7ff53aa51700 15 bluestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 9.82s4_head 4#9:41000000::::head#
ceph-osd.1.log.1:   -15> 2017-08-30 06:24:40.454674 7ff53aa51700 30 bluestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys  0x0000000000000b96'._biginfo' <- _biginfo
ceph-osd.1.log.1:   -14> 2017-08-30 06:24:40.454680 7ff53aa51700 30 bluestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys  0x0000000000000b96'._epoch' <- _epoch
ceph-osd.1.log.1:   -13> 2017-08-30 06:24:40.454683 7ff53aa51700 30 bluestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys  0x0000000000000b96'._info' <- _info
ceph-osd.1.log.1:   -12> 2017-08-30 06:24:40.454686 7ff53aa51700 10 bluestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 9.82s4_head 4#9:41000000::::head# = 0
ceph-osd.1.log.1:   -11> 2017-08-30 06:24:40.454690 7ff53aa51700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_calc_cost 0x558a22711440 cost 671619 (1 ios * 670000 + 1619 bytes)
ceph-osd.1.log.1:   -10> 2017-08-30 06:24:40.454693 7ff53aa51700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes txc 0x558a22711440 onodes  shared_blobs 
ceph-osd.1.log.1:    -9> 2017-08-30 06:24:40.454695 7ff53aa51700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_finalize_kv txc 0x558a22711440 allocated 0x[] released 0x[]
ceph-osd.1.log.1:    -8> 2017-08-30 06:24:40.454699 7ff53aa51700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_state_proc txc 0x558a22711440 prepare
ceph-osd.1.log.1:    -7> 2017-08-30 06:24:40.454702 7ff53aa51700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_finish_io 0x558a22711440
ceph-osd.1.log.1:    -6> 2017-08-30 06:24:40.454704 7ff53aa51700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_state_proc txc 0x558a22711440 io_done
ceph-osd.1.log.1:    -5> 2017-08-30 06:24:40.454713 7ff53aa51700  5 osd.1 pg_epoch: 8479 pg[9.9as1( v 7143'1973 lc 7073'1968 (6871'300,7143'1973] local-lis/les=8435/8436 n=1863 ec=216/216 lis/c 8435/6886 les/c/f 8436/6888/0 8478/8478/8451) [4,1,2147483647,3,5] r=1 lpr=8478 pi=[6886,8478)/5 crt=7143'1973 lcod 0'0 unknown NOTIFY m=5] exit Started/Stray 0.590322 8 0.000340
ceph-osd.1.log.1:    -4> 2017-08-30 06:24:40.454732 7ff53aa51700  5 osd.1 pg_epoch: 8479 pg[9.9as1( v 7143'1973 lc 7073'1968 (6871'300,7143'1973] local-lis/les=8435/8436 n=1863 ec=216/216 lis/c 8435/6886 les/c/f 8436/6888/0 8478/8478/8451) [4,1,2147483647,3,5] r=1 lpr=8478 pi=[6886,8478)/5 crt=7143'1973 lcod 0'0 unknown NOTIFY m=5] enter Started/ReplicaActive
ceph-osd.1.log.1:    -3> 2017-08-30 06:24:40.454737 7ff54f292700  5 -- 192.168.121.12:6818/30788 >> 192.168.121.12:6802/30664 conn(0x5589d6d47000 :6818 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7 cs=1 l=0). rx osd.5 seq 263 0x558a2254b000 pg_info((query:8479 sent:8479 5.72( v 213'5627 (154'4127,213'5627] local-lis/les=8478/8479 n=428 ec=99/99 lis/c 8478/8478 les/c/f 8479/8479/0 8478/8478/8471))=(empty) epoch 8479) v5
ceph-osd.1.log.1:    -2> 2017-08-30 06:24:40.454745 7ff53aa51700  5 osd.1 pg_epoch: 8479 pg[9.9as1( v 7143'1973 lc 7073'1968 (6871'300,7143'1973] local-lis/les=8435/8436 n=1863 ec=216/216 lis/c 8435/6886 les/c/f 8436/6888/0 8478/8478/8451) [4,1,2147483647,3,5] r=1 lpr=8478 pi=[6886,8478)/5 crt=7143'1973 lcod 0'0 unknown NOTIFY m=5] enter Started/ReplicaActive/RepNotRecovering
ceph-osd.1.log.1:    -1> 2017-08-30 06:24:40.454764 7ff544303700  1 -- 192.168.121.12:6818/30788 <== osd.5 192.168.121.12:6802/30664 263 ==== pg_info((query:8479 sent:8479 5.72( v 213'5627 (154'4127,213'5627] local-lis/les=8478/8479 n=428 ec=99/99 lis/c 8478/8478 les/c/f 8479/8479/0 8478/8478/8471))=(empty) epoch 8479) v5 ==== 934+0+0 (3514905517 0 0) 0x558a2254b000 con 0x5589d6d47000
ceph-osd.1.log.1:     0> 2017-08-30 06:24:40.454765 7ff53a250700 -1 /build/ceph-12.1.4/src/osd/PGLog.h: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd_pglog::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd_pglog::list<pg_log_entry_t> = std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14u, pg_log_entry_t> >]' thread 7ff53a250700 time 2017-08-30 06:24:40.450894
ceph-osd.1.log.1:/build/ceph-12.1.4/src/osd/PGLog.h: 903: FAILED assert(objiter->second->version > last_divergent_update)
ceph-osd.1.log.1:
ceph-osd.1.log.1: ceph version 12.1.4 (a5f84b37668fc8e03165aaf5cbb380c78e4deba4) luminous (rc)
ceph-osd.1.log.1: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5589b743b1c2]
ceph-osd.1.log.1: 2: (void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x3093) [0x5589b6fc5073]
ceph-osd.1.log.1: 3: (void PGLog::_merge_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, std::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> >&, pg_info_t const&, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x1d7) [0x5589b6fc5697]
ceph-osd.1.log.1: 4: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xbb4) [0x5589b6fbe374]
ceph-osd.1.log.1: 5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x5589b6f1f904]
ceph-osd.1.log.1: 6: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x2ce) [0x5589b6f6b0ce]
ceph-osd.1.log.1: 7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x354) [0x5589b6faf984]
ceph-osd.1.log.1: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x5589b6f8cecb]
ceph-osd.1.log.1: 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x21) [0x5589b6f8d0e1]
ceph-osd.1.log.1: 10: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1eb) [0x5589b6f5d5db]
ceph-osd.1.log.1: 11: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x200) [0x5589b6ea9ca0]
ceph-osd.1.log.1: 12: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x5589b6f097d7]
ceph-osd.1.log.1: 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x5589b7441eff]
ceph-osd.1.log.1: 14: (ThreadPool::WorkThread::entry()+0x10) [0x5589b7442e30]
ceph-osd.1.log.1: 15: (()+0x7494) [0x7ff552632494]
ceph-osd.1.log.1: 16: (clone()+0x3f) [0x7ff5516b9aff]

#2 Updated by John Spray about 2 years ago

  • Subject changed from OSD crash using EC pool for CephFS to OSD crash: 903: FAILED assert(objiter->second->version > last_divergent_update)

Martin: just to confirm, you were seeing this crash while you had EC pools involved, and when you do not have any EC pools you do not see the crash?

#3 Updated by Martin Millnert about 2 years ago

To clarify: I have not tested this with a replicated cephfs data pool, only with an EC data pool, as per my 4 commands above.

I can do one of two things:

1. I can scrap the cephfs fs and the cephfs EC pool (cephfs_data above) and recreate it with a replicated pool. There is no data loss if I scrap the fs and start over.
2. I can wait for a possible fix and avoid scrapping ~2 days' worth of IO.

If you want me to do 1) to double-check the replicated case, I can do that. Just let me know.

#4 Updated by George Zhao about 2 years ago

I've got exactly the same problem with the kernel client, but the FUSE client seems fine with an EC pool on CephFS.

#6 Updated by Sage Weil 6 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)

#7 Updated by Grant Slater 6 months ago

I have a similar issue with OSDs dropping out:

/build/ceph-13.2.5/src/osd/PGLog.h: 923: FAILED assert(objiter->second->version > last_divergent_update)

ceph-post-file: 3c7be3d8-dea9-4cf9-91ef-470e81d1b29f
ceph-post-file: 278918b9-dbcb-4bd8-ac7d-ab1210aaefc5

#8 Updated by Sage Weil 6 months ago

  • Status changed from New to Verified
  • Priority changed from Normal to Urgent
  -103> 2019-03-16 23:01:20.667 7fd576365700 10 osd.5 pg_epoch: 65479 pg[1.96s3( v 5163'138877 lc 3142'138872 (2469'135808,5163'138877] local-lis/les=5371/5372 n=66596 ec=71/51 lis/c 65428/5139 les/c/f 65429/5144/0 65477/65477/65469) [4,6,3,5,2147483647,1]p4(0) r=3 lpr=65477 pi=[5139,65477)/24 crt=5163'138877 lcod 0'0 unknown NOTIFY m=2 mbc={}] state<Started/Stray>: got info+log from osd.4(0) 1.96s0( v 7471'138879 (2469'135808,7471'138879] local-lis/les=65477/65479 n=66596 ec=71/51 lis/c 65428/5139 les/c/f 65429/5144/0 65477/65477/65469) log((5162'138874,7471'138879], crt=7471'138879)
  -102> 2019-03-16 23:01:20.667 7fd576365700 10 merge_log log((5162'138874,7471'138879], crt=7471'138879) from osd.4(0) into log((2469'135808,5163'138877], crt=5163'138877)
  -101> 2019-03-16 23:01:20.667 7fd576365700 20 pg_missing_t sobject: 1:694806f7:::20000a2c83b.00000093:head
  -100> 2019-03-16 23:01:20.667 7fd576365700 20 pg_missing_t sobject: 1:69553ec9:::20000a2c83a.00000018:head
   -99> 2019-03-16 23:01:20.667 7fd576365700 10 merge_log extending head to 7471'138879
   -98> 2019-03-16 23:01:20.667 7fd576365700 20   ? 7471'138879 (3082'138822) modify   1:696a96d1:::20000a2c834.000000bc:head by client.1456106.0:2358 2018-06-14 23:24:30.534420 0
   -97> 2019-03-16 23:01:20.667 7fd576365700 20   ? 7309'138878 (3085'138826) modify   1:697397ba:::20000a2c838.000000c1:head by client.1456106.0:2327 2018-06-14 23:24:30.266415 0
   -95> 2019-03-16 23:01:20.667 7fd576365700 20   ? 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0
   -94> 2019-03-16 23:01:20.667 7fd576365700 20 merge_log cut point (usually last shared) is 5162'138874
   -59> 2019-03-16 23:01:20.667 7fd576365700 10 merge_log divergent 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0
   -58> 2019-03-16 23:01:20.667 7fd576365700 20 update missing, append 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0
   -57> 2019-03-16 23:01:20.667 7fd576365700 20 update missing, append 7309'138878 (3085'138826) modify   1:697397ba:::20000a2c838.000000c1:head by client.1456106.0:2327 2018-06-14 23:24:30.266415 0
   -56> 2019-03-16 23:01:20.667 7fd576365700 20 update missing, append 7471'138879 (3082'138822) modify   1:696a96d1:::20000a2c834.000000bc:head by client.1456106.0:2358 2018-06-14 23:24:30.534420 0
   -54> 2019-03-16 23:01:20.667 7fd576365700 20 _merge_object_divergent_entries: merging hoid 1:694806f7:::20000a2c83b.00000093:head entries: 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0
   -53> 2019-03-16 23:01:20.667 7fd576365700 20 _merge_object_divergent_entries: keeping 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0
   -51> 2019-03-16 23:01:20.667 7fd576365700 10 _merge_object_divergent_entries: hoid 1:694806f7:::20000a2c83b.00000093:head prior_version: 3085'138827 first_divergent_update: 5338'138875 last_divergent_update: 5338'138875
   -49> 2019-03-16 23:01:20.667 7fd576365700 10 _merge_object_divergent_entries: more recent entry found: 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0, already merged
     0> 2019-03-16 23:01:20.671 7fd576365700 -1 /build/ceph-13.2.5/src/osd/PGLog.h: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd_pglog::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd_pglog::list<pg_log_entry_t> = std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> >]' thread 7fd576365700 time 2019-03-16 23:01:20.672292
/build/ceph-13.2.5/src/osd/PGLog.h: 923: FAILED assert(objiter->second->version > last_divergent_update)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fd5979365c2]
 2: (()+0x2f9787) [0x7fd597936787]
 3: (void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x39b0) [0x872e10]
 4: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xe30) [0x86c700]
 5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x7c9cc4]
 6: (PG::RecoveryState::Stray::react(MLogRec const&)+0xef) [0x810e0f]
 7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1a8) [0x85f9b8]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x83262b]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x818293]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x7594af]
 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c5ec0]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x769240]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7fd59793c6f6]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fd59793d8b0]
 15: (()+0x76ba) [0x7fd595faa6ba]
 16: (clone()+0x6d) [0x7fd5955b941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

from 3c7be3d8-dea9-4cf9-91ef-470e81d1b29f

osd.2 also crashes, but differently, see 278918b9-dbcb-4bd8-ac7d-ab1210aaefc5

 -5838> 2019-03-16 23:00:50.448 7faad104f700 10 osd.2 pg_epoch: 65473 pg[1.cas2( v 5163'140311 (2469'137238,5163'140311] local-lis/les=5256/5259 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5163'140311 lcod 0'0 unknown NOTI
FY mbc={}] state<Started/Stray>: got info+log from osd.0(0) 1.cas0( v 5281'140312 (2469'137238,5281'140312] local-lis/les=65471/65473 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) log((5163'140309,5281'140312], crt=5163'140311)
 -5837> 2019-03-16 23:00:50.448 7faad104f700 10 merge_log log((5163'140309,5281'140312], crt=5163'140311) from osd.0(0) into log((2469'137238,5163'140311], crt=5163'140311)
 -5836> 2019-03-16 23:00:50.448 7faad104f700 10 merge_log extending head to 5281'140312
 -5835> 2019-03-16 23:00:50.448 7faad104f700 20   ? 5281'140312 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0
 -5834> 2019-03-16 23:00:50.448 7faad104f700 20 merge_log cut point (usually last shared) is 5163'140309
 -5793> 2019-03-16 23:00:50.448 7faad104f700 10 merge_log divergent 5163'140310 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0
 -5792> 2019-03-16 23:00:50.448 7faad104f700 10 merge_log divergent 5163'140311 (3082'140263) modify   1:537949df:::20000a2c834.00000105:head by client.1456106.0:3338 2018-06-14 23:25:29.223625 0
 -5790> 2019-03-16 23:00:50.448 7faad104f700 20 update missing, append 5281'140312 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0
 -5785> 2019-03-16 23:00:50.448 7faad104f700 20 _merge_object_divergent_entries: merging hoid 1:5368c51a:::20000a2c835.000000f3:head entries: 5163'140310 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0
 -5784> 2019-03-16 23:00:50.448 7faad104f700 20 _merge_object_divergent_entries: keeping 5163'140310 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0
 -5782> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:5368c51a:::20000a2c835.000000f3:head prior_version: 3082'140264 first_divergent_update: 5163'140310 last_divergent_update: 5163'140310
 -5780> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: more recent entry found: 5281'140312 (3082'140264) modify   1:5368c51a:::20000a2c835.000000f3:head by client.1456106.0:2894 2018-06-14 23:25:03.831106 0, already merged
 -5777> 2019-03-16 23:00:50.448 7faad104f700 20 _merge_object_divergent_entries: merging hoid 1:537949df:::20000a2c834.00000105:head entries: 5163'140311 (3082'140263) modify   1:537949df:::20000a2c834.00000105:head by client.1456106.0:3338 2018-06-14 23:25:29.223625 0
 -5776> 2019-03-16 23:00:50.448 7faad104f700 20 _merge_object_divergent_entries: keeping 5163'140311 (3082'140263) modify   1:537949df:::20000a2c834.00000105:head by client.1456106.0:3338 2018-06-14 23:25:29.223625 0
 -5775> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head prior_version: 3082'140263 first_divergent_update: 5163'140311 last_divergent_update: 5163'140311
 -5773> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head has no more recent entries in log
 -5772> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head must be rolled back or recovered, attempting to rollback
 -5770> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head rolling back 5163'140311 (3082'140263) modify   1:537949df:::20000a2c834.00000105:head by client.1456106.0:3338 2018-06-14 23:25:29.223625 0
 -5768> 2019-03-16 23:00:50.448 7faad104f700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head rolled back
 -5766> 2019-03-16 23:00:50.448 7faad104f700 10 merge_log result log((2469'137238,5281'140312], crt=5281'140312) missing(1 may_include_deletes = 1) changed=1
 -5765> 2019-03-16 23:00:50.448 7faad104f700  5 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140311 (2469'137238,5281'140312] local-lis/les=5256/5259 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown NOTIFY m=1 mbc={}] exit Started/Stray 3.249886 10 0.000241
 -5764> 2019-03-16 23:00:50.448 7faad104f700  5 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140311 (2469'137238,5281'140312] local-lis/les=5256/5259 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown NOTIFY m=1 mbc={}] enter Started/ReplicaActive
 -5762> 2019-03-16 23:00:50.448 7faad104f700  5 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140311 (2469'137238,5281'140312] local-lis/les=5256/5259 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown NOTIFY m=1 mbc={}] enter Started/ReplicaActive/RepNotRecovering
 -5761> 2019-03-16 23:00:50.448 7faad104f700 10 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140311 (2469'137238,5281'140312] local-lis/les=5256/5259 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown NOTIFY m=1 mbc={}] state<Started/ReplicaActive>: In ReplicaActive, about to call activate
 -5760> 2019-03-16 23:00:50.448 7faad104f700 10 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140311 (2469'137238,5281'140312] local-lis/les=65471/65473 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown m=1 mbc={}] activate - not complete, missing(1 may_include_deletes = 1)
 -5759> 2019-03-16 23:00:50.448 7faad104f700 10 osd.2 pg_epoch: 65473 pg[1.cas2( v 5281'140312 lc 5163'140309 (2469'137238,5281'140312] local-lis/les=65471/65473 n=67222 ec=71/51 lis/c 65422/5135 les/c/f 65424/5144/0 65471/65471/65469) [0,2147483647,2,3,4,7]p0(0) r=2 lpr=65471 pi=[5135,65471)/18 crt=5281'140312 lcod 0'0 unknown m=1 mbc={}] state<Started/ReplicaActive>: Activate Finished
 -5758> 2019-03-16 23:00:50.448 7faad104f700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 5163'140309, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
    -3> 2019-03-16 23:00:50.524 7faad104f700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 2, counting from 0)
    -2> 2019-03-16 23:00:50.524 7faad104f700 -1 bluestore(/var/lib/ceph/osd/ceph-2) ENOENT on clone suggests osd bug
    -1> 2019-03-16 23:00:50.524 7faad104f700  0 bluestore(/var/lib/ceph/osd/ceph-2) _dump_transaction transaction dump:
...

#9 Updated by Martin Millnert 6 months ago

For completeness: the root cause of the crashes I experienced was that I had oversized RADOS objects (2-10 GB, where the recommended maximum is on the order of 100 MB or less; see the docs for the exact figure) in my non-erasure-coded, non-BlueStore pool. The EC pool on BlueStore (I'm not clear on the exact relation here) enforced strict bounds on RADOS object sizes.

I worked around the issue by setting up cephfs and migrating out the objects onto that instead.

#10 Updated by Martin Millnert 6 months ago

Err. I believe I mixed up two different bugs, please disregard my previous comment. I don't currently recall what I did to solve the above on my end.

#11 Updated by Neha Ojha 6 months ago

Hi Grant,

Looking at the logs, it seems that the first crash was seen on osd.2, on pg 1.cas2:

2018-10-12 06:27:49.044 7f6036b96700 -1 bluestore(/var/lib/ceph/osd/ceph-2) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 2, counting from 0)
2018-10-12 06:27:49.044 7f6036b96700 -1 bluestore(/var/lib/ceph/osd/ceph-2) ENOENT on clone suggests osd bug

Before the crash, this is what happened:

2018-10-12 06:27:49.016 7f6036b96700 10 merge_log log((5163'140309,5281'140312], crt=5163'140311) from osd.0(0) into log((2469'137238,5163'140311], crt=5163'140311)
2018-10-12 06:27:49.016 7f6036b96700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head rolling back 5163'140311 (3082'140263) modify   1:537949df:::20000a2c834.00000105:head by client.1456106.0:3338 2018-06-14 23:25:29.223625 0
2018-10-12 06:27:49.016 7f6036b96700 10 _merge_object_divergent_entries: hoid 1:537949df:::20000a2c834.00000105:head rolled back

FWIW, the crash seems similar to https://tracker.ceph.com/issues/36598, but could have a totally different root cause, which resulted in the object not being present.

The failure on osd.5 is seen afterwards on pg id 1.96s3

2018-10-12 06:33:47.622 7f1f0bfd8700 -1 /build/ceph-13.2.2/src/osd/PGLog.h: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd_pglog::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd_pglog::list<pg_log_entry_t> = std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> >]' thread 7f1f0bfd8700 time 2018-10-12 06:33:47.624287
/build/ceph-13.2.2/src/osd/PGLog.h: 923: FAILED assert(objiter->second->version > last_divergent_update)

This happens while doing the following:

2018-10-12 06:33:47.618 7f1f0bfd8700 10 merge_log log((5162'138874,7471'138879], crt=7471'138879) from osd.4(0) into log((2469'135808,5163'138877], crt=5163'138877)

when it finds:

2018-10-12 06:33:47.618 7f1f0bfd8700 10 _merge_object_divergent_entries: hoid 1:694806f7:::20000a2c83b.00000093:head prior_version: 3085'138827 first_divergent_update: 5338'138875 last_divergent_update: 5338'138875
2018-10-12 06:33:47.618 7f1f0bfd8700 10 _merge_object_divergent_entries: more recent entry found: 5338'138875 (3085'138827) modify   1:694806f7:::20000a2c83b.00000093:head by client.1456106.0:1718 2018-06-14 23:23:37.541317 0, already merged

and asserts because objiter->second->version (5338'138875) is equal to last_divergent_update (5338'138875), while the assert requires a strictly greater version.

I have more investigation to do on this, but it will be useful to have the logs from osd.0 and osd.4, which were the corresponding primaries.

#12 Updated by Grant Slater 6 months ago

As requested.

osd.0: ceph-post-file: 17efe900-501c-479f-ba56-dd29fef18c58
osd.4: ceph-post-file: ff22f830-e6bc-4ffc-a051-3c3145ac638d

#13 Updated by Neha Ojha 6 months ago

It is possible that the crash we are seeing on osd.2 is due to 1:537949df:::20000a2c834.00000105:head being incorrectly rolled back in Case 5 of _merge_object_divergent_entries().

if (!i->can_rollback() || i->version <= olog_can_rollback_to) decides whether we can roll back or not. i->version here is 5163'140311, but looking at the logs it is hard to find out what the value of olog_can_rollback_to was. All we know is that we decided to roll back, and that after merge_log() finished, crt=5281'140312.

We might need to preserve the original crt on the merge target, something like what we were trying to do in https://github.com/ceph/ceph/pull/26347. But it is hard to tell without knowing more details.

I have a PR that explicitly prints olog_can_rollback_to (https://github.com/ceph/ceph/pull/27105) before making that decision.

Grant, how easily can you reproduce this? Would it be possible for you to apply a patch and reproduce this failure?

#14 Updated by Grant Slater 6 months ago

Yes, I can still reproduce it; the cluster is still in a broken state:

  cluster:
    id:     08ff77c3-5f4c-4757-afc8-56ea67bf4343
    health: HEALTH_WARN
            1 filesystem is degraded
            noout,noscrub,nodeep-scrub flag(s) set
            2 osds down
            Reduced data availability: 134 pgs inactive, 47 pgs down, 81 pgs incomplete
            Degraded data redundancy: 18090022/209172759 objects degraded (8.648%), 295 pgs degraded, 295 pgs undersized

  services:
    mon: 5 daemons, quorum cluster-01,cluster-02,cluster-03,cluster-04,cluster-05
    mgr: cluster-02(active), standbys: cluster-05, cluster-03, cluster-01, cluster-04
    mds: cephfs-4/4/4 up  {0=cluster-03=up:replay,1=cluster-01=up:resolve,2=cluster-02=up:resolve,3=cluster-05=up:resolve}, 1 up:standby
    osd: 24 osds: 22 up, 24 in; 9 remapped pgs
         flags noout,noscrub,nodeep-scrub

  data:
    pools:   2 pools, 576 pgs
    objects: 35.38 M objects, 16 TiB
    usage:   34 TiB used, 59 TiB / 93 TiB avail
    pgs:     23.264% pgs not active
             18090022/209172759 objects degraded (8.648%)
             290 active+undersized+degraded
             152 active+clean
             77  incomplete
             43  down
             5   undersized+degraded+peered
             4   remapped+incomplete
             4   down+remapped
             1   remapped

I have built and deployed with the https://github.com/ceph/ceph/pull/27105 logging patch.

Attached are the logs for the 2x OSD that mainly crash:
osd.2: ceph-post-file: 239e8b8c-70b6-479a-a822-f2813b1b9991
osd.5: ceph-post-file: 7c843140-42d1-4cda-a9ba-a04354fe5fca

Attached are the logs for the other OSDs that occasionally crash:
osd.0: ceph-post-file: ffa15979-c3be-44cc-b563-b4f6a72edb2d
osd.1: ceph-post-file: 5bdebf20-f9e6-4b27-8acd-1f374a761b2a
osd.4: ceph-post-file: b65b6b2e-5500-48a7-83cf-5ec727b6b9aa

#15 Updated by Neha Ojha 6 months ago

https://github.com/ceph/ceph/pull/27200 attempts to resolve the failure seen on osd.2

#16 Updated by Grant Slater 6 months ago

I am still seeing crashes with https://github.com/ceph/ceph/pull/27200 backported.

Attached are logs.

osd.2 ceph-post-file: fb80a2a5-e373-42e4-ad30-27121b284cf7
osd.5 ceph-post-file: 41b6074c-8480-47a3-a759-b2729c5d0e87

#17 Updated by Neha Ojha 6 months ago

Hi Grant,

Thanks for applying the patch and uploading the logs. It looks like the earlier crash on osd.2 (the ENOENT on clone, which suggests an OSD bug) has gone away, and both OSDs are now reporting the same problem. Will investigate this further.

osd.2

 
  -9> 2019-03-28 17:14:46.482 7f03e9ce7700 20 osd.2 pg_epoch: 66018 pg[2.3a( v 65910'174237 (43944'171225,65910'174237] local-lis/les=66017/66018 n=16265 ec=139/139 lis/c 66015/65471 les/c/f 66016/65473/0 66017/66017/66017) [2,3,4] r=0 lpr=66017 pi=[65471,66017)/3 crt=65910'174237 lcod 0'0 mlcod 0'0 activating mbc={}] rollforward: entry=43947'172084 (1466'48358) modify   2:5f7e2597:::200007988b5.00000000:head by osd.3.0:38024 0.000000 0
  -8> 2019-03-28 17:14:46.482 7f03e94e6700 -1 /src/ceph-13.2.5/src/osd/PGLog.h: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd_pglog::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd_pglog::list<pg_log_entry_t> = std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> >]' thread 7f03e94e6700 time 2019-03-28 17:14:46.484179
/src/ceph-13.2.5/src/osd/PGLog.h: 926: FAILED assert(objiter->second->version > last_divergent_update)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f040cabb5c2]
 2: (()+0x2f9787) [0x7f040cabb787]
 3: (void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x3e7d) [0x87473d]
 4: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xe7c) [0x86dccc]
 5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x7cac74]
 6: (PG::RecoveryState::Stray::react(MLogRec const&)+0xef) [0x811dbf]
 7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1a8) [0x860968]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x8335db]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x819243]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x75a45f]
 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c8ad0]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x76a1f0]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7f040cac16f6]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f040cac28b0]
 15: (()+0x76ba) [0x7f040b12f6ba]
 16: (clone()+0x6d) [0x7f040a73e41d]

osd.5

    -1> 2019-03-28 17:03:25.077 7f878c5cd700 20 osd.5 pg_epoch: 66005 pg[1.cas1( v 5281'140312 (2469'137238,5281'140312] local-lis/les=66000/66003 n=67222 ec=71/51 lis/c 66000/5135 les/c/f 66003/5144/0 66000/66000/65994) [0,5,2147483647,3,4,7]p0(0) r=1 lpr=66000 pi=[5135,66000)/18 luod=0'0 crt=5281'140312 lcod 0'0 active mbc={}] handle_activate_map: Not dirtying info: last_persisted is 66003 while current is 66005
     0> 2019-03-28 17:03:25.077 7f878d5cf700 -1 /src/ceph-13.2.5/src/osd/PGLog.h: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, mempool::osd_pglog::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, eversion_t, missing_type&, PGLog::LogEntryHandler*, const DoutPrefixProvider*) [with missing_type = pg_missing_set<true>; mempool::osd_pglog::list<pg_log_entry_t> = std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> >]' thread 7f878d5cf700 time 2019-03-28 17:03:25.077113
/src/ceph-13.2.5/src/osd/PGLog.h: 926: FAILED assert(objiter->second->version > last_divergent_update)

 ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f87aeba05c2]
 2: (()+0x2f9787) [0x7f87aeba0787]
 3: (void PGLog::_merge_object_divergent_entries<pg_missing_set<true> >(PGLog::IndexedLog const&, hobject_t const&, std::__cxx11::list<pg_log_entry_t, mempool::pool_allocator<(mempool::pool_index_t)14, pg_log_entry_t> > const&, pg_info_t const&, eversion_t, eversion_t, pg_missing_set<true>&, PGLog::LogEntryHandler*, DoutPrefixProvider const*)+0x3e7d) [0x87473d]
 4: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xe7c) [0x86dccc]
 5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x7cac74]
 6: (PG::RecoveryState::Stray::react(MLogRec const&)+0xef) [0x811dbf]
 7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x1a8) [0x860968]
 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x8335db]
 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x819243]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x75a45f]
 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c8ad0]
 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x76a1f0]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7f87aeba66f6]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f87aeba78b0]
 15: (()+0x76ba) [0x7f87ad2146ba]
 16: (clone()+0x6d) [0x7f87ac82341d]
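For readers following these traces: the version strings in the logs, e.g. 5338'138875, are eversions of the form epoch'version, and the failed assert requires the object's newest log entry to compare strictly greater than last_divergent_update. A minimal sketch of parsing and comparing them (the helper name is mine, not Ceph's; only the epoch'version notation comes from the logs above):

```python
def parse_eversion(s: str) -> tuple[int, int]:
    """Parse an eversion string like "5338'138875" into an (epoch, version)
    tuple. Tuples compare epoch-first, then version, which mirrors how
    these values are ordered in the log output above."""
    epoch, version = s.split("'")
    return (int(epoch), int(version))

# The mismatch from the bad log dumped later in this ticket: the recorded
# head sorts strictly before the version of the final log entry.
head = parse_eversion("5163'138877")
last_entry = parse_eversion("5338'138875")
print(head < last_entry)  # → True: an older epoch loses regardless of the counter
```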

#18 Updated by Neha Ojha 6 months ago

  • Related to Bug #39023: osd/PGLog: preserve original_crt to check rollbackability added

#19 Updated by Samuel Just 6 months ago

Grant: I notice that the initial event outlined above is from October. Is that the first anomalous behavior this cluster exhibited? The osd.2 logs show that 1.cas0 was already missing two OSDs; were there failures prior to that point?

#20 Updated by Neha Ojha 6 months ago

Hi Grant, could you dump the pg log with a command like "ceph-objectstore-tool --no-mon-config --data-path dev/osd0 --op log --pgid 1.0", starting with 1.cas0?

#21 Updated by Grant Slater 6 months ago

Output from: ceph-objectstore-tool --no-mon-config --data-path /var/lib/ceph/osd/ceph-0 --op log --pgid 1.cas0

1.cas0 ceph-post-file: afac3c51-58c7-497a-9e58-564cb43f4755

#22 Updated by Grant Slater 6 months ago

Per request on irc.

pg log:
1.cas2 on osd.2: ceph-post-file: d74a0006-c0e9-41b1-a904-7bfe41617253
1.96s3 on osd.5: ceph-post-file: 80178cca-1a7b-44b8-9226-df345f53d15c
1.96s0 on osd.4: ceph-post-file: 9ba9ebbf-aca5-4cb4-8727-49a220e39042

#23 Updated by David Zafman 6 months ago

  • Related to Bug #16279: assert(objiter->second->version > last_divergent_update) failed added

#24 Updated by David Zafman 6 months ago

Maybe we could check each log in load_pgs(): if it is corrupt (head != the head entry's version), move the PG aside and ignore it. If this bug's reporter boots the cluster with such an OSD patch, the cluster may be able to recover/backfill all the PG shards that were lost.

A less invasive approach is to report the bad logs and then export-remove all PGs reported as bad throughout the cluster. This way the bad PGs are preserved, and the cluster may be able to recover itself. Please send one of the exports using ceph-post-file.

#25 Updated by David Zafman 6 months ago

Here is what the bad log that caused one of the crashes looks like. The head is clearly bad, because the log ends with 5338'138875 instead of 5163'138877.
Did the corruption occur during epoch 5163, or when a previous merge attempt was made? Or is this a log update that never hit disk because of unwritten data from a crash or power outage?

{
    "pg_log_t": {
        "head": "5163'138877",
        "tail": "2469'135808",
        "log": [
            {
                "op": "modify",
                "object": "1:6954f269:::100007ff139.00000000:head",
                "version": "2469'135809",
                "prior_version": "0'0",
                "reqid": "client.470522.0:11362371",
                "extra_reqids": [],
                "mtime": "2009-08-05 12:15:19.000000",
                "return_code": 0,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": true,
                        "rollback_info_completed": true,
                        "ops": [
                            {
                                "code": "CREATE" 
                            }
                        ]
                    }
                }
            },
...
            {
                "op": "modify",
                "object": "1:694806f7:::20000a2c83b.00000093:head",
                "version": "5338'138875",
                "prior_version": "3085'138827",
                "reqid": "client.1456106.0:1718",
                "extra_reqids": [],
                "mtime": "2018-06-14 23:23:37.541317",
                "return_code": 0,
                "mod_desc": {
                    "object_mod_desc": {
                        "can_local_rollback": true,
                        "rollback_info_completed": false,
                        "ops": [
                            {
                                "code": "SETATTRS",
                                "attrs": [
                                    "_",
                                    "hinfo_key",
                                    "snapset" 
                                ]
                            },
                            {
                                "code": "ROLLBACK_EXTENTS",
                                "gen": 138875,
                                "snaps": "[0,1048576]" 
                            }
                        ]
                    }
                }
            }
        ],
        "dups": []
    },
    "pg_missing_t": {
        "missing": [
            {
                "object": "1:694806f7:::20000a2c83b.00000093:head",
                "need": "5338'138875",
                "have": "3085'138827",
                "flags": "none" 
            },
            {
                "object": "1:69553ec9:::20000a2c83a.00000018:head",
                "need": "5157'138873",
                "have": "3082'138823",
                "flags": "none" 
            }
        ],
        "may_include_deletes": true
    }
}
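The head-vs-last-entry inconsistency described above can be checked mechanically against a ceph-objectstore-tool --op log dump. A rough sketch, assuming only the JSON field names visible in the dump above (the function name and file path are illustrative, not part of any Ceph tooling):

```python
import json

def pg_log_head_consistent(dump_path: str) -> bool:
    """Return False if the dump's pg_log_t head does not match the
    version of the final entry in its log list."""
    with open(dump_path) as f:
        dump = json.load(f)
    pg_log = dump["pg_log_t"]
    entries = pg_log["log"]
    if not entries:
        return True  # nothing to cross-check in an empty log
    head = pg_log["head"]
    last = entries[-1]["version"]
    if head != last:
        print(f"corrupt pg log: head {head} but log ends with {last}")
        return False
    return True
```

On the dump above this would compare head 5163'138877 against a final entry of 5338'138875 and flag the log as corrupt; plain string inequality suffices here because we only need to detect the mismatch, not order the eversions.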

#26 Updated by Grant Slater 6 months ago

Yes, the issue was most likely triggered by a power outage: the two OSDs hit the FAILED assert and the cluster is unable to recover.

#27 Updated by David Zafman 4 months ago

  • Status changed from Verified to Rejected

I'm closing this bug. The hardware configuration must ensure that data synced to disk is actually durable, which requires the proper disk cache settings.

An upgrade to Nautilus is also recommended.

Also available in: Atom PDF