Bug #16279

assert(objiter->second->version > last_divergent_update) failed

Added by huang jun almost 8 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
FileStore, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We built a cluster with 2 OSDs, wrote 100 GB of files, and cut the power while the files were being deleted.
After the machine rebooted, one OSD had crashed; the OSD log shows:

-2717> 2016-06-14 11:13:17.222426 7f39273e7700 10 merge_log divergent 54'296 (54'33) delete 67bbeddb/10000000004.0000002c/head//1 by mds.0.1:12518 2016-06-13 17:08:42.484963
-2716> 2016-06-14 11:13:17.222627 7f39273e7700 10 _merge_object_divergent_entries: merging hoid 67bbeddb/10000000004.0000002c/head//1 entries: 54'296 (54'33) delete 67bbeddb/10000000004.0000002c/head//1 by mds.0.1:12518 2016-06-13 17:08:42.484963
-2715> 2016-06-14 11:13:17.222641 7f39273e7700 10 _merge_object_divergent_entries: hoid 67bbeddb/10000000004.0000002c/head//1 prior_version: 54'33 first_divergent_update: 54'296 last_divergent_update: 54'296
-2714> 2016-06-14 11:13:17.225928 7f39273e7700 -1 osd/PGLog.cc: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, const std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, const pg_info_t&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >, PGLog::LogEntryHandler)' thread 7f39273e7700 time 2016-06-14 11:13:17.222650
osd/PGLog.cc: 366: FAILED assert(objiter->second->version > last_divergent_update)
2016-06-14 11:13:17.358733 7f39273e7700 -1 *** Caught signal (Aborted) **

Related code:
@
  ceph::unordered_map<hobject_t, pg_log_entry_t*>::const_iterator objiter =
    log.objects.find(hoid);
  if (objiter != log.objects.end() &&
      objiter->second->version >= first_divergent_update) {
    /// Case 1)
    assert(objiter->second->version > last_divergent_update);

    ldpp_dout(dpp, 10) << __func__ << ": more recent entry found: "
                       << *objiter->second << ", already merged" << dendl;
    // ensure missing has been updated appropriately
    if (objiter->second->is_update()) {
      assert(missing.is_missing(hoid) &&
             missing.missing[hoid].need == objiter->second->version);
    } else {
      assert(!missing.is_missing(hoid));
    }
    missing.revise_have(hoid, eversion_t());
    if (rollbacker && !object_not_in_store)
      rollbacker->remove(hoid);
    return;
  }
@

From the log there is only one divergent entry, so "first_divergent_update" equals "last_divergent_update";
if "objiter->second->version" also equals "first_divergent_update", the assert fails.


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #21174: OSD crash: 903: FAILED assert(objiter->second->version > last_divergent_update) (Rejected, 08/30/2017)

Actions #1

Updated by Josh Durgin over 7 years ago

  • Assignee set to Josh Durgin
Actions #2

Updated by Yao Ning over 7 years ago

Hi, we hit this crash for the same reason in Ceph 0.94.5.

I think it is possible that objiter->second->version == last_divergent_update?

Actions #3

Updated by Honggang Yang over 7 years ago

Yao Ning wrote:

Hi, we hit this crash for the same reason in Ceph 0.94.5.

I think it is possible that objiter->second->version == last_divergent_update?

Is there a way to reliably reproduce this problem?

A detailed OSD log for this PG, plus the crash stack, would help us find the cause.

Actions #4

Updated by huang jun over 7 years ago

Hi, we got this error again:
we pulled the power cable during heavy writes,
and after the machine booted, one OSD had crashed.

Here is the log, filtered by thread address:

-785> 2016-12-21 22:16:14.382242 7fad437e8700 10 osd.11 pg_epoch: 255 pg[1.83( v 234'1033 (0'0,234'1033] local-les=226 n=822 ec=50 les/c/f 252/252/0 255/255/255) [11,6] r=0 lpr=255 pi=225-254/5 crt=234'1023 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetLog>: GetLog: received master log from osd3
-781> 2016-12-21 22:16:14.382261 7fad437e8700 10 osd.11 pg_epoch: 255 pg[1.83( v 234'1033 (0'0,234'1033] local-les=226 n=822 ec=50 les/c/f 252/252/0 255/255/255) [11,6] r=0 lpr=255 pi=225-254/5 crt=234'1023 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetLog>: leaving GetLog
-775> 2016-12-21 22:16:14.382275 7fad437e8700 10 osd.11 pg_epoch: 255 pg[1.83( v 234'1033 (0'0,234'1033] local-les=226 n=822 ec=50 les/c/f 252/252/0 255/255/255) [11,6] r=0 lpr=255 pi=225-254/5 crt=234'1023 lcod 0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetLog>: processing master log
-771> 2016-12-21 22:16:14.382295 7fad437e8700 10 osd.11 pg_epoch: 255 pg[1.83( v 234'1033 (0'0,234'1033] local-les=226 n=822 ec=50 les/c/f 252/252/0 255/255/255) [11,6] r=0 lpr=255 pi=225-254/5 crt=234'1023 lcod 0'0 mlcod 0'0 peering] proc_master_log for osd.3: log((0'0,234'1034], crt=234'1034) missing(0)
-768> 2016-12-21 22:16:14.382310 7fad437e8700 10 merge_log log((0'0,234'1034], crt=234'1034) from osd.3 into log((0'0,234'1033], crt=234'1023)
-766> 2016-12-21 22:16:14.382317 7fad437e8700 10 merge_log extending head to 234'1034
-758> 2016-12-21 22:16:14.382324 7fad437e8700 10 merge_log divergent 234'1034 (0'0) modify 7e99ba83/10000000c01.00000f67/head//1 by client.114220.1:12420 2016-12-21 18:54:28.173860
-492> 2016-12-21 22:16:14.383148 7fad437e8700 10 _merge_object_divergent_entries: merging hoid 7e99ba83/10000000c01.00000f67/head//1 entries: 234'1034 (0'0) modify 7e99ba83/10000000c01.00000f67/head//1 by client.114220.1:12420 2016-12-21 18:54:28.173860
-488> 2016-12-21 22:16:14.383173 7fad437e8700 10 _merge_object_divergent_entries: hoid 7e99ba83/10000000c01.00000f67/head//1 prior_version: 0'0 first_divergent_update: 234'1034 last_divergent_update: 234'1034
-168> 2016-12-21 22:16:14.386491 7fad437e8700 -1 osd/PGLog.cc: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, const std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, const pg_info_t&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >, PGLog::LogEntryHandler)' thread 7fad437e8700 time 2016-12-21 22:16:14.383189

What I think:
olog is (0'0,234'1034]
log is (0'0,234'1033]
the divergent entry is 234'1034
so the lower_bound is 234'1033

In _merge_object_divergent_entries() there is only one entry to merge, so first_divergent_update equals last_divergent_update. In this case the entry "234'1034 (0'0) modify 7e99ba83/10000000c01.00000f67/head//1 by client.114220.1:12420 2016-12-21 18:54:28.173860" has already been merged, which meets case 1) of the "5 distinct cases".

So I think we can let it through with:
/// Case 1)
- assert(objiter->second->version > last_divergent_update);
+ assert(objiter->second->version >= last_divergent_update);

Actions #5

Updated by huang jun over 7 years ago

It is a bit weird that the divergent entry is 234'1034:
olog is (0'0,234'1034]
log is (0'0,234'1033]

so the lower_bound is <= 234'1033.

The divergent entry comes from the local log, so it should also be <= 234'1033 and never 234'1034.

Am I right?
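
To make the question concrete, here is a toy sketch (my own simplified model, not Ceph's actual merge_log code) of how one would expect divergent entries to be collected; with the heads from the log above, no divergent entry at 234'1034 should be possible:

@
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for eversion_t: ordered by (epoch, version).
struct Ver { uint64_t epoch, v; };
bool operator<(Ver a, Ver b) {
  return a.epoch < b.epoch || (a.epoch == b.epoch && a.v < b.v);
}

// Toy model: local entries strictly newer than the newest point the two logs
// could share are the candidates for divergence.
std::vector<Ver> collect_divergent(const std::vector<Ver>& local_entries,
                                   Ver local_head, Ver auth_head) {
  Ver bound = local_head < auth_head ? local_head : auth_head;
  std::vector<Ver> divergent;
  for (Ver e : local_entries)
    if (bound < e)
      divergent.push_back(e);
  return divergent;
}

int main() {
  // Heads from the comment #4 log: local log (0'0,234'1033], auth log (0'0,234'1034].
  std::vector<Ver> local = {{234, 1032}, {234, 1033}};
  std::vector<Ver> d =
      collect_divergent(local, /*local_head=*/{234, 1033}, /*auth_head=*/{234, 1034});
  // No local entry can exceed the local head 234'1033, so in this model a
  // divergent entry at 234'1034 is impossible -- which is what makes the log odd.
  assert(d.empty());
  std::printf("divergent entries in toy model: %zu\n", d.size());
  return 0;
}
@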

Actions #6

Updated by Samuel Just over 7 years ago

I don't think so; (0'0,234'1034] is the master log, so 234'1034 isn't divergent. You should trace back in the log to try to figure out how it ended up in the divergent entry list.

Actions #7

Updated by davinci yin about 7 years ago

1. We got the same problem when the data center lost power (the electricity was cut off). Here are the logs from two OSDs:

Apr 9 15:01:43 node10 ceph-osd: osd/PGLog.cc: In function 'static void PGLog::_merge_object_divergent_entries(const PGLog::IndexedLog&, const hobject_t&, const std::list<pg_log_entry_t>&, const pg_info_t&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >, PGLog::LogEntryHandler, const DoutPrefixProvider*)' thread 7fdeb91bf700 time 2017-04-09 15:01:43.360441
Apr 9 15:01:43 node10 ceph-osd: osd/PGLog.cc: 391: FAILED assert(objiter->second->version > last_divergent_update)
Apr 9 15:01:43 node10 ceph-osd: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
Apr 9 15:01:43 node10 ceph-osd: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fded958fb95]
Apr 9 15:01:43 node10 ceph-osd: 2: (PGLog::_merge_object_divergent_entries(PGLog::IndexedLog const&, hobject_t const&, std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, pg_info_t const&, eversion_t, pg_missing_t&, boost::optional<std::pair<eversion_t, hobject_t> >, PGLog::LogEntryHandler, DoutPrefixProvider const*)+0x20d4) [0x7fded9186ea4]
Apr 9 15:01:43 node10 ceph-osd: 3: (PGLog::_merge_divergent_entries(PGLog::IndexedLog const&, std::list<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, pg_info_t const&, eversion_t, pg_missing_t&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >, PGLog::LogEntryHandler, DoutPrefixProvider const*)+0x20b) [0x7fded918da6b]
Apr 9 15:01:43 node10 ceph-osd: 4: (PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0xd8c) [0x7fded918b14c]
Apr 9 15:01:43 node10 ceph-osd: 5: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0xbf) [0x7fded8fbd99f]
Apr 9 15:01:43 node10 ceph-osd: 6: (PG::RecoveryState::Stray::react(PG::MLogRec const&)+0x3d1) [0x7fded8fe87a1]
Apr 9 15:01:43 node10 ceph-osd: 7: (boost::statechart::simple_state<PG::RecoveryState::Stray, PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x214) [0x7fded9025204]
Apr 9 15:01:43 node10 ceph-osd: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7fded900e61b]
Apr 9 15:01:43 node10 ceph-osd: 9: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7fded8fd58c4]
Apr 9 15:01:43 node10 ceph-osd: 10: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7fded8f2ec19]
Apr 9 15:01:43 node10 ceph-osd: 11: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7fded8f79372]
Apr 9 15:01:43 node10 ceph-osd: 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7fded9580a4e]
Apr 9 15:01:43 node10 ceph-osd: 13: (ThreadPool::WorkThread::entry()+0x10) [0x7fded9581930]
Apr 9 15:01:43 node10 ceph-osd: 14: (()+0x7df3) [0x7fded74a0df3]
Apr 9 15:01:43 node10 ceph-osd: 15: (clone()+0x6d) [0x7fded5b2d3dd]

2. We also hit an XFS filesystem corruption; the log is:

Apr 10 10:05:17 node1 ceph-osd-prestart.sh: create-or-move updated item name 'osd.19' weight 0.0187 at location {host=node1,root=default} to crush map
Apr 10 10:05:17 node1 kernel: XFS (sdc1): xfs_log_force: error 5 returned.
Apr 10 10:05:17 node1 systemd: Started Ceph object storage daemon.
Apr 10 10:05:17 node1 ceph-osd: 2017-04-10 10:05:17.784412 7fc519049800 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-19: (2) No such file or directory
Apr 10 10:05:17 node1 systemd: : main process exited, code=exited, status=1/FAILURE
Apr 10 10:05:17 node1 systemd: Unit entered failed state.
Apr 10 10:05:17 node1 systemd: failed.
Apr 10 10:05:17 node1 kernel: XFS (sdc1): xfs_log_force: error 5 returned.
Apr 10 10:05:17 node1 kernel: XFS (sdc1): xfs_log_force: error 5 returned.

Can someone interpret this, or provide us with some ideas on how to resolve this problem?

Actions #8

Updated by Josh Durgin almost 7 years ago

  • Assignee deleted (Josh Durgin)

XFS corruption means your setup was not safe against power failure, or your disk is dying. Neither is something that Ceph can do anything about - corrupt on-disk state can lead to a crash like this.

Actions #9

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) FileStore, OSD added
Actions #10

Updated by mingyue zhao over 6 years ago

I have also hit this problem, on Ceph 0.94.5, while testing pulling a disk out and re-inserting it. According to @huang jun's OSD log, I think the divergent entry must not be 234'1034:

the local log is (0'0, 234'1033], the auth log is (0'0, 234'1034];
if the divergent entry is 234'1034, it means that log.log.rbegin and log.head are not pointing to the same entry.

Another doubt: if the divergent list has only one node (first_divergent_update == last_divergent_update), and (objiter->second->version > last_divergent_update >= first_divergent_update), does that mean objiter->second->version must not be equal to first_divergent_update?
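
Spelling the checks out for the single-entry case (this only restates the code quoted in the description, nothing new):

\[
\text{first\_divergent\_update} = \text{last\_divergent\_update} = v_d, \qquad
\text{Case 1 is entered} \iff \text{version} \ge v_d, \qquad
\text{the assert requires } \text{version} > v_d,
\]
\[
\text{so the assert fails exactly when } \text{version} = v_d = \text{first\_divergent\_update}.
\]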

Actions #12

Updated by Greg Farnum over 5 years ago

  • Status changed from New to Closed

Closing this ticket since the linked PR was also closed.

Actions #13

Updated by David Zafman about 5 years ago

  • Related to Bug #21174: OSD crash: 903: FAILED assert(objiter->second->version > last_divergent_update) added
