Bug #16503

OSDs assert during snap trim osd/ReplicatedPG.cc: 2655: FAILED assert(0)

Added by Michael Hackett over 7 years ago. Updated over 6 years ago.

Status:
Rejected
Priority:
High
Assignee:
David Zafman
Category:
OSD
Target version:
% Done:

0%

Source:
Support
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The cluster was previously running Ceph 0.94.3 and was encountering an issue with OSDs asserting due to snapset corruption during scrubbing (https://bugzilla.redhat.com/show_bug.cgi?id=1273127). It was updated to Ceph 0.94.7, as the belief was that the snapset corruption was caused by creating and/or deleting rbd snapshots during pg splitting. This use model creates and deletes thousands of rbd snapshots per day, and pgs had been split very shortly before the snapset corruption originally started happening.

The 0.94.7 upgrade (https://github.com/ceph/ceph/pull/7702) allowed scrubbing to complete, marking the pgs inconsistent instead of asserting. The customer was then able to track down the inconsistencies and resolve them, so all of the pgs are now consistent and scrubbable. The issue now is that the OSDs are hitting the assert below during snap trimming.

OSD Assert for OSD.234:

2016-06-27 08:08:16.909337 7f19777c0700 -1 osd/ReplicatedPG.cc: In function 'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 7f19777c0700 time 2016-06-27 08:08:16.903355
osd/ReplicatedPG.cc: 2655: FAILED assert(0)

ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbb1fab]
2: (ReplicatedPG::trim_object(hobject_t const&)+0x1e4) [0x85bb64]
3: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x427) [0x85e287]
4: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x8bf1f4]
5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x5f) [0x8ab92f]
6: (ReplicatedPG::snap_trimmer()+0x52c) [0x82f7fc]
7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x6c43aa]
8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xba2a0e]
9: (ThreadPool::WorkThread::entry()+0x10) [0xba3ab0]
10: (()+0x8182) [0x7f199ee58182]
11: (clone()+0x6d) [0x7f199d3c347d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

Logs are located: https://api.access.redhat.com/rs/cases/01658829/attachments/aa33247b-c123-4085-a276-f9b81c3e83a7

Version-Release number of selected component (if applicable):
Ceph 0.94.7
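
For context on the backtrace above: in the Hammer snap trimmer, trim_object() asserts right at the top when it cannot rebuild an object context for the clone it is about to trim, which is what happens when the clone's object-info attr is unreadable (see comment #3 below). A paraphrased sketch of that code path, simplified from the 0.94.x source rather than quoted verbatim:

ReplicatedPG::RepGather *ReplicatedPG::trim_object(const hobject_t &coid)
{
  // Load the clone's context. With can_create=false, get_object_context()
  // returns null when the object-info ("_") attr cannot be read back.
  ObjectContextRef obc = get_object_context(coid, false);
  if (!obc) {
    derr << __func__ << ": could not find coid " << coid << dendl;
    assert(0);  // osd/ReplicatedPG.cc:2655 -- the FAILED assert(0) above
  }
  // ...trimming proceeds from here when the context loads cleanly...
}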


Related issues

Duplicated by RADOS - Bug #19320: Pg inconsistent make ceph osd down New 03/21/2017

History

#2 Updated by Ian Colle over 7 years ago

  • Assignee set to David Zafman

#3 Updated by David Zafman over 7 years ago

There is an object rbd_data.b77eb164a531e5.0000000000004fdf in pg 0.1ef1 which has a large snaptrimq, and the object info attr is missing.
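
For anyone who needs to confirm the same state on disk: FileStore persists each object attr as a user.ceph.* xattr on the object's backing file, so the object-info attr appears as user.ceph._ (and the snapset as user.ceph.snapset). Below is a minimal sketch for probing it from the filesystem side, assuming direct FileStore access with the OSD stopped; the object's backing-file path is hypothetical and has to be located under the PG's directory first.

// check_oi_attr.cc -- minimal sketch, assuming a FileStore OSD where object
// attrs are stored as user.ceph.* xattrs on the backing file (object info = "_").
// Build: g++ -o check_oi_attr check_oi_attr.cc
#include <sys/xattr.h>
#include <cstdio>

int main(int argc, char **argv)
{
  if (argc != 2) {
    fprintf(stderr, "usage: %s <path-to-object-file>\n", argv[0]);
    return 2;
  }
  // Probe for the object-info xattr; length query only (no buffer).
  ssize_t len = getxattr(argv[1], "user.ceph._", NULL, 0);
  if (len < 0) {
    perror("getxattr user.ceph._");  // ENODATA here = object info attr missing
    return 1;
  }
  printf("object info attr present: %zd bytes\n", len);
  return 0;
}

Equivalently, `attr -l <file>` on the object's backing file should list ceph._ and ceph.snapset when the attrs are intact.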

#4 Updated by Vikhyat Umrao over 7 years ago

  • Source changed from other to Support

#5 Updated by Vikhyat Umrao over 7 years ago

  • Status changed from New to Rejected

Not a bug.

#6 Updated by Kefu Chai about 7 years ago

  • Duplicated by Bug #19320: Pg inconsistent make ceph osd down added

#7 Updated by Christian Theune over 6 years ago

I'd like to revisit this. Why is this not a bug? (I'm on Hammer 0.94.7.)

We experienced this previously and have just hit it again on a customer system, where a filesystem inconsistency leads to crashing OSDs, yet this is marked as not a bug. I checked the current code on master and the behaviour has changed there (it also indicates that a repair would be needed, which Hammer likely wouldn't support anyway).

#8 Updated by Nathan Cutler over 6 years ago

Hammer is EOL (End Of Life). Almost certainly, that means there will be no more Hammer point releases.

Please consider upgrading to Jewel. Once you are on Jewel, you have the option of upgrading further to Luminous.
