Feature #13837 (closed)

Make snap_trimming robust to inconsistent snapshots

Added by Paul Emmerich over 8 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Hi,

We upgraded to 9.2 and we are still encountering a crash that looks a lot like issue #12665.
It looks like the patch didn't make it into the 9.2 release, so we installed version 9.2.0-859-gb2f2e6c (b2f2e6c84e89e3b5f02b5a97e8659d9338f9f772) from http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/master/dists/trusty/main/, which should include the fix mentioned in #12665.

However, one of our OSDs is still crashing constantly:

 ceph version 9.2.0-859-gb2f2e6c (b2f2e6c84e89e3b5f02b5a97e8659d9338f9f772)
 1: (()+0x80beba) [0x7f19284efeba]
 2: (()+0x10340) [0x7f1926bc8340]
 3: (ReplicatedPG::trim_object(hobject_t const&)+0x45d) [0x7f19281676fd]
 4: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x424) [0x7f192819f934]
 5: (boost::statechart::simple_state<ReplicatedPG::TrimmingObjects, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb4) [0x7f19281ce894]
 6: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x12b) [0x7f19281baccb]
 7: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x7f19281bae94]
 8: (ReplicatedPG::snap_trimmer(unsigned int)+0x454) [0x7f192813e004]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x893) [0x7f192804b133]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x85f) [0x7f19285d293f]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f19285d4840]
 12: (()+0x8182) [0x7f1926bc0182]
 13: (clone()+0x6d) [0x7f1924f0747d]

A full log is available here: https://dl.dropboxusercontent.com/u/24773939/ceph-master.log.gz
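
For what it's worth, my understanding of what "robust to inconsistent snapshots" would mean here (just my own sketch, not actual Ceph code, and every type and function name below is made up) is that a clone listed in the snapset but missing on disk should be logged and skipped, rather than hitting an assert that takes the whole OSD down:

 // Self-contained sketch of the "robust trimming" idea: tolerate a clone
 // that the snapset metadata lists but that is missing on disk, instead of
 // treating the inconsistency as fatal. Not Ceph code; every name is made up.
 #include <cstdint>
 #include <iostream>
 #include <optional>
 #include <set>
 #include <vector>

 using snapid_t = std::uint64_t;

 struct SnapSet {
   std::vector<snapid_t> clones;  // clones the metadata claims exist
 };

 struct FakeStore {
   std::set<snapid_t> on_disk;    // clones that actually exist

   std::optional<snapid_t> lookup(snapid_t clone) const {
     if (on_disk.count(clone))
       return clone;
     return std::nullopt;         // inconsistent: listed but not found
   }
 };

 // Trim every clone listed in the snapset, skipping (and logging) clones
 // that cannot be found rather than asserting on the inconsistency.
 int trim_snapset(const SnapSet& ss, FakeStore& store) {
   int trimmed = 0;
   for (snapid_t clone : ss.clones) {
     if (!store.lookup(clone)) {
       std::cerr << "warning: clone " << clone
                 << " listed in snapset but missing on disk, skipping\n";
       continue;                  // keep trimming the remaining clones
     }
     store.on_disk.erase(clone);  // stand-in for the real removal transaction
     ++trimmed;
   }
   return trimmed;
 }

 int main() {
   SnapSet ss{{1, 2, 3}};
   FakeStore store{{1, 3}};       // clone 2 is missing -> inconsistent snapshot
   std::cout << "trimmed " << trim_snapset(ss, store) << " of "
             << ss.clones.size() << " listed clones\n";
   return 0;
 }

That would at least keep the rest of the PG trimmable even if one object's snapshot metadata is broken.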

I can't reproduce the crash with debug 20/20 because the OSD doesn't seem to do anything at all at that debug level: https://dl.dropboxusercontent.com/u/24773939/ceph-master-debug20.log.gz

There are about 1k to 5k IOPS on the cluster, yet the log doesn't show any activity and the bug doesn't trigger. netstat also shows no client connections. This is reproducible: I restarted the OSD multiple times with debug 20/20 or 15/15, and it just doesn't do anything.

(Logs attached as links because Redmine gives a "Request Entity too large" error.)

Infernalis also breaks the workaround of setting disk threads = 0 mentioned in the other bug report, so one of our OSDs is now essentially dead. Taking it out doesn't help either, as the bug then migrates to other OSDs in our experience...

Any ideas?

Paul
