Bug #14428

Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0))

Added by Michael Hackett about 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
hammer
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Cluster is running 0.94.5 (Hammer). Cache tiering is in use, backed by a replicated pool. The cluster comprises six nodes: three nodes have 12 spinning disks each, and three nodes have 9 SSDs each. This issue began when it was discovered that, due to a misconfiguration, the cache tier wasn't flushing or evicting objects to the cold tier, which left the cluster with full OSDs.

ceph pg set_full_ratio was set to 0.98 to allow the cluster to continue to service I/O and to attempt to flush the cache tier into free space in the cold tier. However, OSDs kept dropping, and backfill_toofull was seen on several PGs because OSDs in the cache tier were over 85% full.
Options were set to increase the backfill_toofull ratios.
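
The exact commands used are not captured above; on Hammer the sequence would typically look something like the following. The 0.98 full ratio is the value from this report, while osd_backfill_full_ratio and the 0.92 value are only an illustrative guess at which "backfill_toofull ratio" option was raised:

  # raise the cluster-wide full threshold so I/O can resume (value from this report)
  ceph pg set_full_ratio 0.98
  # raise the per-OSD backfill-full threshold above the 0.85 default (0.92 is illustrative only)
  ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.92'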

OSDs all appeared to be dropping due to this assert (logs from osd.44 and osd.62 are uploaded):

0> 2016-01-19 19:28:36.989371 7f0092cd5700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f0092cd5700 time 2016-01-19 19:28:36.979961
./include/interval_set.h: 386: FAILED assert(_size >= 0)
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9d85]
2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81e120]
3: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >, PG::RecoveryCtx)+0x703) [0x7f2f13]
4: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x3ff) [0x7f599f]
5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x828e88]
6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x15a) [0x82933a]
7: (boost::statechart::simple_state<PG::RecoveryState::WaitUpThru, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd0) [0x826740]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x81156b]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x811704]
10: (PG::handle_activate_map(PG::RecoveryCtx*)+0x134) [0x7bdd14]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >)+0x735) [0x6a5a05]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a60cc]
13: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x7015d8]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbba366]
15: (ThreadPool::WorkThread::entry()+0x10) [0xbbb3f0]
16: (()+0x7df5) [0x7f00b04addf5]
17: (clone()+0x6d) [0x7f00aef901ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
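
For illustration only, a minimal standalone sketch of the failing pattern (this is NOT the Ceph interval_set implementation): erase() decrements a running size by the erased length and asserts it never goes negative, so subtract()ing a set that is not actually a subset of this one (here, the unexpected cached_removed_snaps) trips the same kind of assert seen above.

#include <cassert>
#include <cstdint>
#include <map>

struct SimpleIntervalSet {
  std::map<uint64_t, uint64_t> m;  // interval start -> length
  int64_t size = 0;                // running total of covered snaps

  void insert(uint64_t start, uint64_t len) {
    m[start] = len;
    size += static_cast<int64_t>(len);
  }

  // erase() blindly subtracts 'len' from the running size and asserts it
  // stays non-negative, mirroring the assert in interval_set.h:386.
  void erase(uint64_t start, uint64_t len) {
    auto it = m.find(start);
    if (it != m.end() && it->second >= len) {
      it->second -= len;
      if (it->second == 0)
        m.erase(it);
    }
    size -= static_cast<int64_t>(len);
    assert(size >= 0);  // analogue of FAILED assert(_size >= 0)
  }

  void subtract(const SimpleIntervalSet &other) {
    for (const auto &p : other.m)
      erase(p.first, p.second);
  }
};

int main() {
  SimpleIntervalSet removed;   // what this PG believes has been removed
  removed.insert(1, 2);        // snaps [1,2]

  SimpleIntervalSet cached;    // stale/unexpected cached_removed_snaps
  cached.insert(1, 5);         // snaps [1,5] -- not a subset of 'removed'

  removed.subtract(cached);    // running size goes negative -> assert fires
  return 0;
}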

Currently only 3 out of 27 OSDs in the cache tier are up...

I believe one of these may have been created with --force-nonempty.

Possibly hitting: http://tracker.ceph.com/issues/11493

output.tar.gz - 'ceph -s' 'ceph health detail' 'ceph osd tree' 'ceph osd dump |grep pool' 'ceph pg dump |grep incomplete', all in txt files. (99.7 KB) Michael Hackett, 01/19/2016 08:15 PM


Related issues

Copied to Ceph - Backport #14554: hammer: Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0)) Resolved

Associated revisions

Revision aba6746b (diff)
Added by Alexey Sheplyakov about 8 years ago

PG::activate(): handle unexpected cached_removed_snaps more gracefully

PGPool::update(): ditto

Fixes: #14428
Backport: infernalis, hammer, firefly

Signed-off-by: Alexey Sheplyakov <>

Revision 3d844208 (diff)
Added by Alexey Sheplyakov about 8 years ago

PG::activate(): handle unexpected cached_removed_snaps more gracefully

PGPool::update(): ditto

Fixes: #14428
Backport: infernalis, hammer, firefly

Signed-off-by: Alexey Sheplyakov <>
(cherry picked from commit aba6746b850e9397ff25570f08d0af8847a7162c)
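
Based only on the commit subject above ("handle unexpected cached_removed_snaps more gracefully"), the fix presumably replaces the hard assert with a consistency check and a warning. A rough, self-contained sketch of that approach (not the actual patch; SnapSet, subset_of and apply_cached_removed_snaps are simplified stand-ins for interval_set<snapid_t> and the PG/PGPool code):

#include <cstdint>
#include <iostream>
#include <set>

using SnapSet = std::set<uint64_t>;  // simplified stand-in for interval_set<snapid_t>

static bool subset_of(const SnapSet &small, const SnapSet &big) {
  for (uint64_t s : small)
    if (!big.count(s))
      return false;
  return true;
}

// Hypothetical helper: subtract 'cached' from 'authoritative' only when it is
// safe; otherwise warn and rebuild from the authoritative copy instead of
// crashing the OSD on an assert.
static void apply_cached_removed_snaps(SnapSet &authoritative, SnapSet &cached) {
  if (subset_of(cached, authoritative)) {
    for (uint64_t s : cached)
      authoritative.erase(s);           // safe: every element exists
  } else {
    std::cerr << "warning: unexpected cached_removed_snaps, resetting from OSDMap\n";
    cached = authoritative;             // drop the stale cache instead of asserting
  }
}

int main() {
  SnapSet pool_removed = {1, 2};        // authoritative view from the OSDMap
  SnapSet cached = {1, 2, 3, 4, 5};     // stale/unexpected cached_removed_snaps
  apply_cached_removed_snaps(pool_removed, cached);
  return 0;
}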

History

#2 Updated by Nathan Cutler about 8 years ago

  • Tracker changed from Tasks to Bug
  • Project changed from Stable releases to Ceph

#4 Updated by Sage Weil about 8 years ago

  • Status changed from New to Pending Backport
  • Backport set to hammer

#5 Updated by Loïc Dachary about 8 years ago

  • Copied to Backport #14554: hammer: Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0)) added

#6 Updated by Loïc Dachary about 8 years ago

  • Status changed from Pending Backport to Resolved
