Bug #14428
Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0))
Description
The cluster is running Ceph 0.94.5 (Hammer) with a cache tier in front of a replicated pool. It consists of six nodes: three with 12 spinning disks each and three with 9 SSDs each. The issue began when it was discovered that, because of a misconfiguration, the cache tier was not flushing or evicting objects to the cold tier, which left the cluster with full OSDs.
ceph pg set_full_ratio was set to 0.98 so that the cluster could keep servicing I/O and try to flush the cache tier into free space in the cold tier. However, OSDs then started dropping, and backfill_toofull was seen on several PGs because OSDs in the cache tier were over 85% full.
Options were then set to raise the backfill_toofull ratios.
All of the OSDs appear to be going down on this assert (logs from osd.44 and osd.62 are uploaded):
0> 2016-01-19 19:28:36.989371 7f0092cd5700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7f0092cd5700 time 2016-01-19 19:28:36.979961
./include/interval_set.h: 386: FAILED assert(_size >= 0)
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9d85]
2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xc0) [0x81e120]
3: (PG::activate(ObjectStore::Transaction&, unsigned int, std::list<Context*, std::allocator<Context*> >&, std::map<int, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > >, std::less<int>, std::allocator<std::pair<int const, std::map<spg_t, pg_query_t, std::less<spg_t>, std::allocator<std::pair<spg_t const, pg_query_t> > > > > >&, std::map<int, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > >, std::less<int>, std::allocator<std::pair<int const, std::vector<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > >, std::allocator<std::pair<pg_notify_t, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > > > > > > > >, PG::RecoveryCtx)+0x703) [0x7f2f13]
4: (PG::RecoveryState::Active::Active(boost::statechart::state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::my_context)+0x3ff) [0x7f599f]
5: (boost::statechart::detail::safe_reaction_result boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::transit_impl<PG::RecoveryState::Active, PG::RecoveryState::RecoveryMachine, boost::statechart::detail::no_transition_function>(boost::statechart::detail::no_transition_function const&)+0xb8) [0x828e88]
6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x15a) [0x82933a]
7: (boost::statechart::simple_state<PG::RecoveryState::WaitUpThru, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xd0) [0x826740]
8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x81156b]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xd4) [0x811704]
10: (PG::handle_activate_map(PG::RecoveryCtx*)+0x134) [0x7bdd14]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >)+0x735) [0x6a5a05]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x22c) [0x6a60cc]
13: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x28) [0x7015d8]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbba366]
15: (ThreadPool::WorkThread::entry()+0x10) [0xbbb3f0]
16: (()+0x7df5) [0x7f00b04addf5]
17: (clone()+0x6d) [0x7f00aef901ad]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
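To make the failure mode in the trace easier to follow, here is a minimal sketch of how a signed size counter can go negative when more is subtracted from an interval set than it holds. This is not Ceph's interval_set. The Intervals type and both helper functions are hypothetical, and the sketch assumes (as the assert message suggests, but the report does not confirm) that the counter is decremented by the erased length on each erase.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model: each interval is (start, length), echoing erase(start, len).
using Intervals = std::vector<std::pair<std::uint64_t, std::int64_t>>;

// Sum of the lengths of all intervals in the set.
std::int64_t total_length(const Intervals& s) {
  std::int64_t n = 0;
  for (const auto& iv : s) {
    n += iv.second;
  }
  return n;
}

// Value the size counter would reach after erasing every interval in
// `removed` from `held`, ASSUMING the counter is decremented by each erased
// length. A negative result corresponds to the state in which an
// assert(_size >= 0) style check would fire.
std::int64_t size_after_subtract(const Intervals& held, const Intervals& removed) {
  return total_length(held) - total_length(removed);
}
```

For example, if the held set covers two snap ids starting at 1 and the set being subtracted covers three, the counter would end up below zero, which is the condition the assert rejects.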
Currently only 3 of the 27 OSDs in the cache tier are up.
I believe one of these may have been created with --force-nonempty
Possibly hitting: http://tracker.ceph.com/issues/11493
Associated revisions
PG::activate(): handle unexpected cached_removed_snaps more gracefully
PGPool::update(): ditto
Fixes: #14428
Backport: infernalis, hammer, firefly
Signed-off-by: Alexey Sheplyakov <asheplyakov@mirantis.com>
PG::activate(): handle unexpected cached_removed_snaps more gracefully
PGPool::update(): ditto
Fixes: #14428
Backport: infernalis, hammer, firefly
Signed-off-by: Alexey Sheplyakov <asheplyakov@mirantis.com>
(cherry picked from commit aba6746b850e9397ff25570f08d0af8847a7162c)
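As an illustration of what handling an unexpected set "more gracefully" can look like, the sketch below subtracts only the ids that both sets share and reports whether the input was consistent, instead of asserting. It is a simplified model using std::set, not the code from the commits above. SnapIds and subtract_known are hypothetical names, and the choice to subtract the intersection is an assumption about the general approach.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>

// Hypothetical stand-in for a set of snapshot ids.
using SnapIds = std::set<std::uint64_t>;

// Removes from `trimq` only the ids that are present in both `trimq` and
// `purged`. Returns true when `purged` was a subset of `trimq` (the expected
// case) and false when `purged` contained ids that `trimq` did not hold, so
// the caller can log a warning rather than crash.
bool subtract_known(SnapIds& trimq, const SnapIds& purged) {
  SnapIds common;
  std::set_intersection(trimq.begin(), trimq.end(),
                        purged.begin(), purged.end(),
                        std::inserter(common, common.begin()));
  const bool consistent = (common == purged);
  for (std::uint64_t id : common) {
    trimq.erase(id);
  }
  return consistent;
}
```

A caller would treat a false return as a signal to log the mismatch and continue, which keeps the OSD up while still surfacing the inconsistency.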
History
#2 Updated by Nathan Cutler about 8 years ago
- Tracker changed from Tasks to Bug
- Project changed from Stable releases to Ceph
#3 Updated by Alexey Sheplyakov about 8 years ago
#4 Updated by Sage Weil about 8 years ago
- Status changed from New to Pending Backport
- Backport set to hammer
#5 Updated by Loïc Dachary about 8 years ago
- Copied to Backport #14554: hammer: Unable to bring up OSD's after dealing with FULL cluster (OSD assert with /include/interval_set.h: 386: FAILED assert(_size >= 0)) added
#6 Updated by Loïc Dachary about 8 years ago
- Status changed from Pending Backport to Resolved