Bug #22144

closed

*** Caught signal (Aborted) ** in thread thread_name:tp_peering

Added by Ashley Merrick over 6 years ago. Updated over 5 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The error above is occurring on multiple OSDs, running either Bluestore or Filestore.

It sends the affected OSD into a continuous loop and has brought down the cluster. At least one other user on the [ML] has experienced the same problem.

-9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] state<Start>: transitioning to Stray
-8> 2017-11-15 17:37:14.696239 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] exit Start 0.000019 0 0.000000
-7> 2017-11-15 17:37:14.696250 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] enter Started/Stray
-6> 2017-11-15 17:37:14.696324 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.000076
-5> 2017-11-15 17:37:14.696337 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started
-4> 2017-11-15 17:37:14.696346 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start
-3> 2017-11-15 17:37:14.696353 7fa4ec50f700 1 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state<Start>: transitioning to Stray
-2> 2017-11-15 17:37:14.696364 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Start 0.000018 0 0.000000
-1> 2017-11-15 17:37:14.696372 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started/Stray
0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal (Aborted) **
 in thread 7fa4ebd0e700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa3acdc) [0x55dfb6ba3cdc]
2: (()+0xf890) [0x7fa510e2c890]
3: (gsignal()+0x37) [0x7fa50fe66067]
4: (abort()+0x148) [0x7fa50fe67448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x55dfb6be6f5f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x244) [0x55dfb67552a4]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x55dfb6732c1b]
10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) [0x55dfb6702ef3]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55dfb664db2a]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]
13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]
15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]
16: (()+0x8064) [0x7fa510e25064]
17: (clone()+0x6d) [0x7fa50ff1962d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
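
For triage, the frames above can be pulled apart mechanically. The following is only a sketch and not part of the original report: it assumes the `N: (symbol+0xoffset) [0xaddress]` frame layout seen in this trace, and the helper names `parse_frames` and `asserting_frame` are made up for illustration. It picks out the frame that called `ceph::__ceph_assert_fail`, which in this trace is `PG::start_peering_interval`.

```python
import re

# Assumed frame layout, taken from the trace in this report:
#   "6: (PG::start_peering_interval(...)+0x14e3) [0x55dfb670f8a3]"
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+(0x[0-9a-f]+)\) \[(0x[0-9a-f]+)\]\s*$")


def parse_frames(text):
    """Return one dict per backtrace frame found in `text` (hypothetical helper)."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # not a frame line (log entry, version banner, NOTE, ...)
        index, symbol, offset, address = m.groups()
        frames.append({
            "index": int(index),
            # Unresolved frames are printed as "()" in the trace
            "symbol": None if symbol == "()" else symbol,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


def asserting_frame(frames):
    """Return the frame that called __ceph_assert_fail, or None (hypothetical helper)."""
    for callee, caller in zip(frames, frames[1:]):
        if callee["symbol"] and "__ceph_assert_fail" in callee["symbol"]:
            return caller
    return None


# Frames copied verbatim from the trace above
SAMPLE = """\
 1: (()+0xa3acdc) [0x55dfb6ba3cdc]
 4: (abort()+0x148) [0x7fa50fe67448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x55dfb6be6f5f]
 6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]
"""

if __name__ == "__main__":
    frame = asserting_frame(parse_frames(SAMPLE))
    # Print only the function name, without its argument list
    print(frame["index"], frame["symbol"].split("(")[0], hex(frame["offset"]))
```

The symbol and offset identify where the assertion fired, but mapping the offset to a source line still needs the matching 12.2.1 binary or its debug symbols, as the NOTE says.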

I have tried logging levels 10-20, but nothing extra is logged around the error. I also tried the latest master build from Git and saw the same crash. This all started after an OSD needed to be replaced, and the user on the [ML] reports the same experience.
