Bug #22144

*** Caught signal (Aborted) ** in thread thread_name:tp_peering

Added by Ashley Merrick over 6 years ago. Updated over 5 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Receiving the above error across multiple OSDs running either BlueStore or FileStore.

This causes the OSD to go into a continuous loop and has brought down the cluster. It has been experienced by at least one other user on the [ML].

-9> 2017-11-15 17:37:14.696229 7fa4ec50f700  1 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] state<Start>: transitioning to Stray
-8> 2017-11-15 17:37:14.696239 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] exit Start 0.000019 0 0.000000
-7> 2017-11-15 17:37:14.696250 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f9s1( v 161563'158209 lc 161175'158153 (150659'148187,161563'158209] local-lis/les=161519/161521 n=47572 ec=31534/31534 lis/c 161519/152474 les/c/f 161521/152523/159786 161517/161519/161519) [34,37,13,12,66,69,118,120,28,20,88,0,2]/[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563 pi=[152474,161519)/1 crt=161562'158208 lcod 0'0 unknown NOTIFY m=21] enter Started/Stray
-6> 2017-11-15 17:37:14.696324 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Reset 3.363755 2 0.000076
-5> 2017-11-15 17:37:14.696337 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started
-4> 2017-11-15 17:37:14.696346 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Start
-3> 2017-11-15 17:37:14.696353 7fa4ec50f700 1 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] state<Start>: transitioning to Stray
-2> 2017-11-15 17:37:14.696364 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] exit Start 0.000018 0 0.000000
-1> 2017-11-15 17:37:14.696372 7fa4ec50f700 5 osd.37 pg_epoch: 161571 pg[6.2f2s10( v 161570'157712 lc 161175'157648 (160455'154564,161570'157712] local-lis/les=161517/161519 n=47328 ec=31534/31534 lis/c 161517/160962 les/c/f 161519/160963/159786 161517/161517/108939) [96,100,79,4,69,65,57,59,135,134,37,35,18] r=10 lpr=161570 pi=[160962,161517)/2 crt=161560'157711 lcod 0'0 unknown NOTIFY m=5] enter Started/Stray
0> 2017-11-15 17:37:14.697245 7fa4ebd0e700 -1 *** Caught signal (Aborted) **
in thread 7fa4ebd0e700 thread_name:tp_peering

ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (()+0xa3acdc) [0x55dfb6ba3cdc]
2: (()+0xf890) [0x7fa510e2c890]
3: (gsignal()+0x37) [0x7fa50fe66067]
4: (abort()+0x148) [0x7fa50fe67448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x55dfb6be6f5f]
6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55dfb670f8a3]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]
8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x244) [0x55dfb67552a4]
9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x55dfb6732c1b]
10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) [0x55dfb6702ef3]
11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55dfb664db2a]
12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x175) [0x55dfb664e6b5]
13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x55dfb66ae5a7]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55dfb6bedb1f]
15: (ThreadPool::WorkThread::entry()+0x10) [0x55dfb6beea50]
16: (()+0x8064) [0x7fa510e25064]
17: (clone()+0x6d) [0x7fa50ff1962d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I have tried logging levels 10-20, but nothing extra is logged around the error. I also tried the latest master build from Git and hit the same crash. It all started after needing to replace an OSD, and the user on the [ML] reports the same experience.
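For anyone comparing these dumps, here is a minimal sketch (not part of Ceph) of pulling the up and acting sets out of a `pg[...]` log line. It assumes the layout seen in the lines above: the up set is printed after the closing `)` of the PG info, optionally followed by `/` and the acting set, then `r=<role>`; when only one set is printed, the sketch assumes acting equals up. The value 2147483647 (0x7fffffff) is Ceph's CRUSH_ITEM_NONE, meaning no OSD is mapped to that shard. The names `parse_sets` and `unmapped_shards` are hypothetical helpers.

```python
import re

# 0x7fffffff: Ceph prints this value when no OSD is mapped to a shard.
CRUSH_ITEM_NONE = 2147483647

# Assumed layout: ") [up]/[acting] r=N" or ") [up] r=N".
_SETS = re.compile(
    r"\)\s+\[(\d+(?:,\d+)*)\](?:/\[(\d+(?:,\d+)*)\])?\s+r=(-?\d+)"
)

def parse_sets(line):
    """Return (up, acting, role) from a pg[...] log line, or None if absent."""
    m = _SETS.search(line)
    if m is None:
        return None
    up = [int(x) for x in m.group(1).split(",")]
    # Assumption: if no acting set is printed, it is the same as the up set.
    acting = [int(x) for x in m.group(2).split(",")] if m.group(2) else list(up)
    return up, acting, int(m.group(3))

def unmapped_shards(osds):
    """Return the shard indexes that have no OSD mapped."""
    return [i for i, osd in enumerate(osds) if osd == CRUSH_ITEM_NONE]

# Abbreviated from the osd.37 log line for pg 6.2f9s1 above.
sample = (
    "161517/161519/161519) "
    "[34,37,13,12,66,69,118,120,28,20,88,0,2]/"
    "[34,37,13,12,66,69,118,120,28,20,53,54,2147483647] r=1 lpr=161563"
)
up, acting, role = parse_sets(sample)
print(unmapped_shards(acting))  # [12]
```

In this sample the last acting shard (index 12) has no OSD mapped, while the up set is fully populated.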

History

#1 Updated by Ashley Merrick over 6 years ago

    -2> 2017-11-17 04:20:35.364756 7f2564933700 10 osd.38 pg_epoch: 205167 pg[6.2a5s12( v 161559'158785 lc 161175'158737 (150666'149218,161559'158785] local-lis/les=205155/205158 n=47245 ec=31534/31534 lis/c 205155/157392 les/c/f 205158/157393/159786 205149/205155/180185) [125,114,136,2147483647,29,21,7,78,59,2147483647,92,96,38]/[125,114,136,2147483647,29,21,7,2147483647,59,2147483647,92,96,38] r=12 lpr=205158 pi=[157392,205155)/6 crt=161559'158784 lcod 0'0 unknown NOTIFY m=2] check_recovery_sources no source osds () went down
    -1> 2017-11-17 04:20:35.364770 7f2564933700 10 osd.38 pg_epoch: 205167 pg[6.2a5s12( v 161559'158785 lc 161175'158737 (150666'149218,161559'158785] local-lis/les=205155/205158 n=47245 ec=31534/31534 lis/c 205155/157392 les/c/f 205158/157393/159786 205149/205155/180185) [125,114,136,2147483647,29,21,7,78,59,2147483647,92,96,38]/[125,114,136,2147483647,29,21,7,2147483647,59,2147483647,92,96,38] r=12 lpr=205158 pi=[157392,205155)/6 crt=161559'158784 lcod 0'0 unknown NOTIFY m=2] handle_activate_map
     0> 2017-11-17 04:20:35.364764 7f2565134700 -1 *** Caught signal (Aborted) **
 in thread 7f2565134700 thread_name:tp_peering

 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
 1: (()+0xa3acdc) [0x5561716c3cdc]
 2: (()+0xf890) [0x7f25890e4890]
 3: (gsignal()+0x37) [0x7f258811e067]
 4: (abort()+0x148) [0x7f258811f448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27f) [0x556171706f5f]
 6: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x14e3) [0x55617122f8a3]
 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55617122ff39]
 8: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x244) [0x5561712752a4]
 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x556171252c1b]
 10: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x3e3) [0x556171222ef3]
 11: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x20a) [0x55617116db2a]
 12: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x175) [0x55617116e6b5]
 13: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x5561711ce5a7]
 14: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa8f) [0x55617170db1f]
 15: (ThreadPool::WorkThread::entry()+0x10) [0x55617170ea50]
 16: (()+0x8064) [0x7f25890dd064]
 17: (clone()+0x6d) [0x7f25881d162d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
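Both backtraces come from the same 12.2.1 build and appear to fail at the same offsets (for example `PG::start_peering_interval(...)+0x14e3`), differing only in load addresses. A minimal sketch of checking that mechanically, assuming the frame layout `N: (symbol+0xOFFSET) [0xADDRESS]` seen in these dumps; `frame_signature` is a hypothetical helper, not a Ceph tool:

```python
import re

# Assumed frame layout: "N: (symbol+0xOFFSET) [0xADDRESS]".
_FRAME = re.compile(r"^\s*(\d+): \((.*)\+(0x[0-9a-f]+)\) \[0x[0-9a-f]+\]\s*$")

def frame_signature(dump):
    """Return (frame number, symbol, offset) tuples, ignoring load addresses."""
    sig = []
    for line in dump.splitlines():
        m = _FRAME.match(line)
        if m:
            sig.append((int(m.group(1)), m.group(2), m.group(3)))
    return sig

# Two frames taken from each of the dumps in this report.
first = """\
3: (gsignal()+0x37) [0x7fa50fe66067]
7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55dfb670ff39]
"""
second = """\
 3: (gsignal()+0x37) [0x7f258811e067]
 7: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x539) [0x55617122ff39]
"""
print(frame_signature(first) == frame_signature(second))  # True
```

Matching signatures only suggest the same code path was hit; the failing assert itself still has to be read from the assert message or the symbolized binary.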

#2 Updated by Ashley Merrick over 6 years ago

I have also tried starting an OSD with noup set, as suggested by a user on the ML.

However, the OSD still fails on the same assert, although the cluster does not see the OSD go up.

#3 Updated by Greg Farnum over 6 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
  • Status changed from New to Can't reproduce
  • Component(RADOS) OSD added

This was discussed on the mailing list thread "[ceph-users] OSD Random Failures - Latest Luminous" and ended without getting the info required to diagnose. :(

#4 Updated by Rams C over 5 years ago

We can confirm we are experiencing the same issue on version 12.2.7. Some OSDs went offline at random and won't come up. Several PGs are down and some are inactive.
