Bug #23892

luminous->mimic: mon segv in ~MonOpRequest from OpHistoryServiceThread

Added by Sage Weil almost 6 years ago. Updated over 4 years ago.

Status: Can't reproduce
Priority: High
Category: -
Target version: -
% Done: 0%

Regression: No
Severity: 3 - minor

Description

     0> 2018-04-26 19:35:25.704 7faeeca14700 -1 *** Caught signal (Segmentation fault) **
 in thread 7faeeca14700 thread_name:OpHistorySvc

 ceph version 13.0.2-1854-g97635aa (97635aacc8013122f577e745aec3e30962e19b68) mimic (dev)
 1: (()+0x4b0cc0) [0x55f77ee6bcc0]
 2: (()+0x11390) [0x7faef4035390]
 3: (RefCountedObject::put() const+0xb8) [0x55f77ec65308]
 4: (MonOpRequest::~MonOpRequest()+0x32) [0x55f77ec65602]
 5: (std::_Rb_tree<std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::pair<double, boost::intrusive_ptr<TrackedOp> >, std::_Identity<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::less<std::pair<double, boost::intrusive_ptr<TrackedOp> > >, std::allocator<std::pair<double, boost::intrusive_ptr<TrackedOp> > > >::_M_erase_aux(std::_Rb_tree_const_iterator<std::pair<double, boost::intrusive_ptr<TrackedOp> > >)+0x83) [0x7faef474a083]
 6: (OpHistory::cleanup(utime_t)+0x2b4) [0x7faef4746d04]
 7: (OpHistory::_insert_delayed(utime_t const&, boost::intrusive_ptr<TrackedOp>)+0x20c) [0x7faef474768c]
 8: (OpHistoryServiceThread::entry()+0xe9) [0x7faef4747a69]
 9: (()+0x76ba) [0x7faef402b6ba]
 10: (clone()+0x6d) [0x7faef338941d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
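The crash site in frames 3-6 can be illustrated with a simplified, hypothetical sketch (not Ceph's actual classes): OpHistory keeps intrusive-refcounted ops in an ordered set keyed by time, and erasing an entry in cleanup() drops the set's reference; if it was the last one, the op's destructor runs via put() right there on the OpHistorySvc thread. The segfault means that final put() touched already-freed memory.

```cpp
#include <set>
#include <utility>

struct TrackedOp {
    int nref = 0;               // refcount (atomic in the real code)
    static int destroyed;       // demo only: count destructor runs
    ~TrackedOp() { ++destroyed; }
    void get() { ++nref; }
    void put() {                // analogue of frame 3, RefCountedObject::put()
        if (--nref == 0)
            delete this;        // runs the destructor (frame 4)
    }
};
int TrackedOp::destroyed = 0;

// Minimal stand-in for boost::intrusive_ptr<TrackedOp>.
struct OpRef {
    TrackedOp* p;
    explicit OpRef(TrackedOp* q) : p(q) { p->get(); }
    OpRef(const OpRef& o) : p(o.p) { p->get(); }
    OpRef& operator=(const OpRef&) = delete;
    ~OpRef() { p->put(); }
    bool operator<(const OpRef& o) const { return p < o.p; }
};

// Analogue of frames 5-6: erasing from the set (_Rb_tree::_M_erase_aux
// inside OpHistory::cleanup) drops the set's reference; if it was the
// last one, the destructor fires during the erase.
int cleanup_demo() {
    int before = TrackedOp::destroyed;
    {
        std::set<std::pair<double, OpRef>> arrived;
        arrived.insert({1.0, OpRef(new TrackedOp)});
        arrived.erase(arrived.begin());   // last ref dropped here
    }
    return TrackedOp::destroyed - before; // exactly one destruction
}
```

In the sketch the erase-triggered destruction is benign; the backtrace implies the real put() found the object (or something it touches) already freed.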

--- logging levels ---

gdb shows the op at (MonOpRequest)0x55f78083ba40 with request (Message)0x55f780ba1600, which corresponds to:

 -2359> 2018-04-26 19:35:20.148 7faee920d700  1 -- 172.21.15.110:6789/0 <== mon.2 172.21.15.94:6790/0 1947548406 ==== mon_probe(reply 3d39e9e8-3d17-457e-b675-3c058fc49b2f name c quorum 0,1,2 paxos( fc 1 lc 109 )) v6 ==== 427+0+0 (2026604521 0 0) 0x55f780ba1600 con 0x55f780caa300
 -2358> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 _ms_dispatch new session 0x55f780835180 MonSession(mon.2 172.21.15.94:6790/0 is open , features 0x1ffddff8eea4fffb (luminous)) features 0x1ffddff8eea4fffb
 -2357> 2018-04-26 19:35:20.148 7faee920d700  5 mon.a@1(probing) e1 _ms_dispatch setting monitor caps on this connection
 -2356> 2018-04-26 19:35:20.148 7faee920d700 20 mon.a@1(probing) e1  caps allow *
 -2355> 2018-04-26 19:35:20.148 7faee920d700 20 is_capable service=mon command= read on cap allow *
 -2354> 2018-04-26 19:35:20.148 7faee920d700 20  allow so far , doing grant allow *
 -2353> 2018-04-26 19:35:20.148 7faee920d700 20  allow all
 -2352> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 handle_probe mon_probe(reply 3d39e9e8-3d17-457e-b675-3c058fc49b2f name c quorum 0,1,2 paxos( fc 1 lc 109 )) v6
 -2351> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 handle_probe_reply mon.2 172.21.15.94:6790/0mon_probe(reply 3d39e9e8-3d17-457e-b675-3c058fc49b2f name c quorum 0,1,2 paxos( fc 1 lc 109 )) v6
 -2350> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1  monmap is e1: 3 mons at {a=172.21.15.110:6789/0,b=172.21.15.94:6789/0,c=172.21.15.94:6790/0}
 -2349> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1  peer name is c
 -2348> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1  existing quorum 0,1,2
 -2347> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1  peer paxos version 109 vs my version 109 (ok)
 -2346> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 start_election
 -2345> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 _reset
 -2344> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 cancel_probe_timeout 0x55f7810feb20
 -2343> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 timecheck_finish
 -2342> 2018-04-26 19:35:20.148 7faee920d700 15 mon.a@1(probing) e1 health_tick_stop
 -2341> 2018-04-26 19:35:20.148 7faee920d700 15 mon.a@1(probing) e1 health_interval_stop
 -2340> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 scrub_event_cancel
 -2339> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing) e1 scrub_reset
 -2338> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxos(paxos recovering c 1..109) restart -- canceling timeouts
 -2337> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(mdsmap 1..7) restart
 -2336> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(osdmap 1..17) restart
 -2335> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(logm 1..28) restart
 -2334> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(monmap 1..1) restart
 -2333> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(auth 1..8) restart
 -2332> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(mgr 1..4) restart
 -2331> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(mgrstat 1..52) restart
 -2330> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(health 1..1) restart
 -2329> 2018-04-26 19:35:20.148 7faee920d700 10 mon.a@1(probing).paxosservice(config 0..0) restart
 -2328> 2018-04-26 19:35:20.148 7faee920d700  0 log_channel(cluster) log [INF] : mon.a calling monitor election
...
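The log shows the matching message being handled on the dispatch thread (7faee920d700) while the crash is on OpHistorySvc (7faeeca14700), i.e. two threads dropping references to the same op. A small hypothetical illustration (again, not Ceph's code) of why the refcount must be atomic and why the last put() must run exactly once: with std::atomic the object is deleted exactly once regardless of interleaving, whereas with a plain int two threads can both observe the count hit zero and double-delete, leaving a stale pointer for a later destructor to dereference.

```cpp
#include <atomic>
#include <thread>

struct Ref {
    std::atomic<int> nref{1};           // one initial reference
    static std::atomic<int> deleted;    // demo only: count deletions
    ~Ref() { deleted.fetch_add(1); }
    void get() { nref.fetch_add(1, std::memory_order_relaxed); }
    void put() {
        // acq_rel ensures the deleting thread sees all prior writes
        if (nref.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;
    }
};
std::atomic<int> Ref::deleted{0};

int concurrent_putters() {
    Ref* r = new Ref;
    for (int i = 0; i < 8; ++i) r->get();   // eight extra refs
    // two threads racing to drop references, as dispatch and
    // OpHistorySvc do with a shared op
    std::thread a([r] { for (int i = 0; i < 4; ++i) r->put(); });
    std::thread b([r] { for (int i = 0; i < 4; ++i) r->put(); });
    a.join();
    b.join();
    r->put();                   // drop the initial ref: the last put deletes
    return Ref::deleted.load(); // exactly one deletion, never two
}
```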

/a/sage-2018-04-26_19:17:26-upgrade:luminous-x-wip-sage-testing-2018-04-26-1251-distro-basic-smithi/2442438


Related issues

Related to RADOS - Bug #24037: osd: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed. Resolved 05/07/2018

History

#1 Updated by Radoslaw Zarzynski almost 6 years ago

  • Assignee set to Radoslaw Zarzynski

#2 Updated by Sage Weil almost 6 years ago

  • Priority changed from Urgent to High

#3 Updated by Radoslaw Zarzynski almost 6 years ago

  • Related to Bug #24037: osd: Assertion `!node_algorithms::inited(this->priv_value_traits().to_node_ptr(value))' failed. added

#4 Updated by Greg Farnum over 4 years ago

  • Status changed from 12 to Can't reproduce

I believe we've made some fixes to OpHistory since April last year...
