Bug #13499
FAILED assert(repop_queue.front() == repop)
Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Regression: No
Severity: 3 - minor
Description
Hi,
We are on Firefly 0.80.10-1~bpo70+1, and one of our OSDs crashed with the following trace:
2015-10-12 10:00:26.023105 7fb359cab700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fb359cab700 time 2015-10-12 10:00:25.988950
osd/ReplicatedPG.cc: 6742: FAILED assert(repop_queue.front() == repop)
 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xdd8) [0x8fbb48]
 2: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xca) [0x8fbeaa]
 3: (Context::complete(int)+0x9) [0x790379]
 4: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x1de) [0xa3a8ae]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2b6) [0xa3af56]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a5) [0x8e7025]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x740ea6]
 8: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1ea) [0x75faaa]
 9: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x79c78e]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8469a]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb858f0]
 12: (()+0x6b50) [0x7fb387ad0b50]
 13: (clone()+0x6d) [0x7fb3866f495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
It then dumps all recent events, followed by a second trace:
2015-10-12 10:00:26.149751 7fb359cab700 -1 *** Caught signal (Aborted) **
 in thread 7fb359cab700
 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7fb387ad90a0]
 3: (gsignal()+0x35) [0x7fb38664b165]
 4: (abort()+0x180) [0x7fb38664e3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb386ea189d]
 6: (()+0x63996) [0x7fb386e9f996]
 7: (()+0x639c3) [0x7fb386e9f9c3]
 8: (()+0x63bee) [0x7fb386e9fbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb928ea]
 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xdd8) [0x8fbb48]
 11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xca) [0x8fbeaa]
 12: (Context::complete(int)+0x9) [0x790379]
 13: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x1de) [0xa3a8ae]
 14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2b6) [0xa3af56]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a5) [0x8e7025]
 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x740ea6]
 17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1ea) [0x75faaa]
 18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x79c78e]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8469a]
 20: (ThreadPool::WorkThread::entry()+0x10) [0xb858f0]
 21: (()+0x6b50) [0x7fb387ad0b50]
 22: (clone()+0x6d) [0x7fb3866f495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
We restarted the OSD, and everything is stable again. I searched the forums a little, and it seems that these crashes used to happen in very old Ceph versions.
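For readers unfamiliar with the failing check: judging from the assertion text and the call path in the traces (`repop_all_committed` calling `eval_repop`), each PG appears to keep its in-flight replicated writes (`RepGather` objects) in `repop_queue` in submission order, and a write that has finished is expected to be the oldest one still queued. The following is a minimal sketch of that ordering invariant under this assumption only. `RepopQueueModel`, `submit`, `front_matches` and `retire` are hypothetical names, and the real Ceph code tracks much more state than this.

```cpp
#include <cassert>
#include <deque>
#include <memory>

// Hypothetical, heavily simplified stand-in for ReplicatedPG::RepGather.
struct RepGather {
  unsigned long tid = 0;  // illustrative transaction id only
};

// Hypothetical model of one PG's in-flight replicated writes, kept in
// submission order (oldest at the front).
struct RepopQueueModel {
  std::deque<std::shared_ptr<RepGather>> repop_queue;

  // A newly submitted replicated write is appended at the back.
  void submit(const std::shared_ptr<RepGather>& repop) {
    repop_queue.push_back(repop);
  }

  // True if `repop` is the oldest in-flight write, i.e. retiring it now
  // would respect the ordering invariant.
  bool front_matches(const std::shared_ptr<RepGather>& repop) const {
    return !repop_queue.empty() && repop_queue.front() == repop;
  }

  // Retire a fully committed write. Mirrors the failing check: a write
  // that completes out of order aborts here (in builds without NDEBUG).
  void retire(const std::shared_ptr<RepGather>& repop) {
    assert(front_matches(repop));
    repop_queue.pop_front();
  }
};
```

In this model, finishing the second write before the first makes `front_matches` return false; calling `retire` at that point aborts, which corresponds to the crash reported here.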
History
#1 Updated by Ilya Dryomov about 7 years ago
- Priority changed from Normal to High
- Release deleted (firefly)
- Release set to jewel
Here is one on current jewel:
2017-02-26 22:11:22.246150 7f9c4a228700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7f9c4a228700 time 2017-02-26 22:11:22.243398
osd/ReplicatedPG.cc: 8440: FAILED assert(repop_queue.front() == repop)
 ceph version 10.2.5-6111-gac3ba2a (ac3ba2adcd21ac011ad556ac4506623e61fbe696)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x563718fff3c5]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xf54) [0x563718abcf04]
 3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xcc) [0x563718abd1ac]
 4: (Context::complete(int)+0x9) [0x56371899fd19]
 5: (ReplicatedBackend::sub_op_modify_reply(std::shared_ptr<OpRequest>)+0x369) [0x563718b53b09]
 6: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x18b) [0x563718b650bb]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x563718abd970]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x563718971f6d]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x5637189721bd]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x563718976ce9]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x563718fef367]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563718ff12d0]
 13: (()+0x7dc5) [0x7f9c6a5d4dc5]
 14: (clone()+0x6d) [0x7f9c68c5f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
http://pulpito.ceph.com/teuthology-2017-02-26_10:15:02-krbd-jewel-testing-basic-smithi/860992
#2 Updated by Sage Weil almost 7 years ago
- Status changed from New to Can't reproduce
We don't see this on newer code.