Bug #13499

FAILED assert(repop_queue.front() == repop)

Added by Kostis Fardelas over 8 years ago. Updated almost 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,
We are running Firefly 0.80.10-1~bpo70+1, and one of our OSDs crashed with the following trace:

2015-10-12 10:00:26.023105 7fb359cab700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fb359cab700 time 2015-10-12 10:00:25.988950
osd/ReplicatedPG.cc: 6742: FAILED assert(repop_queue.front() == repop)

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xdd8) [0x8fbb48]
 2: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xca) [0x8fbeaa]
 3: (Context::complete(int)+0x9) [0x790379]
 4: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x1de) [0xa3a8ae]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2b6) [0xa3af56]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a5) [0x8e7025]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x740ea6]
 8: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1ea) [0x75faaa]
 9: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x79c78e]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8469a]
 11: (ThreadPool::WorkThread::entry()+0x10) [0xb858f0]
 12: (()+0x6b50) [0x7fb387ad0b50]
 13: (clone()+0x6d) [0x7fb3866f495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The OSD then dumps all the recent events and logs the failure again:

2015-10-12 10:00:26.149751 7fb359cab700 -1 *** Caught signal (Aborted) **
 in thread 7fb359cab700

 ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70)
 1: /usr/bin/ceph-osd() [0xab7562]
 2: (()+0xf0a0) [0x7fb387ad90a0]
 3: (gsignal()+0x35) [0x7fb38664b165]
 4: (abort()+0x180) [0x7fb38664e3e0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fb386ea189d]
 6: (()+0x63996) [0x7fb386e9f996]
 7: (()+0x639c3) [0x7fb386e9f9c3]
 8: (()+0x63bee) [0x7fb386e9fbee]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb928ea]
 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xdd8) [0x8fbb48]
 11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xca) [0x8fbeaa]
 12: (Context::complete(int)+0x9) [0x790379]
 13: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x1de) [0xa3a8ae]
 14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x2b6) [0xa3af56]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1a5) [0x8e7025]
 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x336) [0x740ea6]
 17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1ea) [0x75faaa]
 18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x79c78e]
 19: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb8469a]
 20: (ThreadPool::WorkThread::entry()+0x10) [0xb858f0]
 21: (()+0x6b50) [0x7fb387ad0b50]
 22: (clone()+0x6d) [0x7fb3866f495d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

We started the OSD and everything is stable again. I searched the forums briefly, and it seems that these crashes used to happen in very old Ceph versions.
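
For readers unfamiliar with the assertion: going only by its text, `repop_queue.front() == repop`, the repop being evaluated is expected to be the one at the front of `repop_queue`. The sketch below illustrates that kind of in-order invariant in isolation. It is not Ceph's code: `Repop`, `InOrderQueue`, and `try_retire` are hypothetical names made up for illustration, and the sketch returns `false` in the situation where the real assert would abort the process.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a replicated operation (NOT Ceph's RepGather).
struct Repop {
    int id;
};

// Hypothetical queue that only allows operations to be retired in the
// order they were enqueued (NOT Ceph's repop_queue implementation).
class InOrderQueue {
public:
    void enqueue(Repop* r) { queue_.push_back(r); }

    // Retire `r` only if it is at the head of the queue.
    // Returns false in the case where an assert of the form
    // `queue.front() == r` would fail; the queue is left unchanged.
    bool try_retire(Repop* r) {
        if (queue_.empty() || queue_.front() != r) {
            return false;
        }
        queue_.pop_front();
        return true;
    }

    std::size_t size() const { return queue_.size(); }

private:
    std::deque<Repop*> queue_;
};
```

With two operations `a` and `b` enqueued in that order, `try_retire(&b)` fails while `a` is still pending, and succeeds once `a` has been retired. Whether the crashes reported here come from out-of-order completion or from some other cause is not established in this ticket.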

History

#1 Updated by Ilya Dryomov about 7 years ago

  • Priority changed from Normal to High
  • Release deleted (firefly)
  • Release set to jewel

Here is one on current jewel:

2017-02-26 22:11:22.246150 7f9c4a228700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7f9c4a228700 time 2017-02-26 22:11:22.243398
osd/ReplicatedPG.cc: 8440: FAILED assert(repop_queue.front() == repop)

 ceph version 10.2.5-6111-gac3ba2a (ac3ba2adcd21ac011ad556ac4506623e61fbe696)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x563718fff3c5]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xf54) [0x563718abcf04]
 3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xcc) [0x563718abd1ac]
 4: (Context::complete(int)+0x9) [0x56371899fd19]
 5: (ReplicatedBackend::sub_op_modify_reply(std::shared_ptr<OpRequest>)+0x369) [0x563718b53b09]
 6: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x18b) [0x563718b650bb]
 7: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x563718abd970]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x563718971f6d]
 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x5637189721bd]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x563718976ce9]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x563718fef367]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x563718ff12d0]
 13: (()+0x7dc5) [0x7f9c6a5d4dc5]
 14: (clone()+0x6d) [0x7f9c68c5f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://pulpito.ceph.com/teuthology-2017-02-26_10:15:02-krbd-jewel-testing-basic-smithi/860992

#2 Updated by Sage Weil almost 7 years ago

  • Status changed from New to Can't reproduce

We don't see this on newer code.
