Bug #8346
OSD crashes on master (FAILED assert(ip_op.waiting_for_commit.count(from)))
Description
Revision bfce3d4, home-built packages on Fedora 20. I wasn't actually trying to test Ceph, but since I saw some crashes, I've opened this issue and attached logs.
Things started going wrong after doing a couple of spurious "ceph osd down" and "ceph osd out" operations, although some of the crashes happened a few minutes later, possibly in response to the resulting relocation of PGs.
osd.1:
osd/OSD.cc: 4768: FAILED assert(osd->is_active())
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x5b3) [0x6335a3]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x187) [0x633827]
3: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
4: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
6: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
7: (()+0x7f33) [0x7f58ad293f33]
8: (clone()+0x6d) [0x7f58abd24ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
osd.3:
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
9: (()+0x7f33) [0x7f4a1600ef33]
10: (clone()+0x6d) [0x7f4a14a9fead]
osd.4:
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
9: (()+0x7f33) [0x7f3e79608f33]
10: (clone()+0x6d) [0x7f3e78099ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
History
#1 Updated by Samuel Just almost 10 years ago
- Priority changed from Normal to Urgent
#2 Updated by Samuel Just almost 10 years ago
Hmm, ceph-osd.0.log appears to have a bunch of 0s in the middle of the file.
#3 Updated by Sahana Lokeshappa almost 10 years ago
- File ceph-osd.15.log View added
Even in my setup of 8 nodes, with 3 OSDs in each node, I got the same assert.
Steps to reproduce:
While IO was going on, I rebooted one of the nodes, which hosted osd.6, osd.7, and osd.8.
osd.15 on another node then crashed with the assert:
2014-06-19 07:53:36.163182 7f6c132d0700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_reply(OpRequestRef)' thread 7f6c132d0700 time 2014-06-19 07:53:36.161726
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x591) [0x903fe1]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1e6) [0x9042c6]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7aedcb]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x459) [0x635719]
5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
8: (()+0x8182) [0x7f6c2ecff182]
9: (clone()+0x6d) [0x7f6c2d0a030d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#4 Updated by Samuel Just almost 10 years ago
Please enable
debug osd = 20
debug filestore = 20
debug ms = 1
reproduce, and attach the log.
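For anyone following along, the requested settings would look roughly like this in ceph.conf (a sketch; placing them under [osd] assumes the logs should be captured on every OSD daemon):

```ini
[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
```

They can usually also be injected into running daemons without a restart, with something like `ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`.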
#5 Updated by Samuel Just almost 10 years ago
- Assignee set to Samuel Just
#6 Updated by Samuel Just over 9 years ago
- Status changed from New to Can't reproduce
#7 Updated by Sahana Lokeshappa over 9 years ago
- File log_osd_678.gz added
- File log_osd_345.gz added
- File log_osd_012.gz added
- File monlogs.gz added
It reproduced.
Setup details:
3 OSD nodes (3 OSDs on each node)
3 monitors
Rebooted the node with osd.6, osd.7, and osd.8.
Please find attached the logs of the OSDs and monitors, with debug enabled as requested.
#8 Updated by Pavan Rallabhandi over 9 years ago
For the asserts seen in the OSD.0 and OSD.2 logs, there is an existing tracker: #8887.
For OSD.1, going through the logs, sub_op_modify_reply appears to be invoked from the same OSD multiple times, leading to the assert; one such instance, for tid 29459, is pasted below:
<snip>
-1698> 2014-07-17 17:54:12.879138 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1052 crt=0'0 lcod 64'1051 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8
-1625> 2014-07-17 17:54:12.879846 7fddb83fe700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 5
-1541> 2014-07-17 17:54:12.880847 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8
</snip>
Not sure if this is another manifestation of #8504, wherein the same op is dequeued twice by the OSD, resulting in this behavior.
#9 Updated by Greg Farnum over 9 years ago
Yeah, looks like a dup of #8887 as well.