Bug #8346
OSD crashes on master (FAILED assert(ip_op.waiting_for_commit.count(from)))
Status: Closed
Description
Revision bfce3d4, home-built packages on Fedora 20. I wasn't actually trying to test Ceph, but since I saw some crashes, I've opened this issue and attached logs.
Things started going wrong after doing a couple of spurious "ceph osd down" and "ceph osd out" operations, although some of the crashes happened a few minutes later, possibly in response to the resulting relocation of PGs.
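For reference, the operations in question are of this form (the osd id below is illustrative, not one from this cluster):

    ceph osd down 1
    ceph osd out 1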
osd.1:
osd/OSD.cc: 4768: FAILED assert(osd->is_active())
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x5b3) [0x6335a3]
2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x187) [0x633827]
3: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
4: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
6: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
7: (()+0x7f33) [0x7f58ad293f33]
8: (clone()+0x6d) [0x7f58abd24ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
osd.3:
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
9: (()+0x7f33) [0x7f4a1600ef33]
10: (clone()+0x6d) [0x7f4a14a9fead]
osd.4:
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
9: (()+0x7f33) [0x7f3e79608f33]
10: (clone()+0x6d) [0x7f3e78099ded]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Samuel Just almost 10 years ago
- Priority changed from Normal to Urgent
Updated by Samuel Just almost 10 years ago
Hmm, ceph-osd.0.log appears to have a bunch of 0s in the middle of the file.
Updated by Sahana Lokeshappa almost 10 years ago
- File ceph-osd.15.log added
Even in my setup of 8 nodes, with 3 OSDs on each node, I got the same assert:
Steps to reproduce:
While I/O was going on, I rebooted one of the nodes, which hosted osd.6, osd.7, and osd.8.
osd.15 on another node crashed with this assert:
2014-06-19 07:53:36.163182 7f6c132d0700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_reply(OpRequestRef)' thread 7f6c132d0700 time 2014-06-19 07:53:36.161726
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version sandisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x591) [0x903fe1]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1e6) [0x9042c6]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7aedcb]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x459) [0x635719]
5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
8: (()+0x8182) [0x7f6c2ecff182]
9: (clone()+0x6d) [0x7f6c2d0a030d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Samuel Just almost 10 years ago
Please enable
debug osd = 20
debug filestore = 20
debug ms = 1
reproduce, and attach the log.
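For reference, these settings would typically go in the [osd] section of ceph.conf on the affected nodes (the placement below is illustrative):

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

They can also be injected at runtime without restarting the daemons, e.g. with `ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`.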
Updated by Samuel Just almost 10 years ago
- Status changed from New to Can't reproduce
Updated by Sahana Lokeshappa almost 10 years ago
- File log_osd_678.gz added
- File log_osd_345.gz added
- File log_osd_012.gz added
- File monlogs.gz added
It reproduced.
Setup details:
3 OSD nodes (3 OSDs on each node)
3 monitors
Rebooted the node with osd.6, osd.7, and osd.8.
Please find attached the logs of the OSDs and monitors with debug enabled, as requested.
Updated by Pavan Rallabhandi almost 10 years ago
For the asserts seen in the OSD.0 and OSD.2 logs, there is already a tracker: #8887.
For OSD.1, going through the logs, sub_op_modify_reply appears to have been invoked from the same OSD multiple times, leading to the assert; one such instance, for tid 29459, is pasted below:
<snip>
-1698> 2014-07-17 17:54:12.879138 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1052 crt=0'0 lcod 64'1051 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8
-1625> 2014-07-17 17:54:12.879846 7fddb83fe700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 5
-1541> 2014-07-17 17:54:12.880847 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8
</snip>
Not sure if this is another manifestation of #8504, wherein the same op is dequeued twice by the OSD, resulting in this behavior.
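To make the suspected failure mode concrete, here is a toy C++ model of the bookkeeping the assert guards. It is not the actual Ceph source: the InProgressOp layout, the handler name, and the third replica id 9 are illustrative (9 is a device to keep the op tracked long enough for the duplicate to land; in the real backend, separate applied/commit tracking can similarly keep an op in progress after some acks arrive).

    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>

    using osd_id_t = int;
    using tid_t = uint64_t;

    // Per-op bookkeeping: which replica OSDs still owe us a commit ack.
    struct InProgressOp {
        std::set<osd_id_t> waiting_for_commit;
    };

    std::map<tid_t, InProgressOp> in_progress_ops;

    // Toy stand-in for sub_op_modify_reply: a commit ack for (tid, from) arrives.
    void handle_commit_ack(tid_t tid, osd_id_t from) {
        auto it = in_progress_ops.find(tid);
        if (it == in_progress_ops.end())
            return;  // op already completed and erased; a late ack is dropped
        InProgressOp& ip_op = it->second;
        // The invariant the crashing assert enforces: this replica's ack must
        // still be outstanding. A duplicate delivery of the same reply finds
        // 'from' already erased and aborts here.
        assert(ip_op.waiting_for_commit.count(from));
        ip_op.waiting_for_commit.erase(from);
        if (ip_op.waiting_for_commit.empty()) {
            std::cout << "tid " << tid << " fully committed" << std::endl;
            in_progress_ops.erase(it);
        }
    }

    int main() {
        // tid and replica ids 5 and 8 are taken from the log excerpt above;
        // replica 9 is hypothetical, so the op is still tracked when the
        // duplicate reply arrives.
        in_progress_ops[29459].waiting_for_commit = {5, 8, 9};
        handle_commit_ack(29459, 8);  // ok: erases 8, {5, 9} remain
        handle_commit_ack(29459, 5);  // ok: erases 5, {9} remains
        handle_commit_ack(29459, 8);  // duplicate reply: trips the assert
    }

Under this model, the duplicated "from 8" reply in the excerpt finds osd.8 already erased from waiting_for_commit while the op is still tracked, which is exactly the invariant the assert checks.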
Updated by Greg Farnum almost 10 years ago
Yeah, looks like a dup of #8887 as well.