Bug #8346

OSD crashes on master (FAILED assert(ip_op.waiting_for_commit.count(from)))

Added by John Spray almost 10 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Revision bfce3d4, home-built packages on Fedora 20. I wasn't actually trying to test Ceph, but since I saw some crashes I've opened this ticket and attached the logs.

Things started going wrong after doing a couple of spurious "ceph osd down" and "ceph osd out" operations, although some of the crashes happened a few minutes later, possibly in response to the resulting relocation of PGs.
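
For context, the operations in question were of this form (the osd id here is illustrative):

ceph osd down 1
ceph osd out 1

Marking an OSD out changes the CRUSH mapping, so PGs are remapped and backfilled onto the remaining OSDs, which fits the delayed crashes described above.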

osd.1:

osd/OSD.cc: 4768: FAILED assert(osd->is_active())

 ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
 1: (OSDService::share_map(entity_name_t, Connection*, unsigned int, std::tr1::shared_ptr<OSDMap const>&, unsigned int*)+0x5b3) [0x6335a3]
 2: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x187) [0x633827]
 3: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
 4: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
 6: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
 7: (()+0x7f33) [0x7f58ad293f33]
 8: (clone()+0x6d) [0x7f58abd24ded]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

osd.3:

osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))

 ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
 1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
 9: (()+0x7f33) [0x7f4a1600ef33]
 10: (clone()+0x6d) [0x7f4a14a9fead]

osd.4:

osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))

 ceph version 0.80-419-gbfce3d4 (bfce3d4facad93ce6528a9595e9b3feb9d0884ca)
 1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x5ab) [0x92242b]
 2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1f6) [0x922736]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x303) [0x7bb403]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x475) [0x633b15]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x203) [0x633f93]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0xac) [0x6689bc]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xa75d30]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa76c20]
 9: (()+0x7f33) [0x7f3e79608f33]
 10: (clone()+0x6d) [0x7f3e78099ded]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
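
For readers unfamiliar with this code path, below is a minimal, self-contained C++ sketch of what the failing check at ReplicatedBackend.cc:641 does, assuming (as the assert message suggests) that the backend tracks, per tid, the set of replicas that still owe a commit ack. All type and function names here are illustrative stand-ins, not the actual Ceph source.

#include <cassert>
#include <cstdint>
#include <map>
#include <set>

using tid_t = uint64_t;  // transaction id of the replicated write
using shard_t = int;     // id of the replica OSD acking the write

struct InProgressOp {
  std::set<shard_t> waiting_for_commit;   // replicas yet to commit
  std::set<shard_t> waiting_for_applied;  // replicas yet to apply
};

std::map<tid_t, InProgressOp> in_progress_ops;

// Sketch of the reply path: a commit ack from 'from' is only legal
// while 'from' is still recorded as owing a commit for this tid.
void handle_commit_ack(tid_t tid, shard_t from) {
  InProgressOp &ip_op = in_progress_ops.at(tid);
  // The check that fails in the reports above: a duplicate ack from
  // the same replica finds no entry, so count(from) == 0.
  assert(ip_op.waiting_for_commit.count(from));
  ip_op.waiting_for_commit.erase(from);
}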

osd_logs.tar.gz (2.06 MB) John Spray, 05/13/2014 02:50 PM

ceph-osd.15.log (16.9 MB) Sahana Lokeshappa, 06/25/2014 02:27 AM

log_osd_678.gz (4.35 MB) Sahana Lokeshappa, 07/18/2014 05:38 AM

log_osd_345.gz (3.39 MB) Sahana Lokeshappa, 07/18/2014 05:38 AM

log_osd_012.gz (4.62 MB) Sahana Lokeshappa, 07/18/2014 05:38 AM

monlogs.gz (3.2 MB) Sahana Lokeshappa, 07/18/2014 05:38 AM

History

#1 Updated by Samuel Just almost 10 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Samuel Just almost 10 years ago

Hmm, ceph-osd.0.log appears to have a bunch of 0s in the middle of the file.

#3 Updated by Sahana Lokeshappa almost 10 years ago

I hit the same assert on my setup of 8 nodes, with 3 OSDs per node.
Steps to reproduce:
While IO was going on, I rebooted one of the nodes, which hosted osd.6, osd.7, and osd.8.

osd.15 on another node crashed with the assert:

2014-06-19 07:53:36.163182 7f6c132d0700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::sub_op_modify_reply(OpRequestRef)' thread 7f6c132d0700 time 2014-06-19 07:53:36.161726
osd/ReplicatedBackend.cc: 641: FAILED assert(ip_op.waiting_for_commit.count(from))
ceph version andisk-sprint-2-drop-3-390-g2dbd85c (2dbd85c94cf27a1ff0419c5ea9359af7fe30e9b6)
1: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x591) [0x903fe1]
2: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1e6) [0x9042c6]
3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7aedcb]
4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x459) [0x635719]
5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x346) [0x635ce6]
6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ce) [0xa4a1ce]
7: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa4c420]
8: (()+0x8182) [0x7f6c2ecff182]
9: (clone()+0x6d) [0x7f6c2d0a030d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#4 Updated by Samuel Just almost 10 years ago

Please enable
debug osd = 20
debug filestore = 20
debug ms = 1

reproduce, and attach the log.
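
For reference, a typical way to apply these settings is either in the [osd] section of ceph.conf (followed by an OSD restart), or at runtime via injectargs; the osd id below is illustrative:

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

ceph tell osd.15 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'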

#5 Updated by Samuel Just almost 10 years ago

  • Assignee set to Samuel Just

#6 Updated by Samuel Just over 9 years ago

  • Status changed from New to Can't reproduce

#7 Updated by Sahana Lokeshappa over 9 years ago

It reproduced.

Setup details:

3 osd nodes (3 osds in each node)
3 monitors

Rebooted the node hosting osds 6, 7, and 8.

Please find attached the OSD and monitor logs with debug enabled, as requested.

#8 Updated by Pavan Rallabhandi over 9 years ago

For the asserts seen in the OSD.0 and OSD.2 logs, there is an existing tracker: #8887.

For OSD.1, going through the logs, sub_op_modify_reply appears to be invoked multiple times for a reply from the same OSD, leading to the assert; one such instance, for tid 29459, is pasted below:

<snip>

-1698> 2014-07-17 17:54:12.879138 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1052 crt=0'0 lcod 64'1051 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8

-1625> 2014-07-17 17:54:12.879846 7fddb83fe700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 5

-1541> 2014-07-17 17:54:12.880847 7fddb5bf9700 7 osd.1 pg_epoch: 64 pg[4.32( v 64'1059 (0'0,64'1059] local-les=62 n=323 ec=61 les/c 62/62 61/61/61) [1,5,8] r=0 lpr=61 luod=64'1053 crt=0'0 lcod 64'1052 mlcod 64'1050 active+clean] sub_op_modify_reply: tid 29459 op ack_type 4 from 8

</snip>

Not sure if this is another manifestation of #8504, wherein the same op is dequeued twice by the OSD, resulting in this behavior.
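
Continuing the illustrative sketch added after the description (again, not the actual Ceph code), the logged sequence for tid 29459 would trip the check like this:

handle_commit_ack(29459, 8);  // first ack from osd.8: entry erased
handle_commit_ack(29459, 5);  // ack from osd.5: entry erased
handle_commit_ack(29459, 8);  // duplicate ack from osd.8: count(from) == 0,
                              // FAILED assert(ip_op.waiting_for_commit.count(from))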

#9 Updated by Greg Farnum over 9 years ago

Yeah, looks like a dup of #8887 as well.
