Bug #1631
Closed
osd: failed assert(repop_queue.front() == repop)
Description
This happened on two osds during a multiple_rsync workunit (teuthology:~teuthworker/archive/nightly_coverage_2011-10-19/656/). On osd.0:
2011-10-19 00:12:28.417577 7f409a9a9700 osd.0 4 pg[0.5( v 4'353 (4'183,4'353] n=353 ec=1 les/c 0/4 3/3/2) [0,1] r=0 luod=4'331 lcod 4'331 mlcod 4'260 !hml active+clean] removing repgather(0x520f6c0 applied 4'259 rep_tid=1252 wfack= wfdisk= op=osd_op(client.4105.1:2055 100000009f8.00000000 [write 0~1925 [1@-1]] 0.ecaaaa2d snapc 1=[]))
2011-10-19 00:12:28.417615 7f409a9a9700 osd.0 4 pg[0.5( v 4'353 (4'183,4'353] n=353 ec=1 les/c 0/4 3/3/2) [0,1] r=0 luod=4'331 lcod 4'331 mlcod 4'260 !hml active+clean] q front is repgather(0x26ee900 applied 4'185 rep_tid=940 wfack=1 wfdisk=1 op=osd_op(client.4105.1:1444 100000006ad.00000000 [write 0~2332 [1@-1]] 0.f1c210bd snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)', in thread '0x7f409a9a9700'
osd/ReplicatedPG.cc: 2822: FAILED assert(repop_queue.front() == repop)
ceph version 0.36-327-g3e92aac (commit:3e92aace21ecc766f14ac5a2c6377570988f1a3b)
1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xff3) [0x4bfe23]
2: (ReplicatedPG::op_applied(ReplicatedPG::RepGather*)+0x4d4) [0x4cddc4]
3: (C_OSD_OpApplied::finish(int)+0x19) [0x539a29]
4: (Finisher::finisher_thread_entry()+0x248) [0x613758]
5: (Finisher::FinisherThread::entry()+0x15) [0x60d9a5]
6: (Thread::_entry_func(void*)+0x12) [0x615372]
7: (()+0x7971) [0x7f40a4c2a971]
8: (clone()+0x6d) [0x7f40a34ba92d]
On osd.1:
2011-10-19 00:12:22.065376 7f0fdd7c6700 -- 10.3.14.180:6801/1706 >> 10.3.14.178:6802/1733 pipe(0x2c08280 sd=16 pgs=1 cs=1 l=0).fault initiating reconnect
2011-10-19 00:12:28.159588 7f0fdffcb700 osd.1 4 pg[0.7( v 4'407 (4'270,4'407] n=407 ec=1 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'271 !hml active+clean] removing repgather(0x56805a0 applied 4'332 rep_tid=1768 wfack= wfdisk= op=osd_op(client.4105.1:2646 10000000c57.00000000 [write 0~39256 [1@-1]] 0.84ac7407 snapc 1=[]))
2011-10-19 00:12:28.159657 7f0fdffcb700 osd.1 4 pg[0.7( v 4'407 (4'270,4'407] n=407 ec=1 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'271 !hml active+clean] q front is repgather(0x6065360 applied 4'272 rep_tid=1522 wfack=0 wfdisk=0 op=osd_op(client.4105.1:2193 10000000a8c.00000000 [write 0~23 [1@-1]] 0.bbb9a107 snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)', in thread '0x7f0fdffcb700'
osd/ReplicatedPG.cc: 2822: FAILED assert(repop_queue.front() == repop)
ceph version 0.36-327-g3e92aac (commit:3e92aace21ecc766f14ac5a2c6377570988f1a3b)
1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xff3) [0x4bfe23]
2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x1fc) [0x4c01ec]
3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x188) [0x4c0798]
4: (ReplicatedPG::do_sub_op_reply(MOSDSubOpReply*)+0x54) [0x4d9214]
5: (OSD::dequeue_op(PG*)+0x4fa) [0x573e6a]
6: (OSD::OpWQ::_process(PG*)+0x15) [0x5d09c5]
7: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x59dfc2]
8: (ThreadPool::worker()+0x7e3) [0x66e353]
9: (ThreadPool::WorkThread::entry()+0x15) [0x5aae25]
10: (Thread::_entry_func(void*)+0x12) [0x615372]
11: (()+0x7971) [0x7f0fef357971]
12: (clone()+0x6d) [0x7f0fedbe792d]
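Both traces fail the same check: the repop being removed has nothing left to wait for, but it is not the one at the head of the PG's repop queue. On osd.0, rep_tid=1252 was finished while rep_tid=940 at the front still showed wfack=1 wfdisk=1. The following is a simplified Python model of that in-order invariant, not Ceph's C++ code; the class and method names (`RepGather`, `RepopQueue`, `issue`, `ack`) and the single-replica setup are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RepGather:
    """Toy stand-in for a replicated op waiting on replica acks."""
    rep_tid: int
    waitfor_ack: set = field(default_factory=set)
    waitfor_disk: set = field(default_factory=set)

    def done(self):
        # Finished once no replica is pending for either ack or disk commit.
        return not self.waitfor_ack and not self.waitfor_disk


class RepopQueue:
    """Hypothetical model: repops are expected to complete in issue order."""

    def __init__(self):
        self.queue = deque()

    def issue(self, rep_tid, replicas):
        repop = RepGather(rep_tid, set(replicas), set(replicas))
        self.queue.append(repop)
        return repop

    def ack(self, repop, osd):
        # For simplicity, one reply clears both the ack and the disk wait.
        repop.waitfor_ack.discard(osd)
        repop.waitfor_disk.discard(osd)
        self.eval_repop(repop)

    def eval_repop(self, repop):
        if not repop.done():
            return
        # The invariant from the bug title: a finished repop must be at the front.
        if self.queue[0] is not repop:
            raise AssertionError(
                "removing rep_tid=%d but q front is rep_tid=%d"
                % (repop.rep_tid, self.queue[0].rep_tid)
            )
        self.queue.popleft()
```

In this model, acknowledging rep_tid=940 and then rep_tid=1252 drains the queue, while a reply for 1252 that arrives before the reply for 940 raises the error. A lost or reordered replica reply, for example around a reconnect, is one way such an ordering could arise; the model does not show that this is what happened here.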
Updated by Sage Weil over 12 years ago
- Status changed from New to Need More Info
need an osd log on this one
Updated by Sage Weil over 12 years ago
- Target version changed from v0.38 to v0.39
Updated by Josh Durgin over 12 years ago
This happened again with the same workload in /var/lib/teuthworker/archive/nightly_coverage_2011-11-23-b/3034/remote/ubuntu@sepia5.ceph.dreamhost.com/log/osd.1.log.gz
Updated by Sage Weil over 12 years ago
- Priority changed from Normal to High
Ok, pretty sure this is related to the reconnect. We need to put together a test that artificially triggers messenger connection drops to test those paths thoroughly...
'ms inject socket failures = 100' or something.
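The kind of test proposed above can be sketched as random fault injection on the send path. This is a minimal Python model, not Ceph's messenger: `FaultInjector`, `send_all`, and the reading of the value as "fail roughly one in N socket operations" are assumptions for illustration, and the comment itself only floats the setting as "or something".

```python
import random


class FaultInjector:
    """Hypothetical injector: fails about one in `one_in` operations."""

    def __init__(self, one_in, seed=None):
        self.one_in = one_in
        self.rng = random.Random(seed)

    def should_fail(self):
        # one_in == 0 disables injection entirely.
        return self.one_in > 0 and self.rng.randrange(self.one_in) == 0


def send_all(messages, injector):
    """Deliver messages in order, reconnecting whenever a failure is injected."""
    delivered = []
    reconnects = 0
    for msg in messages:
        # Each injected failure models a dropped connection followed by a retry.
        while injector.should_fail():
            reconnects += 1
        delivered.append(msg)
    return delivered, reconnects
```

A test built this way would run a normal workload with injection enabled and check that every message is still delivered exactly once and in order, which is the property the reconnect path has to preserve.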
Updated by Sage Weil over 12 years ago
- Target version changed from v0.39 to v0.40
Updated by Sage Weil over 12 years ago
- Target version deleted (v0.40)
Updated by Anonymous over 12 years ago
We haven't seen this, but hope that the messenger tests now being designed will flush it out again.
Updated by Sage Weil about 12 years ago
- Status changed from Need More Info to Can't reproduce
this code has been refactored a bit.
the messenger tests won't directly trigger this, though we may find an underlying msgr bug that may have caused it. i don't think there's value in keeping this open.