Bug #1631

closed

osd: failed assert(repop_queue.front() == repop)

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

This happened on two osds during a multiple_rsync workunit (teuthology:~teuthworker/archive/nightly_coverage_2011-10-19/656/). On osd.0:

2011-10-19 00:12:28.417577 7f409a9a9700 osd.0 4 pg[0.5( v 4'353 (4'183,4'353] n=353 ec=1 les/c 0/4 3/3/2) [0,1] r=0 luod=4'331 lcod 4'331 mlcod 4'260 !hml active+clean]  removing repgather(0x520f6c0 applied 4'259 rep_tid=1252 wfack= wfdisk= op=osd_op(client.4105.1:2055 100000009f8.00000000 [write 0~1925 [1@-1]] 0.ecaaaa2d snapc 1=[]))
2011-10-19 00:12:28.417615 7f409a9a9700 osd.0 4 pg[0.5( v 4'353 (4'183,4'353] n=353 ec=1 les/c 0/4 3/3/2) [0,1] r=0 luod=4'331 lcod 4'331 mlcod 4'260 !hml active+clean]    q front is repgather(0x26ee900 applied 4'185 rep_tid=940 wfack=1 wfdisk=1 op=osd_op(client.4105.1:1444 100000006ad.00000000 [write 0~2332 [1@-1]] 0.f1c210bd snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)', in thread '0x7f409a9a9700'
osd/ReplicatedPG.cc: 2822: FAILED assert(repop_queue.front() == repop)
 ceph version 0.36-327-g3e92aac (commit:3e92aace21ecc766f14ac5a2c6377570988f1a3b)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xff3) [0x4bfe23]
 2: (ReplicatedPG::op_applied(ReplicatedPG::RepGather*)+0x4d4) [0x4cddc4]
 3: (C_OSD_OpApplied::finish(int)+0x19) [0x539a29]
 4: (Finisher::finisher_thread_entry()+0x248) [0x613758]
 5: (Finisher::FinisherThread::entry()+0x15) [0x60d9a5]
 6: (Thread::_entry_func(void*)+0x12) [0x615372]
 7: (()+0x7971) [0x7f40a4c2a971]
 8: (clone()+0x6d) [0x7f40a34ba92d]

On osd.1:

2011-10-19 00:12:22.065376 7f0fdd7c6700 -- 10.3.14.180:6801/1706 >> 10.3.14.178:6802/1733 pipe(0x2c08280 sd=16 pgs=1 cs=1 l=0).fault initiating reconnect
2011-10-19 00:12:28.159588 7f0fdffcb700 osd.1 4 pg[0.7( v 4'407 (4'270,4'407] n=407 ec=1 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'271 !hml active+clean]  removing repgather(0x56805a0 applied 4'332 rep_tid=1768 wfack= wfdisk= op=osd_op(client.4105.1:2646 10000000c57.00000000 [write 0~39256 [1@-1]] 0.84ac7407 snapc 1=[]))
2011-10-19 00:12:28.159657 7f0fdffcb700 osd.1 4 pg[0.7( v 4'407 (4'270,4'407] n=407 ec=1 les/c 4/4 3/3/3) [1,0] r=0 mlcod 4'271 !hml active+clean]    q front is repgather(0x6065360 applied 4'272 rep_tid=1522 wfack=0 wfdisk=0 op=osd_op(client.4105.1:2193 10000000a8c.00000000 [write 0~23 [1@-1]] 0.bbb9a107 snapc 1=[]))
osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)', in thread '0x7f0fdffcb700'
osd/ReplicatedPG.cc: 2822: FAILED assert(repop_queue.front() == repop)
 ceph version 0.36-327-g3e92aac (commit:3e92aace21ecc766f14ac5a2c6377570988f1a3b)
 1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0xff3) [0x4bfe23]
 2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x1fc) [0x4c01ec]
 3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0x188) [0x4c0798]
 4: (ReplicatedPG::do_sub_op_reply(MOSDSubOpReply*)+0x54) [0x4d9214]
 5: (OSD::dequeue_op(PG*)+0x4fa) [0x573e6a]
 6: (OSD::OpWQ::_process(PG*)+0x15) [0x5d09c5]
 7: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x59dfc2]
 8: (ThreadPool::worker()+0x7e3) [0x66e353]
 9: (ThreadPool::WorkThread::entry()+0x15) [0x5aae25]
 10: (Thread::_entry_func(void*)+0x12) [0x615372]
 11: (()+0x7971) [0x7f0fef357971]
 12: (clone()+0x6d) [0x7f0fedbe792d]
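
In both traces the repop being removed has nothing outstanding (empty wfack/wfdisk), while the repop at the front of the queue is older and, on osd.0, is still waiting on osd.1. The sketch below is a minimal model of that ordering check, not the ReplicatedPG code: `RepGatherSketch` and `can_retire_in_order` are hypothetical names, only the fields visible in the log lines are modelled, and the repops queued between the two tids are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <set>

// Hypothetical stand-in for ReplicatedPG::RepGather; only the fields
// that appear in the log lines above are modelled.
struct RepGatherSketch {
  uint64_t rep_tid;
  std::set<int> waitfor_ack;   // "wfack" in the log
  std::set<int> waitfor_disk;  // "wfdisk" in the log
  bool done() const { return waitfor_ack.empty() && waitfor_disk.empty(); }
};

// Returns true if a completed repop is at the front of the queue and can
// be retired in order. Returns false when it is complete but an older
// repop is still queued ahead of it, which is the condition the failed
// assert guards against.
bool can_retire_in_order(const std::deque<RepGatherSketch*>& q,
                         const RepGatherSketch* repop) {
  assert(!q.empty());
  assert(repop->done());
  return q.front() == repop;
}

// Rebuilds the osd.0 situation from the log: rep_tid 940 still waits on
// osd.1 for ack and disk, while rep_tid 1252 has nothing outstanding.
bool osd0_case_retires_in_order() {
  RepGatherSketch older{940, {1}, {1}};
  RepGatherSketch newer{1252, {}, {}};
  std::deque<RepGatherSketch*> q{&older, &newer};
  return can_retire_in_order(q, &newer);
}
```

Here `osd0_case_retires_in_order()` returns false, which corresponds to the state in which the real code aborts. The model says nothing about why the older repop never received its replies.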

Related issues 1 (0 open, 1 closed)

Related to Ceph - Feature #1412: qa: spec out messenger testing (Resolved, 08/18/2011)

Actions #1

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info

need an osd log on this one

Actions #2

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.38 to v0.39
Actions #4

Updated by Josh Durgin over 12 years ago

This happened again with the same workload; the osd log is in /var/lib/teut/log/osd.1.log.gz

Actions #5

Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High

Ok, pretty sure this is related to the reconnect. We need to put together a test that artificially triggers messenger connection drops to test those paths thoroughly...

'ms inject socket failures = 100' or something.
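
A rough sketch of what such an injection knob could look like, assuming the value 100 means "fail roughly one socket operation in every 100". That reading, the class `SocketFailureInjector`, and the helper `count_injected_faults` are all hypothetical and are not taken from the messenger code.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper, not the Ceph messenger: decides before each socket
// operation whether to simulate a connection drop, with probability 1/N.
class SocketFailureInjector {
 public:
  SocketFailureInjector(uint32_t one_in_n, uint64_t seed)
      : one_in_n_(one_in_n), rng_(seed) {}

  // Returns true when the caller should simulate a fault.
  // A setting of 0 disables injection entirely.
  bool should_fail() {
    if (one_in_n_ == 0) return false;
    std::uniform_int_distribution<uint32_t> dist(0, one_in_n_ - 1);
    return dist(rng_) == 0;
  }

 private:
  uint32_t one_in_n_;
  std::mt19937_64 rng_;
};

// Counts how many faults would be injected over `ops` simulated socket
// operations; a fixed seed keeps a test run repeatable.
uint64_t count_injected_faults(uint32_t one_in_n, uint64_t ops, uint64_t seed) {
  SocketFailureInjector inj(one_in_n, seed);
  uint64_t faults = 0;
  for (uint64_t i = 0; i < ops; ++i) {
    if (inj.should_fail()) ++faults;
  }
  assert(faults <= ops);
  return faults;
}
```

A messenger test would call `should_fail()` before each read or write and close the socket when it returns true, so that the reconnect paths run many times in a single workload.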

Actions #7

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #8

Updated by Sage Weil over 12 years ago

  • Priority changed from High to Normal
Actions #9

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
Actions #10

Updated by Anonymous about 12 years ago

We haven't seen this, but hope that the messenger tests now being designed will flush it out again.

Actions #11

Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to Can't reproduce

this code has been refactored a bit.

the messenger tests won't directly trigger this, though we may find the/an underlying msgr bug that may have caused it. i don't think there's value in keeping this open.
