Bug #505 - osd assert on flab
Status: Closed
Description
When running a few tests with the rados tool, I hit an assert in the OSD:
(12:12:48 PM) colinm@newdream.net: osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2024: FAILED assert(repop_queue.front() == repop)
ceph version 0.22 (commit:032d5055483bbab4c6512c7b7b487707645da4ef)
1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x84b) [0x683d83]
2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x25c) [0x684004]
3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0xfd) [0x68410d]
4: (ReplicatedPG::do_sub_op_reply(MOSDSubOpReply*)+0x96) [0x68db00]
5: (OSD::dequeue_op(PG*)+0x26f) [0x6dfe23]
6: (OSD::OpWQ::_process(PG*)+0x21) [0x75f47f]
7: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x28) [0x71cda0]
8: (ThreadPool::worker()+0x23a) [0x820f84]
9: (ThreadPool::WorkThread::entry()+0x19) [0x762533]
10: (Thread::_entry_func(void*)+0x20) [0x667f28]
11: (()+0x68ba) [0x7f5e3521d8ba]
12: (clone()+0x6d) [0x7f5e341d102d]
Configuration: standard vstart.sh with 2 OSDs.
Logs can be found at flab.ceph.dreamhost.com:/home/logs/out-rados-hang
Updated by Sage Weil over 13 years ago
osd.1 is seeing replies out of order:
2010-10-19 10:33:01.007428 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4373.0:1 0.2 obj094/head [] v 7'42 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000370 0x21c6a00
2010-10-19 10:33:01.284541 7f5e2dcc6710 -- 10.3.14.10:6802/11862 <== osd0 10.3.14.10:6800/11814 336 ==== osd_sub_op_reply(client4373.0:1 0.2 obj094/head [] ondisk = 0) v1 ==== 113+0+0 (1460257441 0 0) 0x1f23480
2010-10-19 11:23:37.697189 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4416.0:1 0.2 obj095b/head [] v 7'43 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x21c6000
2010-10-19 11:40:30.269613 7f5e2bbc1710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4447.0:1 0.2 obj095c/head [] v 7'44 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x1ecba00
2010-10-19 11:51:11.649479 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4450.0:1 0.2 obj095c/head [] v 7'45 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000375 0x230a000
2010-10-19 11:51:11.681156 7f5e2dcc6710 -- 10.3.14.10:6802/11862 <== osd0 10.3.14.10:6800/11814 1 ==== osd_sub_op_reply(client4450.0:1 0.2 obj095c/head [] ondisk = 0) v1 ==== 114+0+0 (1770954097 0 0) 0x1f23480
Updated by Sage Weil over 13 years ago
but osd.0 never saw the two ops it skipped:
2010-10-19 10:33:01.010169 7f7d424dd710 -- 10.3.14.10:6800/11814 <== osd1 10.3.14.10:6802/11862 329 ==== osd_sub_op(client4373.0:1 0.2 obj094/head [] v 7'42 snapset=0=[]:[] snapc=0=[]) v2 ==== 485+0+1000370 (1739395167 0 1383916271) 0x30dc000
2010-10-19 10:33:01.284335 7f7d42cde710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op_reply(client4373.0:1 0.2 obj094/head [] ondisk = 0) v1 -- ?+0 0x310b6c0
2010-10-19 11:51:11.652418 7f7d424dd710 -- 10.3.14.10:6800/11814 <== osd1 10.3.14.10:6802/11862 1 ==== osd_sub_op(client4450.0:1 0.2 obj095c/head [] v 7'45 snapset=0=[]:[] snapc=0=[]) v2 ==== 487+0+1000375 (2335475176 0 3997890699) 0x30cda00
2010-10-19 11:51:11.680912 7f7d42cde710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op_reply(client4450.0:1 0.2 obj095c/head [] ondisk = 0) v1 -- ?+0 0x2d42b40
Updated by Sage Weil over 13 years ago
because:
2010-10-19 10:48:04.501035 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=2 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 10:50:16.606010 7f7d403d8710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op(client4413.0:1 0.3 obj095a/head [] v 7'40 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x30cda00
2010-10-19 10:50:16.609533 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=2 cs=2 l=0).fault first fault
2010-10-19 10:50:16.610042 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=15 pgs=2 cs=2 l=0).connect got RESETSESSION
2010-10-19 11:05:16.710248 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=141 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 11:23:37.697763 7f7d3cbca710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5a00 sd=13 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2010-10-19 11:23:37.700962 7f7d3cac9710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5780 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2010-10-19 11:38:37.800590 7f7d3cac9710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5780 sd=-1 pgs=148 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 11:40:30.270364 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5500 sd=13 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2010-10-19 11:40:30.270895 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5280 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
???
Updated by Greg Farnum over 13 years ago
- Assignee set to Greg Farnum
Sage thinks it's a problem in reconnect: the messenger is dropping queued messages, which in turn causes the OSD assert.
Updated by Greg Farnum over 13 years ago
- Status changed from New to In Progress
The error that exposed this was introduced in 8528ebb0c6286eb6660773fcaf29d1cccd98d72c, but the root cause is that we can lose queued messages if an existing pipe is taken over in accept and the new pipe then fails before delivering them.
A first attempt at a fix is pushed to the msgr branch, but it doesn't pass all the tests I'd expect it to, so it isn't working (at least, not all the way).
Updated by Greg Farnum over 13 years ago
- Status changed from In Progress to Resolved
Well, that was a Duh.
Fixed in 49d8fd8a21778d0f805176d670d5f63f14e36b47 and 696da81588621ac9ee256993a184c97fb2cb2e35. Pushed to msgr and unstable branches.