Bug #505 - osd assert on flab
Status: Closed
Description
When running a few tests with the rados tool, I hit an assert in the OSD:
(12:12:48 PM) colinm@newdream.net: osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)':
osd/ReplicatedPG.cc:2024: FAILED assert(repop_queue.front() == repop)
ceph version 0.22 (commit:032d5055483bbab4c6512c7b7b487707645da4ef)
1: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x84b) [0x683d83]
2: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, eversion_t)+0x25c) [0x684004]
3: (ReplicatedPG::sub_op_modify_reply(MOSDSubOpReply*)+0xfd) [0x68410d]
4: (ReplicatedPG::do_sub_op_reply(MOSDSubOpReply*)+0x96) [0x68db00]
5: (OSD::dequeue_op(PG*)+0x26f) [0x6dfe23]
6: (OSD::OpWQ::_process(PG*)+0x21) [0x75f47f]
7: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x28) [0x71cda0]
8: (ThreadPool::worker()+0x23a) [0x820f84]
9: (ThreadPool::WorkThread::entry()+0x19) [0x762533]
10: (Thread::_entry_func(void*)+0x20) [0x667f28]
11: (()+0x68ba) [0x7f5e3521d8ba]
12: (clone()+0x6d) [0x7f5e341d102d]
Configuration: standard vstart.sh with 2 OSDs.
Logs can be found at flab.ceph.dreamhost.com:/home/logs/out-rados-hang
Updated by Sage Weil over 13 years ago
osd.1 is seeing replies out of order:
2010-10-19 10:33:01.007428 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4373.0:1 0.2 obj094/head [] v 7'42 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000370 0x21c6a00
2010-10-19 10:33:01.284541 7f5e2dcc6710 -- 10.3.14.10:6802/11862 <== osd0 10.3.14.10:6800/11814 336 ==== osd_sub_op_reply(client4373.0:1 0.2 obj094/head [] ondisk = 0) v1 ==== 113+0+0 (1460257441 0 0) 0x1f23480
2010-10-19 11:23:37.697189 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4416.0:1 0.2 obj095b/head [] v 7'43 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x21c6000
2010-10-19 11:40:30.269613 7f5e2bbc1710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4447.0:1 0.2 obj095c/head [] v 7'44 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x1ecba00
2010-10-19 11:51:11.649479 7f5e2c3c2710 -- 10.3.14.10:6802/11862 --> osd0 10.3.14.10:6800/11814 -- osd_sub_op(client4450.0:1 0.2 obj095c/head [] v 7'45 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000375 0x230a000
2010-10-19 11:51:11.681156 7f5e2dcc6710 -- 10.3.14.10:6802/11862 <== osd0 10.3.14.10:6800/11814 1 ==== osd_sub_op_reply(client4450.0:1 0.2 obj095c/head [] ondisk = 0) v1 ==== 114+0+0 (1770954097 0 0) 0x1f23480
Updated by Sage Weil over 13 years ago
but osd.0 never saw the two ops it skipped:
2010-10-19 10:33:01.010169 7f7d424dd710 -- 10.3.14.10:6800/11814 <== osd1 10.3.14.10:6802/11862 329 ==== osd_sub_op(client4373.0:1 0.2 obj094/head [] v 7'42 snapset=0=[]:[] snapc=0=[]) v2 ==== 485+0+1000370 (1739395167 0 1383916271) 0x30dc000
2010-10-19 10:33:01.284335 7f7d42cde710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op_reply(client4373.0:1 0.2 obj094/head [] ondisk = 0) v1 -- ?+0 0x310b6c0
2010-10-19 11:51:11.652418 7f7d424dd710 -- 10.3.14.10:6800/11814 <== osd1 10.3.14.10:6802/11862 1 ==== osd_sub_op(client4450.0:1 0.2 obj095c/head [] v 7'45 snapset=0=[]:[] snapc=0=[]) v2 ==== 487+0+1000375 (2335475176 0 3997890699) 0x30cda00
2010-10-19 11:51:11.680912 7f7d42cde710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op_reply(client4450.0:1 0.2 obj095c/head [] ondisk = 0) v1 -- ?+0 0x2d42b40
Updated by Sage Weil over 13 years ago
because:
2010-10-19 10:48:04.501035 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=2 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 10:50:16.606010 7f7d403d8710 -- 10.3.14.10:6800/11814 --> osd1 10.3.14.10:6802/11862 -- osd_sub_op(client4413.0:1 0.3 obj095a/head [] v 7'40 snapset=0=[]:[] snapc=0=[]) v1 -- ?+1000331 0x30cda00
2010-10-19 10:50:16.609533 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=2 cs=2 l=0).fault first fault
2010-10-19 10:50:16.610042 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=15 pgs=2 cs=2 l=0).connect got RESETSESSION
2010-10-19 11:05:16.710248 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x3191000 sd=-1 pgs=141 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 11:23:37.697763 7f7d3cbca710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5a00 sd=13 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2010-10-19 11:23:37.700962 7f7d3cac9710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5780 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2010-10-19 11:38:37.800590 7f7d3cac9710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5780 sd=-1 pgs=148 cs=1 l=0).fault with nothing to send, going to standby
2010-10-19 11:40:30.270364 7f7d3d1d0710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5500 sd=13 pgs=0 cs=0 l=0).accept connect_seq 2 vs existing 1 state 3
2010-10-19 11:40:30.270895 7f7d3d3d2710 -- 10.3.14.10:6800/11814 >> 10.3.14.10:6802/11862 pipe(0x2df5280 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
???
Updated by Greg Farnum over 13 years ago
- Assignee set to Greg Farnum
Sage thinks it's a problem in reconnect: the messenger is dropping queued messages, which in turn causes the OSD assert.
Updated by Greg Farnum over 13 years ago
- Status changed from New to In Progress
The error that exposed this was introduced in 8528ebb0c6286eb6660773fcaf29d1cccd98d72c, but the root cause is that we can lose queued messages if an existing pipe is taken over in accept and the new pipe then fails before delivering them.
A first attempt at a fix is pushed to the msgr branch, but it doesn't pass all the tests I'd expect it to, so it isn't working (at least, not all the way).
Updated by Greg Farnum over 13 years ago
- Status changed from In Progress to Resolved
Well, that was a Duh.
Fixed in 49d8fd8a21778d0f805176d670d5f63f14e36b47 and 696da81588621ac9ee256993a184c97fb2cb2e35. Pushed to msgr and unstable branches.