Bug #10057
closed
msgr: skipped message on peer reconnect
Added by Sage Weil over 9 years ago.
Updated over 8 years ago.
Description
ubuntu@teuthology:/a/teuthology-2014-11-09_23:06:01-krbd-next-testing-basic-multi/593102
-151> 2014-11-10 02:47:43.361223 7fca9f899700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fca9f899700 time 2014-11-10 02:47:43.353904
osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop)
ceph version 0.87-573-g6977d02 (6977d02f0d31c453cdf554a8f1796f290c1a3b89)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa870f]
2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x1205) [0x7ef125]
3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xaf) [0x7ef50f]
4: (Context::complete(int)+0x9) [0x679689]
5: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x28a) [0x93d89a]
6: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x504) [0x93e214]
7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x15a) [0x7d571a]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x656f9f]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x6579ff]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x652) [0xa99352]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa9aa80]
12: (()+0x7e9a) [0x7fcab96bde9a]
13: (clone()+0x6d) [0x7fcab7e6731d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
- Subject changed from osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop) to msgr: skipped message on peer reconnect
- Status changed from New to In Progress
- Assignee set to Sage Weil
-346> 2014-11-10 02:47:43.352252 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5910 ==== osd_sub_op_reply(client.4115.1:23700 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3360594615 0 0) 0x9d93b00 con 0x4ee9580
-340> 2014-11-10 02:47:43.352346 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
-336> 2014-11-10 02:47:43.352441 7fca93056700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).injecting socket failure
-334> 2014-11-10 02:47:43.352557 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5911 ==== osd_sub_op_reply(client.4115.1:23701 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (1067334680 0 0) 0x9d93dc0 con 0x4ee9580
-325> 2014-11-10 02:47:43.352638 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
-322> 2014-11-10 02:47:43.352647 7fca93359700 1 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).do_sendmsg error (104) Connection reset by peer
-320> 2014-11-10 02:47:43.352692 7fca93359700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).fault with nothing to send, going to standby
-315> 2014-11-10 02:47:43.352778 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5913 ==== osd_sub_op_reply(client.4115.1:23703 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3347756399 0 0) 0x9d93840 con 0x4ee9580
-307> 2014-11-10 02:47:43.352862 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
-259> 2014-11-10 02:47:43.356378 7fca93d63700 10 osd.4 7 new session 0x6fb6280 con=0x8d93420 addr=10.214.131.38:6805/6358
-257> 2014-11-10 02:47:43.356442 7fca93d63700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x4ef3500 sd=123 :6801 s=0 pgs=0 cs=0 l=0 c=0x8d93420).accept connect_seq 268 vs existing 267 state standby
-255> 2014-11-10 02:47:43.358457 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5914 ==== osd_sub_op_reply(client.4115.1:23704 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (956406496 0 0) 0x90d4840 con 0x4ee9580
-249> 2014-11-10 02:47:43.359241 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
-247> 2014-11-10 02:47:43.359363 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5915 ==== osd_sub_op_reply(client.4115.1:23705 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3699936919 0 0) 0x9d94080 con 0x4ee9580
-241> 2014-11-10 02:47:43.359445 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
-239> 2014-11-10 02:47:43.359549 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5916 ==== osd_sub_op_reply(client.4115.1:23706 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3239383959 0 0) 0x6ae8b00 con 0x4ee9580
- Status changed from In Progress to 12
- Assignee changed from Sage Weil to Greg Farnum
This has also been seen on the AsyncMessenger at /a/sage-2015-03-21_07:31:12-rados-master-distro-basic-multi/814945. I'll try again tomorrow (Friday).
- Assignee deleted (
Greg Farnum)
- Status changed from 12 to Can't reproduce
Also available in: Atom
PDF