Actions
Bug #10057
closedmsgr: skipped message on peer reconnect
Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
ubuntu@teuthology:/a/teuthology-2014-11-09_23:06:01-krbd-next-testing-basic-multi/593102
-151> 2014-11-10 02:47:43.361223 7fca9f899700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fca9f899700 time 2014-11-10 02:47:43.353904 osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop) ceph version 0.87-573-g6977d02 (6977d02f0d31c453cdf554a8f1796f290c1a3b89) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa870f] 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x1205) [0x7ef125] 3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xaf) [0x7ef50f] 4: (Context::complete(int)+0x9) [0x679689] 5: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x28a) [0x93d89a] 6: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x504) [0x93e214] 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x15a) [0x7d571a] 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x656f9f] 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x6579ff] 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x652) [0xa99352] 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa9aa80] 12: (()+0x7e9a) [0x7fcab96bde9a] 13: (clone()+0x6d) [0x7fcab7e6731d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Sage Weil over 9 years ago
- Subject changed from osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop) to msgr: skipped message on peer reconnect
- Status changed from New to In Progress
- Assignee set to Sage Weil
-346> 2014-11-10 02:47:43.352252 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5910 ==== osd_sub_op_reply(client.4115.1:23700 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3360594615 0 0) 0x9d93b00 con 0x4ee9580 -340> 2014-11-10 02:47:43.352346 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7 -336> 2014-11-10 02:47:43.352441 7fca93056700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).injecting socket failure -334> 2014-11-10 02:47:43.352557 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5911 ==== osd_sub_op_reply(client.4115.1:23701 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (1067334680 0 0) 0x9d93dc0 con 0x4ee9580 -325> 2014-11-10 02:47:43.352638 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7 -322> 2014-11-10 02:47:43.352647 7fca93359700 1 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).do_sendmsg error (104) Connection reset by peer -320> 2014-11-10 02:47:43.352692 7fca93359700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).fault with nothing to send, going to standby -315> 2014-11-10 02:47:43.352778 7fca93056700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5913 ==== osd_sub_op_reply(client.4115.1:23703 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3347756399 0 0) 0x9d93840 con 0x4ee9580 -307> 2014-11-10 02:47:43.352862 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7 -259> 2014-11-10 02:47:43.356378 7fca93d63700 10 osd.4 7 new session 0x6fb6280 con=0x8d93420 addr=10.214.131.38:6805/6358 -257> 2014-11-10 02:47:43.356442 7fca93d63700 0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x4ef3500 sd=123 :6801 s=0 pgs=0 cs=0 l=0 c=0x8d93420).accept connect_seq 268 vs existing 267 state standby -255> 2014-11-10 02:47:43.358457 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5914 ==== osd_sub_op_reply(client.4115.1:23704 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (956406496 0 0) 0x90d4840 con 0x4ee9580 -249> 2014-11-10 02:47:43.359241 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7 -247> 2014-11-10 02:47:43.359363 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5915 ==== osd_sub_op_reply(client.4115.1:23705 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3699936919 0 0) 0x9d94080 con 0x4ee9580 -241> 2014-11-10 02:47:43.359445 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7 -239> 2014-11-10 02:47:43.359549 7fca93d63700 1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5916 ==== osd_sub_op_reply(client.4115.1:23706 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3239383959 0 0) 0x6ae8b00 con 0x4ee9580
Updated by Greg Farnum about 9 years ago
- Assignee changed from Sage Weil to Greg Farnum
This has also been seen on the AsyncMessenger at /a/sage-2015-03-21_07:31:12-rados-master-distro-basic-multi/814945. I'll try again tomorrow (Friday).
Actions