Project

General

Profile

Actions

Bug #10057

closed

msgr: skipped message on peer reconnect

Added by Sage Weil over 9 years ago. Updated over 8 years ago.

Status:
Can't reproduce
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-11-09_23:06:01-krbd-next-testing-basic-multi/593102

  -151> 2014-11-10 02:47:43.361223 7fca9f899700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fca9f899700 time 2014-11-10 02:47:43.353904
osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop)

 ceph version 0.87-573-g6977d02 (6977d02f0d31c453cdf554a8f1796f290c1a3b89)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa870f]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x1205) [0x7ef125]
 3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xaf) [0x7ef50f]
 4: (Context::complete(int)+0x9) [0x679689]
 5: (ReplicatedBackend::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x28a) [0x93d89a]
 6: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x504) [0x93e214]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x15a) [0x7d571a]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x656f9f]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x6579ff]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x652) [0xa99352]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa9aa80]
 12: (()+0x7e9a) [0x7fcab96bde9a]
 13: (clone()+0x6d) [0x7fcab7e6731d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues 1 (0 open1 closed)

Related to Ceph - Bug #10118: messenger drops messages between osdsCan't reproduceSamuel Just11/17/2014

Actions
Actions #1

Updated by Sage Weil over 9 years ago

  • Subject changed from osd/ReplicatedPG.cc: 6929: FAILED assert(repop_queue.front() == repop) to msgr: skipped message on peer reconnect
  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  -346> 2014-11-10 02:47:43.352252 7fca93056700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5910 ==== osd_sub_op_reply(client.4115.1:23700 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3360594615 0 0) 0x9d93b00 con 0x4ee9580
  -340> 2014-11-10 02:47:43.352346 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
  -336> 2014-11-10 02:47:43.352441 7fca93056700  0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).injecting socket failure
  -334> 2014-11-10 02:47:43.352557 7fca93056700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5911 ==== osd_sub_op_reply(client.4115.1:23701 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (1067334680 0 0) 0x9d93dc0 con 0x4ee9580
  -325> 2014-11-10 02:47:43.352638 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
  -322> 2014-11-10 02:47:43.352647 7fca93359700  1 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).do_sendmsg error (104) Connection reset by peer
  -320> 2014-11-10 02:47:43.352692 7fca93359700  0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x51f5500 sd=32 :6801 s=2 pgs=350 cs=267 l=0 c=0x4ee9580).fault with nothing to send, going to standby
  -315> 2014-11-10 02:47:43.352778 7fca93056700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5913 ==== osd_sub_op_reply(client.4115.1:23703 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3347756399 0 0) 0x9d93840 con 0x4ee9580
  -307> 2014-11-10 02:47:43.352862 7fca93056700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
  -259> 2014-11-10 02:47:43.356378 7fca93d63700 10 osd.4 7  new session 0x6fb6280 con=0x8d93420 addr=10.214.131.38:6805/6358
  -257> 2014-11-10 02:47:43.356442 7fca93d63700  0 -- 10.214.131.38:6801/6357 >> 10.214.131.38:6805/6358 pipe(0x4ef3500 sd=123 :6801 s=0 pgs=0 cs=0 l=0 c=0x8d93420).accept connect_seq 268 vs existing 267 state standby
  -255> 2014-11-10 02:47:43.358457 7fca93d63700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5914 ==== osd_sub_op_reply(client.4115.1:23704 0.b 4d78263b/rb.0.100b.749abb43.000000000d5b/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (956406496 0 0) 0x90d4840 con 0x4ee9580
  -249> 2014-11-10 02:47:43.359241 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
  -247> 2014-11-10 02:47:43.359363 7fca93d63700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5915 ==== osd_sub_op_reply(client.4115.1:23705 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3699936919 0 0) 0x9d94080 con 0x4ee9580
  -241> 2014-11-10 02:47:43.359445 7fca93d63700 20 osd.4 7 should_share_map osd.3 10.214.131.38:6805/6358 7
  -239> 2014-11-10 02:47:43.359549 7fca93d63700  1 -- 10.214.131.38:6801/6357 <== osd.3 10.214.131.38:6805/6358 5916 ==== osd_sub_op_reply(client.4115.1:23706 0.b 4e5387b/rb.0.100b.749abb43.000000000d5c/head//0 [] ondisk, result = 0) v2 ==== 169+0+0 (3239383959 0 0) 0x6ae8b00 con 0x4ee9580

Actions #2

Updated by Sage Weil about 9 years ago

  • Status changed from In Progress to 12
Actions #3

Updated by Samuel Just about 9 years ago

nvm

Actions #4

Updated by Greg Farnum about 9 years ago

  • Assignee changed from Sage Weil to Greg Farnum

This has also been seen on the AsyncMessenger at /a/sage-2015-03-21_07:31:12-rados-master-distro-basic-multi/814945. I'll try again tomorrow (Friday).

Actions #5

Updated by Sage Weil almost 9 years ago

  • Assignee deleted (Greg Farnum)
Actions #6

Updated by Sage Weil over 8 years ago

  • Status changed from 12 to Can't reproduce
Actions

Also available in: Atom PDF