Bug #8880

closed

msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")

Added by Sage Weil almost 10 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-07-18_02:32:01-rados-master-testing-basic-plana/368391

     0> 2014-07-18 06:51:15.386054 7f91762e3700 -1 msg/Pipe.cc: In function 'void Pipe::reader()' thread 7f91762e3700 time 2014-07-18 06:51:15.384818
msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")

 ceph version 0.82-621-g0193d3a (0193d3aa29e912c00fcbde93ce5253afcc53534f)
 1: (Pipe::reader()+0x15d7) [0xb5e0b7]
 2: (Pipe::Reader::entry()+0xd) [0xb60d6d]
 3: (()+0x7e9a) [0x7f919a3b6e9a]
 4: (clone()+0x6d) [0x7f91989773fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #1

Updated by Greg Farnum over 9 years ago

Which daemon was this?

Looks like that commit does include the fix for #8504... :(

Actions #2

Updated by Sage Weil over 9 years ago

  • Assignee set to Greg Farnum
Actions #3

Updated by Greg Farnum over 9 years ago

Sequence of events
*****************

osd.0
=======
pipe to osd.1 faulted with nothing to send
scrub_should_schedule says yes, for pg0.2
send scrub-reserve message to peer osd.1
but first, send osdmap to them because we don't know if they're on epoch 13 like we are
receive scrub-reserve-ack from osd.1
but fail, because it's message ID 1, and we're way past that!

osd.1
=======
handle_osd_map(14..15), while on 13
mark_down osd.0 as a result of 14
... time passes ...
accept new connection (which is from osd.0)
set up new Session for it (and we know it's osd.0 now)
receive osdmap 13 from osd.0
receive scrub-reserve from osd.0, enqueue it
dequeue scrub-reserve, process it and send ack
notice osd.0 connection has faulted, attempt to reconnect (we haven't gotten ack for our message!)

I've checked and osd.0 was marked down as part of map 15, and was not marked back up again afterwards. So for some reason osd.1 is letting it send commands anyway, and responding to them.
Distressingly, it's also impossible for osd.1 to tell osd.0 it's down by sending along the new maps on its cluster network, because they will also be "old messages". (Although at least then osd.0 will not be running.) Maybe the heartbeat connections will transmit this information properly...?
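
To make the sequence-number mismatch above concrete, here is a toy Python model of the two endpoints. It is only a sketch, not the Pipe.cc code: the names `PeerLink` and `replay_bug` are hypothetical, and it assumes that mark_down on osd.1 discards all per-peer sequence state while osd.0 keeps its own counters.

```python
class PeerLink:
    """One side's view of a connection to a single peer (toy model, not Ceph code)."""

    def __init__(self):
        self.out_seq = 0  # sequence number of the last message we sent
        self.in_seq = 0   # highest sequence number accepted from the peer

    def send(self):
        self.out_seq += 1
        return self.out_seq

    def receive(self, seq):
        # Models the failed assert: a message at or below in_seq looks "old".
        if seq <= self.in_seq:
            raise AssertionError("old msgs despite reconnect_seq feature")
        self.in_seq = seq

    def mark_down(self):
        # Assumption: marking the peer down forgets all per-peer state.
        self.out_seq = 0
        self.in_seq = 0


def replay_bug(messages_before_mark_down=5):
    """Replay the events from this comment; returns (ack_seq, osd0_in_seq, error)."""
    osd0_view_of_osd1 = PeerLink()
    osd1_view_of_osd0 = PeerLink()

    # Normal traffic from osd.1 to osd.0 before the map change.
    for _ in range(messages_before_mark_down):
        osd0_view_of_osd1.receive(osd1_view_of_osd0.send())

    # osd.1 processes the new maps and marks osd.0 down.
    osd1_view_of_osd0.mark_down()

    # osd.0 reconnects and sends its osdmap, then the scrub-reserve.
    osd1_view_of_osd0.receive(osd0_view_of_osd1.send())
    osd1_view_of_osd0.receive(osd0_view_of_osd1.send())

    # osd.1 answers on its fresh session, so the ack restarts at 1.
    ack_seq = osd1_view_of_osd0.send()
    try:
        osd0_view_of_osd1.receive(ack_seq)
    except AssertionError as err:
        return ack_seq, osd0_view_of_osd1.in_seq, str(err)
    return ack_seq, osd0_view_of_osd1.in_seq, None
```

In this model the ack arrives with sequence 1 while osd.0 has already accepted higher numbers, which matches the "message ID 1, and we're way past that" failure. With no prior traffic (`replay_bug(0)`) the ack is accepted, so the crash depends on osd.0 retaining state that osd.1 has thrown away.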

Actions #4

Updated by Greg Farnum over 9 years ago

  • Status changed from New to In Progress

And indeed there's just nothing here making sure the peer is actually active, nor even that it's the leader. I'm adding a check for aliveness.
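
As an illustration of what an aliveness check could look like, the following Python sketch drops a peer's message unless the receiver's current map still lists that peer as up. It is not the patch from this ticket: `ToyOSDMap`, `peer_is_alive`, and `handle_scrub_reserve` are hypothetical names, and the real check may use different criteria.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSDMap:
    """Minimal stand-in for an OSD map: an epoch plus the set of OSD ids that are up."""
    epoch: int
    up: set = field(default_factory=set)

    def is_up(self, osd_id):
        return osd_id in self.up


def peer_is_alive(osdmap, sender_id):
    # Hypothetical aliveness check: trust only peers our current map says are up.
    return osdmap.is_up(sender_id)


def handle_scrub_reserve(osdmap, sender_id):
    """Return the reply to send, or None if the message is dropped."""
    if not peer_is_alive(osdmap, sender_id):
        # The sender is down in our map, so do not process or acknowledge it.
        return None
    return "scrub-reserve-ack"
```

With a map at epoch 15 in which osd.0 is down, `handle_scrub_reserve(ToyOSDMap(15, {1}), 0)` returns `None`, so osd.1 never sends the ack that triggered the assert on osd.0.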

Actions #5

Updated by Greg Farnum over 9 years ago

  • Status changed from In Progress to Fix Under Review

wip-8880, PR https://github.com/ceph/ceph/pull/2135. It's untested and needs a suite run and review.

Actions #6

Updated by Greg Farnum over 9 years ago

  • Assignee changed from Greg Farnum to Sage Weil
Actions #7

Updated by Greg Farnum over 9 years ago

  • Status changed from Fix Under Review to In Progress
  • Assignee changed from Sage Weil to Greg Farnum
Actions #8

Updated by Greg Farnum over 9 years ago

  • Status changed from In Progress to Fix Under Review

New patches to split up the code more, as requested. :)

Actions #9

Updated by Sage Weil over 9 years ago

  • Status changed from Fix Under Review to 7
Actions #10

Updated by Sage Weil over 9 years ago

  • Assignee changed from Greg Farnum to Sage Weil
Actions #11

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-08-01_02:32:01-rados-master-testing-basic-plana/392461

Actions #12

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
Actions #13

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved
Actions #14

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)