Bug #8880 (closed)
msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")
Description
ubuntu@teuthology:/a/teuthology-2014-07-18_02:32:01-rados-master-testing-basic-plana/368391
0> 2014-07-18 06:51:15.386054 7f91762e3700 -1 msg/Pipe.cc: In function 'void Pipe::reader()' thread 7f91762e3700 time 2014-07-18 06:51:15.384818
msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")
ceph version 0.82-621-g0193d3a (0193d3aa29e912c00fcbde93ce5253afcc53534f)
1: (Pipe::reader()+0x15d7) [0xb5e0b7]
2: (Pipe::Reader::entry()+0xd) [0xb60d6d]
3: (()+0x7e9a) [0x7f919a3b6e9a]
4: (clone()+0x6d) [0x7f91989773fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum almost 10 years ago
Which daemon was this?
Looks like that commit does include the fix for #8504... :(
Updated by Greg Farnum almost 10 years ago
Sequence of events
*****************
osd.0
=======
pipe to osd.1 faulted with nothing to send
scrub_should_schedule says yes, for pg0.2
send scrub-reserve message to peer osd.1
but first, send osdmap to them because we don't know if they're on epoch 13 like we are
receive scrub-reserve-ack from osd.1
but fail, because it's message ID 1, and we're way past that!
osd.1
=======
handle_osd_map(14..15), while on 13
mark_down osd.0 as a result of 14
... time passes ...
accept new connection (which is from osd.0)
set up new Session for it (and we know it's osd.0 now)
receive osdmap 13 from osd.0
receive scrub-reserve from osd.0, enqueue it
dequeue scrub-reserve, process it and send ack
notice osd.0 connection has faulted, attempt to reconnect (we haven't gotten ack for our message!)
I've checked and osd.0 was marked down as part of map 15, and was not marked back up again afterwards. So for some reason osd.1 is letting it send commands anyway, and responding to them.
Distressingly, it's also impossible for osd.1 to tell osd.0 it's down by sending along the new maps on its cluster network, because they will also be "old messages". (Although at least then osd.0 will not be running.) Maybe the heartbeat connections will transmit this information properly...?
Updated by Greg Farnum almost 10 years ago
- Status changed from New to In Progress
And indeed there's just nothing here making sure the peer is actually active, nor even that it's the leader. I'm adding a check for aliveness.
Updated by Greg Farnum almost 10 years ago
- Status changed from In Progress to Fix Under Review
wip-8880, PR https://github.com/ceph/ceph/pull/2135. It's untested and needs a suite run and review.
Updated by Greg Farnum almost 10 years ago
- Assignee changed from Greg Farnum to Sage Weil
Updated by Greg Farnum over 9 years ago
- Status changed from Fix Under Review to In Progress
- Assignee changed from Sage Weil to Greg Farnum
Updated by Greg Farnum over 9 years ago
- Status changed from In Progress to Fix Under Review
New patches to split up the code more, as requested. :)
Updated by Sage Weil over 9 years ago
- Assignee changed from Greg Farnum to Sage Weil
Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-08-01_02:32:01-rados-master-testing-basic-plana/392461
Updated by Sage Weil over 9 years ago
- Status changed from 7 to Pending Backport
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (msgr)