Bug #8880 (closed)
msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")
Description
ubuntu@teuthology:/a/teuthology-2014-07-18_02:32:01-rados-master-testing-basic-plana/368391
0> 2014-07-18 06:51:15.386054 7f91762e3700 -1 msg/Pipe.cc: In function 'void Pipe::reader()' thread 7f91762e3700 time 2014-07-18 06:51:15.384818
msg/Pipe.cc: 1538: FAILED assert(0 == "old msgs despite reconnect_seq feature")
ceph version 0.82-621-g0193d3a (0193d3aa29e912c00fcbde93ce5253afcc53534f)
1: (Pipe::reader()+0x15d7) [0xb5e0b7]
2: (Pipe::Reader::entry()+0xd) [0xb60d6d]
3: (()+0x7e9a) [0x7f919a3b6e9a]
4: (clone()+0x6d) [0x7f91989773fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Greg Farnum almost 10 years ago
Which daemon was this?
Looks like that commit does include the fix for #8504... :(
Updated by Greg Farnum almost 10 years ago
Sequence of events
*****************
osd.0
=======
pipe to osd.1 faulted with nothing to send
scrub_should_schedule says yes, for pg0.2
send scrub-reserve message to peer osd.1
but first, send osdmap to them because we don't know if they're on epoch 13 like we are
receive scrub-reserve-ack from osd.1
but fail, because it's message ID 1, and we're way past that!
osd.1
=======
handle_osd_map(14..15), while on 13
mark_down osd.0 as a result of 14
... time passes ...
accept new connection (which is from osd.0)
set up new Session for it (and we know it's osd.0 now)
receive osdmap 13 from osd.0
receive scrub-reserve from osd.0, enqueue it
dequeue scrub-reserve, process it and send ack
notice osd.0 connection has faulted, attempt to reconnect (we haven't gotten ack for our message!)
I've checked and osd.0 was marked down as part of map 15, and was not marked back up again afterwards. So for some reason osd.1 is letting it send commands anyway, and responding to them.
Distressingly, it's also impossible for osd.1 to tell osd.0 it's down by sending along the new maps on its cluster network, because they will also be "old messages". (Although at least then osd.0 will not be running.) Maybe the heartbeat connections will transmit this information properly...?
Updated by Greg Farnum almost 10 years ago
- Status changed from New to In Progress
And indeed there's just nothing here making sure the peer is actually active, nor even that it's the leader. I'm adding a check for aliveness.
Updated by Greg Farnum almost 10 years ago
- Status changed from In Progress to Fix Under Review
wip-8880, PR https://github.com/ceph/ceph/pull/2135. It's untested and needs a suite run and review.
Updated by Greg Farnum almost 10 years ago
- Assignee changed from Greg Farnum to Sage Weil
Updated by Greg Farnum over 9 years ago
- Status changed from Fix Under Review to In Progress
- Assignee changed from Sage Weil to Greg Farnum
Updated by Greg Farnum over 9 years ago
- Status changed from In Progress to Fix Under Review
New patches to split up the code more, as requested. :)
Updated by Sage Weil over 9 years ago
- Assignee changed from Greg Farnum to Sage Weil
Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-08-01_02:32:01-rados-master-testing-basic-plana/392461
Updated by Sage Weil over 9 years ago
- Status changed from 7 to Pending Backport
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to Messengers
- Category deleted (msgr)