Project

General

Profile

Bug #8504

msgr: FAILED assert(0 == "old msgs despite reconnect_seq feature")

Added by Greg Farnum about 5 years ago. Updated 6 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
06/02/2014
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

ubuntu@teuthology:/a/gregf-2014-05-29_16:19:17-rados-wip-client-fast-dispatch-testing-basic-plana/280874/remote/plana65/log/ceph-osd.0.log.gz

It looks to me like the other end restarted entirely and we didn't reset our in_seq appropriately.


Related issues

Duplicated by Ceph - Bug #8512: osd crashed at Pipe::reader in firefly Duplicate 06/02/2014
Duplicated by Ceph - Bug #8887: osd crashes at assert(e.version > info.last_update): PG:add_log_entry Duplicate 07/21/2014

History

#1 Updated by Greg Farnum about 5 years ago

Probably a bug I and Guang introduced in commit:bdee119076dd0eb65334840d141ccdf06091e3c9, but perhaps not?

#2 Updated by Samuel Just about 5 years ago

  • Priority changed from High to Urgent

#3 Updated by Samuel Just about 5 years ago

  • Assignee set to Greg Farnum

#4 Updated by Greg Farnum about 5 years ago

  • Status changed from New to In Progress
  • Priority changed from Urgent to High

Not a bug we introduced, but the existing patch only fixes the bug on the accept side. These crashes are on the connect side, which does not reset its expectations.

#5 Updated by Greg Farnum about 5 years ago

Okay, so the crashed OSD (osd.5) was marked down by the other endpoint (osd.4). Then osd.5 reconnected to osd.4, which for some reason allowed it. (osd.4 marked the connection down again as soon as it got a message delivered.)
osd.4 sent back an "existing_seq" of 0 as part of the final connection reply, which means that it had already been set to 0 in the existing Pipe's in_seq member. Which only happens if accept() detects "is_reset_from_peer" by comparing the provided and existing connect_seq (here, we get the connect_seq value we expect from a not-reset peer).

So it looks like osd.4 called mark_down on the addr-based interface and nuked the Pipe from handle_osd_map()...but then it sent a pg_info out along the Connection interface from a PG. Since it wasn't a lazy send_message(), that starts a new Pipe in connect().
Meanwhile, osd.5 tried to reconnect, and so osd.4's accept() races with its connect(), resulting in the behavior we're witnessing. That's definitely not supposed to happen; get_con_osd_cluster() uses get_nextmap_reserved() to make sure it's up-to-date. It could happen if we get that Connection (thereby dropping the map reservation), and then the Connection is marked down, and then we use it to send a Message — so that may be something we need to fix. But in this case, it doesn't look like that's what happened, because an earlier Message in the same thread appears to have been dropped from send_message_osd_cluster() due to the map marking the OSD out.
So I'm still trying to understand what's actually happened here; I'm going to check out the other failure and see if I can find anything else there.

#6 Updated by Greg Farnum about 5 years ago

The one I originally reported is much simpler:
osd.0 got marked down
osd.1 called mark_down and then forgot about it
osd.0 sent another message, opening a session
osd.1 said "oh, a new connection" and opened it up (with CEPH_MSGR_TAG_READY)
osd.0 kept around its old state, because it didn't get a RESET of any kind.

So I think we have two issues:
1) The OSD is (somehow) going backwards in time in terms of who it wants to connect to
2) the use of (mid-2012) policy.resetcheck is preventing us from exposing necessary information about when connections are actually going away. I want to discuss with Sam and Sage on these, respectively.

#7 Updated by Greg Farnum about 5 years ago

Greg Farnum wrote:

So it looks like osd.4 called mark_down on the addr-based interface and nuked the Pipe from handle_osd_map()...but then it sent a pg_info out along the Connection interface from a PG. Since it wasn't a lazy send_message(), that starts a new Pipe in connect().
Meanwhile, osd.5 tried to reconnect, and so osd.4's accept() races with its connect(), resulting in the behavior we're witnessing. That's definitely not supposed to happen; get_con_osd_cluster() uses get_nextmap_reserved() to make sure it's up-to-date. It could happen if we get that Connection (thereby dropping the map reservation), and then the Connection is marked down, and then we use it to send a Message — so that may be something we need to fix.

Realized in discussion with Sage that this can't happen — if the Connection still exists, it's marked as "failed" by the mark_down() call.

#8 Updated by Greg Farnum about 5 years ago

OSD bug:
1) In handle_osd_map(), we need to wait until nobody is using a (previously-)reserved map prior to doing any mark_down() calls. I think we can just insert a call to await_reserved_maps() here, but need to make sure.
2) The interface around OSDService::get_con_osd_cluster() is basically broken; we need a reserved map to stick around for as long as the ConnectionRef does. We can probably pull an RAII trick here...?

Messenger bug:
There is still an issue, illustrated in my original set of logs:
osd.a marks down osd.b and forgets about it (by not having any outstanding refs to the Connection)
osd.b sends a new message to osd.a, and attempts to reopen the Connection
osd.a doesn't have anything about osd.b, but happily echoes back the global and connect seqs it is given
osd.b crashes on the message seq values from osd.a, never having had a way to find out its Connection was reset on the other side :(

#9 Updated by Greg Farnum about 5 years ago

get_con_osd_cluster might be okay, since a marked_down Connection won't send any outgoing messages, which is what we want. The problem is doing the Connection lookups and having them race with a mark_down. (send_message_osd_cluster is broken in the same way.)

#10 Updated by Greg Farnum about 5 years ago

Of course the lookup is supposed to be protected by the map reservation system, but since we're doing mark_down prior to waiting for map reservations we don't have the guarantees we want. (That's just problem 1, making the rest of the stuff look busted.)

#11 Updated by Greg Farnum about 5 years ago

  • Status changed from In Progress to Testing

Okay, and after more discussion (thanks Sage!), the "Messenger bug" is the same as the OSD bug — the surviving OSD otherwise wouldn't try and send the dead OSD any Messages, and it only does so here because of the mark_down/get_connection() ordering.

#12 Updated by Greg Farnum about 5 years ago

  • Status changed from Testing to Pending Backport

Merged to next in commit:8697d6a10ee822e8a032a559c78e211e2182799a

Unfortunately back porting it will require a little more effort; there was some groundwork for fast dispatch which made it easier but doesn't exist in Firefly or Dumpling.

#13 Updated by Greg Farnum about 5 years ago

  • Status changed from Pending Backport to Resolved

This doesn't actually require any backports. :)

Prior to the fast dispatch work, the pre_publish_lock was held by both send_message_osd_cluster() and pre_publish_map(); the issue was triggered by changing send_message_osd_cluster to use the map reservation system instead.

#14 Updated by Greg Farnum 6 months ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)

Also available in: Atom PDF