
Bug #8232

Race condition during messenger rebind

Added by Guang Yang almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the system is under high load, we observed the following assertion failure:
-----------------------------------------------
-11> 2014-04-24 11:16:13.124561 7ff4ecd78700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.177:6849/27341 pipe(0xc12d180 sd=55 :52037 s=2 pgs=1734 cs=11 l=0 c=0x13cf62c0).fault, initiating reconnect
-10> 2014-04-24 11:16:13.124613 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8840 -- pipe dne
 -9> 2014-04-24 11:16:13.124632 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8580 -- pipe dne
 -8> 2014-04-24 11:16:13.124629 7ff4e600b700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.186:6809/5357 pipe(0xc007b80 sd=21 :39223 s=2 pgs=1693 cs=15 l=0 c=0x13fdfa20).fault, initiating reconnect
 -7> 2014-04-24 11:16:13.124624 7ff4cd784700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7ff4cd784700 time 2014-04-24 11:16:13.096027
msg/Pipe.cc: 1045: FAILED assert(m)

ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
1: (Pipe::connect()+0x3cb5) [0x992435]
2: (Pipe::writer()+0x5f3) [0x992d83]
3: (Pipe::Writer::entry()+0xd) [0x99ddfd]
4: /lib64/libpthread.so.0() [0x3fb1807851]
5: (clone()+0x6d) [0x3fb14e890d]
------------------------------------------------------

There was an issue tracking this - http://tracker.ceph.com/issues/6992.

After checking the related code, I think there is still a race condition that could
lead to the assertion failure we observed. The race unfolds as follows:
1. The dispatcher thread detects an OSD map change and triggers a local rebind
(SimpleMessenger::rebind).
2. Within the rebind method, it first stops the accepter (this is asynchronous!) and
then marks down all related local pipes and connections.
3. The accepter then rebinds (creating another thread bound to a different
address).

The race can happen at step #2: because stopping the accepter is asynchronous, there
is a chance that after the dispatcher thread has marked down all pipes, the accepter
thread has not yet stopped and will still put items into the pipe queue. The higher
the system load, the more likely this race is to occur.
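
For illustration, the ordering in question looks roughly like this (a simplified sketch following the steps above, not the exact SimpleMessenger source):

void SimpleMessenger::rebind(const std::set<int>& avoid_ports) {
  accepter.stop();              // step 2: ask the accepter thread to stop;
                                //         the thread exits asynchronously
  mark_down_all();              // step 2: the dispatcher tears down every
                                //         local pipe and connection
  accepter.rebind(avoid_ports); // step 3: bind to a new address and start
                                //         a new accepter thread
}
// Race: between mark_down_all() and the moment the old accepter thread
// actually exits, that thread can still accept() a socket and put a fresh
// pipe into the pipe queue, so the new pipe survives the teardown.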

A simple fix is to mark down all pipes from within the accepter thread before and
after its loop.

msgr_crash.log - osd log before crash (1020 KB) Guang Yang, 05/06/2014 08:03 PM


Related issues

Duplicated by Ceph - Bug #8497: OSDs stats getting DOWN , OUT , when i starts putting data to cluster Duplicate 06/02/2014

History

#1 Updated by Guang Yang almost 10 years ago

Please help review the pull request - https://github.com/ceph/ceph/pull/1733

#2 Updated by Guang Yang almost 10 years ago

A simple fix is to mark down all pipes from within the accepter thread before and
after its loop.

We cannot invoke mark_down_all after the loop of the accepter thread, because we cannot guarantee whether the new thread has started yet, so there could be a race that removes valid entries. I think it is safe (and effective) to clear all pipes from within the new accepter thread before it starts working. Please correct me if that is not true.
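
A rough sketch of that placement (illustrative only; names approximate the Accepter/SimpleMessenger code, and the actual change is in the linked pull request):

void *Accepter::entry() {
  // Proposed: clear any pipes that slipped in while the previous accepter
  // thread was still winding down. Doing this from the new accepter thread,
  // before it starts accepting, cannot race with a thread that has not been
  // started yet.
  msgr->mark_down_all();

  while (!done) {
    int sd = ::accept(listen_sd, NULL, NULL);  // wait for an incoming connection
    if (sd >= 0)
      msgr->add_accept_pipe(sd);               // hand the socket to a new Pipe
  }
  return 0;
}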

Thanks.

#3 Updated by Ian Colle almost 10 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Guang Yang to Greg Farnum

Greg, please review the linked PR.

#4 Updated by Greg Farnum almost 10 years ago

  • Status changed from Fix Under Review to 15
  • Assignee changed from Greg Farnum to Ian Colle

#6992 was backported, but there's not been a Dumpling release containing the backport yet. A new patch isn't needed, just a release containing the old one.

#5 Updated by Greg Farnum almost 10 years ago

Actually Guang Yang, do you have a copy of the full backtrace you're concerned about? It's a little mangled from the Redmine markup as provided here, but I think it looks like an entirely different sort of issue. (Maybe from having to repeatedly rebind quickly enough that the Pipes haven't yet managed to exit the connect stage of one before it gets to the other?)

#6 Updated by Guang Yang almost 10 years ago

Following are some log snippets related to the messenger traffic between the two hosts before the crash:

  -259> 2014-04-25 09:18:54.791225 7f778cf7e700  2 -- 10.193.207.183:6890/6044082 >> 10.193.207.178:0/52313 pipe(0x17a90000 sd=853 :6890 s=2 pgs=3192 cs=1 l=1 c=0x12b2f1e0).reader couldn't read tag, Success
  -258> 2014-04-25 09:18:55.059284 7f778cf7e700  2 -- 10.193.207.183:6890/6044082 >> 10.193.207.178:0/52313 pipe(0x17a90000 sd=853 :6890 s=2 pgs=3192 cs=1 l=1 c=0x12b2f1e0).fault 0: Success
   -95> 2014-04-25 09:18:54.923803 7f77d0b41700  2 -- 10.193.207.183:6919/6044082 >> 10.193.207.178:0/48932 pipe(0x11eccc80 sd=248 :6919 s=2 pgs=2452 cs=1 l=1 c=0x124278c0).reader couldn't read tag, Success
   -94> 2014-04-25 09:18:55.086365 7f77d0b41700  2 -- 10.193.207.183:6919/6044082 >> 10.193.207.178:0/48932 pipe(0x11eccc80 sd=248 :6919 s=2 pgs=2452 cs=1 l=1 c=0x124278c0).fault 0: Success
   -68> 2014-04-25 09:18:54.936126 7f77dc94a700  2 -- 10.193.207.183:6890/6044082 >> 10.193.207.178:0/48932 pipe(0x12a4b680 sd=237 :6890 s=2 pgs=2451 cs=1 l=1 c=0x133afe40).reader couldn't read tag, Success
   -67> 2014-04-25 09:18:55.087309 7f77dc94a700  2 -- 10.193.207.183:6890/6044082 >> 10.193.207.178:0/48932 pipe(0x12a4b680 sd=237 :6890 s=2 pgs=2451 cs=1 l=1 c=0x133afe40).fault 0: Success
   -62> 2014-04-25 09:18:54.937628 7f77cf42a700  2 -- 10.193.207.183:6889/6044082 >> 10.193.207.178:6881/4044612 pipe(0x11ef7180 sd=259 :48784 s=4 pgs=3429 cs=1 l=0 c=0x1100ac60).reader couldn't read tag, Success
   -61> 2014-04-25 09:18:55.087593 7f77cf42a700  2 -- 10.193.207.183:6889/6044082 >> 10.193.207.178:6881/4044612 pipe(0x11ef7180 sd=259 :48784 s=4 pgs=3429 cs=1 l=0 c=0x1100ac60).fault 0: Success
    -1> 2014-04-25 09:18:55.134989 7f77d356b700  2 -- 10.193.207.183:6889/6044082 >> 10.193.207.178:6817/5052082 pipe(0x13374500 sd=128 :50975 s=1 pgs=0 cs=0 l=0 c=0x123b4b00). got newly_acked_seq 3 vs out_seq 0
     0> 2014-04-25 09:18:55.135839 7f77d356b700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f77d356b700 time 2014-04-25 09:18:55.135010

I also have the core dump, but it is gigabytes in size, so I need to find a way to share it.

BTW, apart from reading the code, is there any good document introducing the messenger component within Ceph?

Thanks.

#7 Updated by Greg Farnum almost 10 years ago

  • Status changed from 15 to 12
  • Assignee deleted (Ian Colle)
  • Priority changed from Normal to High

I need to study this more, but it looks like we're not correctly handling message resending during repeated reconnects. Updating the ticket to reflect that it's a new bug.

#8 Updated by Greg Farnum almost 10 years ago

Guang Yang, this is interesting:

  -1> 2014-04-25 09:18:55.134989 7f77d356b700  2 -- 10.193.207.183:6889/6044082 >> 10.193.207.178:6817/5052082 pipe(0x13374500 sd=128 :50975 s=1 pgs=0 cs=0 l=0 c=0x123b4b00). got newly_acked_seq 3 vs out_seq 0

It indicates that the other side (which we're connecting to) believes it has seen messages up to seq 3 (I think that's 4 messages received, but whatever), whereas the local Pipe thinks it hasn't sent any messages. We do reset the out_seq in a few circumstances, basically when our TCP connection has failed or we've marked our peer down. But none of those should allow the other side to have any memory of our previous connection existing. If you can reproduce this with "debug ms = 10" (or even 20) it would really help.
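
For reference, the logic behind those log lines (in Pipe::connect(), around the failing assert) is roughly the following; this is a simplified reconstruction from the log output, not the exact msg/Pipe.cc source:

uint64_t newly_acked_seq = 0;
tcp_read((char*)&newly_acked_seq, sizeof(newly_acked_seq)); // the peer's CEPH_MSGR_TAG_SEQ reply

while (newly_acked_seq > out_seq) {
  Message *m = _get_next_outgoing(); // next message we believe is still unsent
  assert(m);  // "FAILED assert(m)": the peer acked more messages (seq 3) than
              // this freshly reset pipe ever queued (out_seq == 0), so the
              // outgoing queue runs dry and the assert fires
  // the "discarding previously sent ..." log lines correspond to dropping m here
  m->put();
  ++out_seq;
}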

Otherwise, you should be able to upload the coredump with ceph-post-file, but if you're running a self-brewed version of the binaries we won't be able to do anything with it. :(

#9 Updated by Greg Farnum almost 10 years ago

Oh, and no, there's not a good way to learn about the messenger except by code reading, sorry.

#10 Updated by Guang Yang almost 10 years ago

Thanks Greg.

I uploaded the core dump to Dropbox so you can access it here - https://www.dropbox.com/s/35l92m13h6z203s/ceph-osd_msgr_seq_mismatch_crash_yguang.

I used to be able to reproduce this issue constantly (with the system load average extremely high); however, after we rebooted the box the system load looks good and I haven't reproduced it since.

That's fine. I've just started learning the details of the messenger component, which is why I asked whether there is a design doc for it. Thanks Greg!

#11 Updated by Guang Yang almost 10 years ago

 -7884> 2014-05-04 06:16:04.048605 7f7136537700  1 -- 10.193.207.180:6884/1037605 --> 10.193.207.183:6958/5001307 -- osd_map(30566..30566 src has 29862..30566) v3 -- ?+0 0x1db26240 con 0xe3e1e40
 -5368> 2014-05-04 06:16:06.024453 7f710a22c700  2 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x12686780 sd=690 :6884 s=2 pgs=14435 cs=1 l=0 c=0xe3e1e40).reader couldn't read tag, Success
 -5358> 2014-05-04 06:16:07.694577 7f710a22c700  2 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x12686780 sd=690 :6884 s=2 pgs=14435 cs=1 l=0 c=0xe3e1e40).fault 0: Success
 -5356> 2014-05-04 06:16:07.694599 7f710a22c700  0 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x12686780 sd=690 :6884 s=2 pgs=14435 cs=1 l=0 c=0xe3e1e40).fault, initiating reconnect
  -365> 2014-05-04 06:16:07.908963 7f7138d3b700  5 -- 10.193.207.180:6884/1037605 mark_down_all 10.193.207.183:6958/5001307 0x12686780
  -364> 2014-05-04 06:16:07.909102 7f7127403700  2 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x12686780 sd=63 :60500 s=4 pgs=14635 cs=3 l=0 c=0xe3e1e40).reader couldn't read tag, Success
  -363> 2014-05-04 06:16:07.909142 7f7127403700  2 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x12686780 sd=63 :60500 s=4 pgs=14635 cs=3 l=0 c=0xe3e1e40).fault 0: Success
    -2> 2014-05-04 06:16:07.957897 7f71063ee700  2 -- 10.193.207.180:6884/1037605 >> 10.193.207.183:6958/5001307 pipe(0x1cfa1400 sd=132 :60749 s=1 pgs=0 cs=0 l=0 c=0x12b4e160). got newly_acked_seq 10 vs out_seq 0

I checked a couple of crash logs and they share the same pattern; I chose the one above to explain the problem here. Let us say S is the server (hb_messenger, cluster_messenger, etc.) and C is a messenger client.
1. The S side detects some issue and triggers a messenger rebind, which stops the pipe (asynchronously making the corresponding pipe reader/writer threads stop themselves) and also shuts down the socket.
2. On the S side, the reader/writer thread notices something wrong with the socket when trying to read/write, so it issues a reconnect. The crash happened during that reconnect.
3. If I understand correctly, the S side should not reconnect in the first place after rebinding itself, as the connection was originally established from the C side.

To fix this, on the S side we set the policy to server each time we accept a new connection, which disables reconnects from the S side when something bad happens to the connection.
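
To illustrate the idea (a hypothetical sketch with made-up names, not the actual Ceph Messenger::Policy API):

struct ConnectionPolicy {
  bool server;   // passive side: never initiates a reconnect
};

void on_socket_fault(const ConnectionPolicy& policy) {
  if (policy.server) {
    // S side: stand by (or close) and wait for C to re-establish the connection
    return;
  }
  // only the active (client) side would initiate the reconnect here
}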

#12 Updated by Guang Yang almost 10 years ago

If the above analysis makes sense, here is the pull request: https://github.com/ceph/ceph/pull/1765

#13 Updated by Guang Yang almost 10 years ago

After understanding more of the messenger code, I found that the above analysis was wrong regarding the reconnect logic: there is actually no client/server distinction between the two ends of the connection, and either side can initiate a reconnect once it detects a failure. Sorry about the wrong assumption.

Looking at the log, the connection that crashed appears to be the very first one (cs = 0, pgs = 0), but the log entries above show that there had in fact already been communication between the two sides.

#14 Updated by Greg Farnum almost 10 years ago

Guang Yang wrote:

After understanding more of the messenger code, I found that the above analysis was wrong regarding the reconnect logic: there is actually no client/server distinction between the two ends of the connection, and either side can initiate a reconnect once it detects a failure. Sorry about the wrong assumption.

Just to be clear, there is a client and server concept in the messenger, but in the case of the OSDs they are peers, so either side can reconnect. Not sure if that's what you meant.

Regarding the bug: I think the "partial reset", in which we have an out_seq==0 but the other side has newly_acked_seq>0, is the biggest hint. If you have a crash log which contains that reset sequence in addition to the actual crash, it shouldn't be too difficult for us to identify the problem (assuming it has any messenger output, at least).

#15 Updated by Guang Yang almost 10 years ago

Just to be clear, there is a client and server concept in the messenger, but in the case of the OSDs they are peers, so either side can reconnect. Not sure if that's what you meant.

That is what I meant; sorry for the confusion. At the beginning, I thought there were two TCP connections between the two messengers.

Unfortunately I don't have a verbose messenger log from when the issue occurred; I am attaching what I have for reference.

As always, thanks Greg!

#16 Updated by Greg Farnum almost 10 years ago

Hmm, so this is also following a mark_down_all. That would certainly explain why we no longer have any data on our peers — we killed and reset everything!
But why is the peer reconnecting to us without resetting, and why are we allowing it to do so? Those are the questions to look at, and it might just be that we aren't handling "bad" peers correctly.

Update: disregard, for some reason I was thinking we were in accept rather than connect.

#17 Updated by Guang Yang almost 10 years ago

Copied an email thread talking about this issue below for reference...

On Thu, May 8, 2014 at 12:12 AM, Guang <yguang11@yahoo.com> wrote:
Thanks Sage and Greg!

I just captured a crash for which I have the logs from both sides (sadly we
still use debug_ms=0, as increasing the debug_ms level consumes dramatically
more disk space). From the logs, I think I have some more clues about how this
failure happened.

Let us say we have two messengers, C and S; the C side did a rebind and
cleared everything.
The logs at C side are:
————
-6> 2014-05-07 12:29:35.362448 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). got newly_acked_seq 59 vs out_seq 0
-5> 2014-05-07 12:29:35.362487 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 osd_map(36192..36196 src has 35392..36196) v3
-4> 2014-05-07 12:29:35.362504 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_notify(4.4d21(21) epoch 36196) v4
-3> 2014-05-07 12:29:35.362530 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_notify(3.4d22(21) epoch 36196) v4
-2> 2014-05-07 12:29:35.362561 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_query(4.3279 epoch 36196) v2
-1> 2014-05-07 12:29:35.362571 7fcc26e8c700  2 -- 10.193.207.191:6918/24089488 >> 10.193.207.182:6982/57027253 pipe(0x19cfec80 sd=139 :48953 s=1 pgs=0 cs=0 l=0 c=0x13d62c60). discarding previously sent 0 pg_query(3.327a epoch 36196) v2
 0> 2014-05-07 12:29:35.363457 7fcc26e8c700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fcc26e8c700 time 2014-05-07 12:29:35.362587
msg/Pipe.cc: 1045: FAILED assert(m)
————
The logs at S side are:
———
2014-05-07 12:29:09.820735 7eff50794700  0 -- 10.193.207.182:6982/57027253 >> 10.193.207.191:6918/24089488 pipe(0x170bf680 sd=63 :57524 s=2 pgs=24805 cs=1 l=0 c=0x1691b340).fault with nothing to send, going to standby
2014-05-07 12:29:35.361030 7eff40042700 10 osd.363 36196  new session 0x17d33500 con=0x17fdf080 addr=10.193.207.191:6918/24089488
2014-05-07 12:29:35.361068 7eff40042700  0 -- 10.193.207.182:6982/57027253 >> 10.193.207.191:6918/24089488 pipe(0xfe07400 sd=343 :6982 s=0 pgs=0 cs=0 l=0 c=0x17fdf080).accept connect_seq 0 vs existing 1 state standby
2014-05-07 12:29:35.361078 7eff40042700  0 -- 10.193.207.182:6982/57027253 >> 10.193.207.191:6918/24089488 pipe(0xfe07400 sd=343 :6982 s=0 pgs=0 cs=0 l=0 c=0x17fdf080).accept peer reset, then tried to connect to us, replacing
———
And it looks like the following round trips happened between the two sides:
1. C tried to establish a new connection to S (when it had some message to
send) with cs=0 (and a new, larger pgs).
2. On the S side, there was an existing connection with C (cs=1, pgs=24805,
in_seq > 59).
3. This is what happened on the S side:
     3.1 It looks up the pipe and finds an existing one -
https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L445
     3.2 The execution flow reaches the code that replaces the existing pipe -
https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L473
     3.3 As this is a lossy connection, the original pipe is cleared.
     3.4 It then reaches the open section
(https://github.com/ceph/ceph/blob/master/src/msg/Pipe.cc#L649) and echoes
back a reply with tag CEPH_MSGR_TAG_SEQ carrying the seq from the original
pipe.
4. On the C side:
4.1 As this is a brand new pipe, out_seq is zero; however, the reply arrives
carrying the original newly_acked_seq, which triggers the assertion failure
we observed.

If this makes sense, then at step 3.4 we probably should not echo the original
seq when this is a brand new connection attempt (cs = 0); see the sketch after
this email thread.

Does this look reasonable?

Ah! Yes, you're exactly right. This error makes sense, too: the
CEPH_MSGR_TAG_SEQ handling is much newer than the rest of the reconnect logic.

On Thu, May 8, 2014 at 4:58 AM, Guang <yguang11@yahoo.com> wrote:
I think the next question is: why did C fail to find a pipe (locally) for S
while there was actually a pipe to C still existing on the S side? Is it
possible that C unregistered the pipe but S was not yet aware of that?

Exactly. In the other logs I've seen, C has done a mark_down_all()
(which kills every single pipe) while doing a rebind(); generally this
is as a consequence of getting an OSDMap which says it's down. If S
has not yet seen this map, it won't have nuked the Connection on its
side.

Would you like to prepare a patch, or should I? :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
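
For illustration, the guard described at step 3.4 would look roughly like this in the accept path (a hypothetical sketch, not the actual patch; see the PR in the next update):

// When replacing an existing pipe during accept:
uint64_t existing_seq = existing->in_seq;  // what S has already received from C

if (connect.connect_seq > 0) {
  // genuine reconnect: tell C what we already received so it can trim its queue
  reply_tag = CEPH_MSGR_TAG_SEQ;
} else {
  // brand new attempt (cs = 0): C has no memory of the old session, so echoing
  // the old seq would make it walk an empty resend queue and hit assert(m)
  existing_seq = 0;
}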

#18 Updated by Guang Yang almost 10 years ago

Hi Greg,
Please help review the new PR - https://github.com/ceph/ceph/pull/1794

#19 Updated by Greg Farnum almost 10 years ago

  • Status changed from 12 to Pending Backport
  • Assignee set to Greg Farnum

bdee119076dd0eb65334840d141ccdf06091e3c9 needs to be backported to everything once it's passed through some testing.

#20 Updated by Greg Farnum almost 10 years ago

  • Status changed from Pending Backport to Resolved
