Project

General

Profile

Actions

Bug #14266

closed

"FAILED assert(existing->state == STATE_CONNECTING)" in rados-jewel-distro-basic-smithi

Added by Yuri Weinstein over 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Run: http://pulpito.ceph.com/teuthology-2016-01-05_19:00:02-rados-jewel-distro-basic-smithi/
Job: 14107
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-01-05_19:00:02-rados-jewel-distro-basic-smithi/14107/teuthology.log

2016-01-05T20:32:23.206 INFO:teuthology.orchestra.run.smithi011.stderr:marked out osd.4.
2016-01-05T20:32:24.698 INFO:tasks.ceph.osd.3.smithi024.stderr:2016-01-05 23:32:24.695849 7f5626130700 -1 -- 172.21.15.24:6805/9648 >> 172.21.15.11:6807/25321 pipe(0x7f565b78d000 sd=181 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f565a3a3a80).accept race bad state, would send wait, existing=closing 0x7f565d4d2000.cseq=138 == 138
2016-01-05T20:32:24.709 INFO:tasks.ceph.osd.3.smithi024.stderr:msg/simple/Pipe.cc: In function 'int Pipe::accept()' thread 7f5626130700 time 2016-01-05 23:32:24.695857
2016-01-05T20:32:24.709 INFO:tasks.ceph.osd.3.smithi024.stderr:msg/simple/Pipe.cc: 561: FAILED assert(existing->state == STATE_CONNECTING)
2016-01-05T20:32:24.712 INFO:tasks.ceph.osd.3.smithi024.stderr: ceph version 10.0.1-618-gd0f4d2d (d0f4d2d6343b53ba1e992ff241e57d9e0e3a1c45)
2016-01-05T20:32:24.712 INFO:tasks.ceph.osd.3.smithi024.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f564f56b6d5]
2016-01-05T20:32:24.712 INFO:tasks.ceph.osd.3.smithi024.stderr: 2: (Pipe::accept()+0x4888) [0x7f564f64d0f8]
2016-01-05T20:32:24.713 INFO:tasks.ceph.osd.3.smithi024.stderr: 3: (Pipe::reader()+0x1c56) [0x7f564f64f6a6]
2016-01-05T20:32:24.713 INFO:tasks.ceph.osd.3.smithi024.stderr: 4: (Pipe::Reader::entry()+0xd) [0x7f564f657b3d]
2016-01-05T20:32:24.713 INFO:tasks.ceph.osd.3.smithi024.stderr: 5: (()+0x7df5) [0x7f564d5b0df5]
2016-01-05T20:32:24.713 INFO:tasks.ceph.osd.3.smithi024.stderr: 6: (clone()+0x6d) [0x7f564be591ad]
2016-01-05T20:32:24.714 INFO:tasks.ceph.osd.3.smithi024.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #1

Updated by Samuel Just over 8 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Sage Weil about 8 years ago

  • Priority changed from Urgent to High
Actions #3

Updated by Kefu Chai about 8 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Haomai Wang
Actions #4

Updated by Greg Farnum about 8 years ago

So....I went to dig into this and try to see how we hit the bug — Haomai is correct in that PR that we can get into this state in a very specific connection failure condition, but I don't think that's the bug we hit in this particular case.

Unfortunately, the daemon logs have been cleaned up on teuthology. I don't suppose anybody has a copy?

Actions #5

Updated by Greg Farnum about 8 years ago

Transferring some of the discussion from the PR:

gregsfortytwo commented
Can you explain how we get into this situation?
I think if we get into that block we aren't allowed to accept the connection as it's been explicitly marked down and the other side hasn't acknowledged that yet.

yuyuyu101 commented
@gregsfortytwo if osd side firstly accept one connection and register pipe, but failed at https://github.com/ceph/ceph/blob/master/src/msg/simple/Pipe.cc#L732 . while this connection isn't a lossy connection, it won't unregister pipe. Then we accept a new connection and lookup this pipe, we will go through this condition

gregsfortytwo commented
This is still in my queue for deeper review and I'm planning to go through a bunch of PRs today so I hope to reach it.
But my instinct without digging too far into it, is that this is too broad and violates our mark_down rules. If we've got potential failures during connection setup that result in this state, we probably need to handle them there. Don't take that as gospel, but off-hand I'm pretty confident. :)

yuyuyu101 commented
@gregsfortytwo yes, I agree these changes like a dirty fix. The main problem is I don't have brave to refactor too much in Pipe, it's really a adventure... If we want to clean these unexpected state changes, the lock usage need to be refactor but it is so disordered now!

Actions #6

Updated by Greg Farnum about 8 years ago

Okay, so to reach this point, we know:
  • connect.global_seq >= existing->peer_global_seq
  • it's not lossy (well, duh, I'm pretty sure this is an OSD<->OSD connection)
  • the peer did not reset
  • the peer has the same connect_seq we do
  • we are not in STATE_OPEN or STATE_STANDBY

Note that in Pipe::accept(), if we are replacing another Pipe, we go into STATE_STANDBY, not STATE_CLOSED, and in this bug we have cseq 138 — so we didn't hit a tcp error during initial setup and then go to STATE_CLOSED. [1]

What this all says to me is that the other side is doing things it's not allowed to do and we are responding appropriately. :/ Given the OSD thrashing happening, and the cseq, and that it looks as though this is the pipe to an OSD which the thrasher had just marked down...my guess is that we either have a new SimpleMessenger bug in mark_down handling, or that the AsyncMessenger isn't handling it properly in some way. (I think we've fixed several subtle SimpleMessenger bugs since the AsyncMessenger took off its code, so it isn't necessarily a newly-introduced one.)

[1]: And if we did hit a tcp read/write error during initial setup, the other side will as well, and do a fault() with all the seq changes that implies. I don't think we have a problem there.

Actions #7

Updated by Greg Farnum about 8 years ago

...hang on, I thought this test was run with random messenger types but I can't find any evidence of that now. So it must be the SimpleMessenger misbehaving on the other end.

I wish we still had the OSD logs. :( :( :(

Actions #8

Updated by Greg Farnum about 8 years ago

  • Status changed from Fix Under Review to 12
  • Assignee deleted (Haomai Wang)
Actions #9

Updated by Haomai Wang about 8 years ago

@Greg Farnum I missed the last reply..

I'm not sure you noticed this log, it looks which is I'm missing:
2016-01-05T20:32:24.698 INFO:tasks.ceph.osd.3.smithi024.stderr:2016-01-05 23:32:24.695849 7f5626130700 1 - 172.21.15.24:6805/9648 >> 172.21.15.11:6807/25321 pipe(0x7f565b78d000 sd=181 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f565a3a3a80).accept race bad state, would send wait, existing=closing 0x7f565d4d2000.cseq=138 138

existing->state closing... which shouldn't existed...

It's a bit flip?

Actions #10

Updated by Greg Farnum almost 8 years ago

@Haomai, oh, now I did too. I don't have this paged into my brain any more, can you expand on "which shouldn't have existed"?

Actions #11

Updated by Haomai Wang almost 8 years ago

because no code with make pipe state == STATE_CLOSING. Plz help to check code if I'm wrong

Actions #12

Updated by Greg Farnum about 7 years ago

  • Status changed from 12 to Can't reproduce
Actions

Also available in: Atom PDF