Bug #8232 (closed)

Race condition during messenger rebind

Added by Guang Yang about 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: High
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

When the system is under high load, we observed the following assertion failure:
-----------------------------------------------
-11> 2014-04-24 11:16:13.124561 7ff4ecd78700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.177:6849/27341 pipe(0xc12d180 sd=55 :52037 s=2 pgs=1734 cs=11 l=0 c=0x13cf62c0).fault, initiating reconnect
-10> 2014-04-24 11:16:13.124613 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8840 -- pipe dne
 -9> 2014-04-24 11:16:13.124632 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8580 -- pipe dne
 -8> 2014-04-24 11:16:13.124629 7ff4e600b700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.186:6809/5357 pipe(0xc007b80 sd=21 :39223 s=2 pgs=1693 cs=15 l=0 c=0x13fdfa20).fault, initiating reconnect
 -7> 2014-04-24 11:16:13.124624 7ff4cd784700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7ff4cd784700 time 2014-04-24 11:16:13.096027
msg/Pipe.cc: 1045: FAILED assert(m)

 ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
 1: (Pipe::connect()+0x3cb5) [0x992435]
 2: (Pipe::writer()+0x5f3) [0x992d83]
 3: (Pipe::Writer::entry()+0xd) [0x99ddfd]
 4: /lib64/libpthread.so.0() [0x3fb1807851]
 5: (clone()+0x6d) [0x3fb14e890d]
------------------------------------------------------

There was a previous issue tracking this: http://tracker.ceph.com/issues/6992.

After checking the related code, I think there is still a race condition that
could lead to the assertion failure we observed. The racy flow is:
1. The dispatcher thread detects an OSD map change and therefore triggers a
local rebind (SimpleMessenger::rebind).
2. Within the rebind method, it first stops the accepter (this is
asynchronous!) and then marks down all related local pipes and connections.
3. It then lets the accepter rebind (creating another thread bound to a
different address).

The race can happen at step #2: because stopping the accepter is
asynchronous, there is a window after the dispatcher thread has marked down
all pipes in which the accepter thread has not yet stopped and can still put
items into the pipe queue. The higher the system load, the more likely this
race is to occur.
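
To make the window concrete, here is a minimal, self-contained C++ sketch of
the pattern. This is a toy model, not the Ceph source: pipe_queue,
queue_lock, stop_requested, and accepter() are all made-up names. The point
is that the dispatcher only *requests* the stop before clearing the queue,
so a late enqueue can land after the mark-down.

#include <atomic>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex queue_lock;
std::queue<int> pipe_queue;            // stand-in for the messenger's pipe queue
std::atomic<bool> stop_requested{false};

// Accepter thread: keeps "accepting connections" (enqueuing) until it
// notices the stop request. The stop is only a request, so the thread may
// still be mid-iteration when the dispatcher moves on.
void accepter() {
    while (!stop_requested.load()) {
        std::lock_guard<std::mutex> g(queue_lock);
        pipe_queue.push(1);            // can run AFTER the dispatcher's mark-down
    }
}

int main() {
    std::thread t(accepter);

    // Dispatcher side of the rebind, matching steps 2-3 above:
    stop_requested = true;             // step 2a: stop the accepter -- asynchronous!
    {
        std::lock_guard<std::mutex> g(queue_lock);
        while (!pipe_queue.empty()) pipe_queue.pop();  // step 2b: mark down all pipes
    }
    // Step 3 (rebind to a new address) would happen here, but the accepter
    // thread may not have exited yet, so a stale entry can still appear.

    t.join();
    std::cout << "entries left after mark-down: " << pipe_queue.size() << "\n";
}

Depending on scheduling, the final count can be nonzero, which matches the
observation that the failure shows up mostly on busy systems.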

A simple fix for the issue is to mark down all pipes from within the
accepter thread itself, both before and after its accept loop.
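
In the same toy model (reusing the queue, lock, and flag from the sketch
above; again hypothetical names, not the actual Ceph patch), the fix moves
the mark-down onto the accepter thread itself:

// Fix sketch: the accepter thread performs the mark-down itself, before and
// after its loop. Because the final mark-down runs on the same thread as the
// enqueues, no enqueue can be ordered after it.
void mark_down_all_pipes() {
    std::lock_guard<std::mutex> g(queue_lock);
    while (!pipe_queue.empty()) pipe_queue.pop();
}

void accepter_fixed() {
    mark_down_all_pipes();             // before the loop: drop stale entries
    while (!stop_requested.load()) {
        std::lock_guard<std::mutex> g(queue_lock);
        pipe_queue.push(1);
    }
    mark_down_all_pipes();             // after the loop: nothing can race with this
}

The dispatcher then only needs to request the stop and join the accepter
thread before rebinding; the ordering guarantee comes from doing the cleanup
on the same thread that produces the entries.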


Files

msgr_crash.log (1020 KB): OSD log before the crash. Guang Yang, 05/06/2014 08:03 PM

Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #8497: OSDs stats getting DOWN, OUT, when i starts putting data to cluster (Duplicate, 06/02/2014)
