Bug #8232 (closed)

Race condition during messenger rebind

Added by Guang Yang about 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: High
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Severity: 3 - minor

Description

When the system is under high load, we observed the following assertion failure:
-----------------------------------------------
-11> 2014-04-24 11:16:13.124561 7ff4ecd78700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.177:6849/27341 pipe(0xc12d180 sd=55 :52037 s=2 pgs=1734 cs=11 l=0 c=0x13cf62c0).fault, initiating reconnect
-10> 2014-04-24 11:16:13.124613 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8840 -- pipe dne
 -9> 2014-04-24 11:16:13.124632 7ff4fe1a2700  1 -- 10.193.207.178:0/48024 mark_down 0xb0b8580 -- pipe dne
 -8> 2014-04-24 11:16:13.124629 7ff4e600b700  0 -- 10.193.207.178:6909/1048024 >> 10.193.207.186:6809/5357 pipe(0xc007b80 sd=21 :39223 s=2 pgs=1693 cs=15 l=0 c=0x13fdfa20).fault, initiating reconnect
 -7> 2014-04-24 11:16:13.124624 7ff4cd784700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7ff4cd784700 time 2014-04-24 11:16:13.096027
msg/Pipe.cc: 1045: FAILED assert(m)

 ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
 1: (Pipe::connect()+0x3cb5) [0x992435]
 2: (Pipe::writer()+0x5f3) [0x992d83]
 3: (Pipe::Writer::entry()+0xd) [0x99ddfd]
 4: /lib64/libpthread.so.0() [0x3fb1807851]
 5: (clone()+0x6d) [0x3fb14e890d]
------------------------------------------------------

There was a previous issue tracking this: http://tracker.ceph.com/issues/6992.

After checking the related code, I think there is still a race condition that
could lead to the assertion failure we observed. The racy flow is:
1. The dispatcher thread detects an OSD map change and therefore triggers a
local rebind (SimpleMessenger::rebind).
2. Within the rebind method, it first stops the accepter (this is
asynchronous!) and then marks down all related local pipes and connections.
3. It then lets the accepter rebind (creating another thread bound to a
different address).

The race can happen at step #2: because stopping the accepter is
asynchronous, there is a window after the dispatcher thread has marked down
all pipes in which the accepter thread has not yet stopped and can still put
items into the pipe queue. The higher the system load, the more likely this
race is to occur.
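
To make the window concrete, here is a minimal, self-contained C++ sketch of
the pattern. This is a toy model, not the Ceph source: pipe_queue,
queue_lock, stop_requested, and accepter() are all made-up names. The point
is that the dispatcher only *requests* the stop before clearing the queue,
so a late enqueue can land after the mark-down.

#include <atomic>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex queue_lock;
std::queue<int> pipe_queue;            // stand-in for the messenger's pipe queue
std::atomic<bool> stop_requested{false};

// Accepter thread: keeps "accepting connections" (enqueuing) until it
// notices the stop request. The stop is only a request, so the thread may
// still be mid-iteration when the dispatcher moves on.
void accepter() {
    while (!stop_requested.load()) {
        std::lock_guard<std::mutex> g(queue_lock);
        pipe_queue.push(1);            // can run AFTER the dispatcher's mark-down
    }
}

int main() {
    std::thread t(accepter);

    // Dispatcher side of the rebind, matching steps 2-3 above:
    stop_requested = true;             // step 2a: stop the accepter -- asynchronous!
    {
        std::lock_guard<std::mutex> g(queue_lock);
        while (!pipe_queue.empty()) pipe_queue.pop();  // step 2b: mark down all pipes
    }
    // Step 3 (rebind to a new address) would happen here, but the accepter
    // thread may not have exited yet, so a stale entry can still appear.

    t.join();
    std::cout << "entries left after mark-down: " << pipe_queue.size() << "\n";
}

Depending on scheduling, the final count can be nonzero, which matches the
observation that the failure shows up mostly on busy systems.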

A simple fix for the issue is to mark down all pipes from within the
accepter thread itself, both before and after its accept loop.
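
In the same toy model (reusing the queue, lock, and flag from the sketch
above; again hypothetical names, not the actual Ceph patch), the fix moves
the mark-down onto the accepter thread itself:

// Fix sketch: the accepter thread performs the mark-down itself, before and
// after its loop. Because the final mark-down runs on the same thread as the
// enqueues, no enqueue can be ordered after it.
void mark_down_all_pipes() {
    std::lock_guard<std::mutex> g(queue_lock);
    while (!pipe_queue.empty()) pipe_queue.pop();
}

void accepter_fixed() {
    mark_down_all_pipes();             // before the loop: drop stale entries
    while (!stop_requested.load()) {
        std::lock_guard<std::mutex> g(queue_lock);
        pipe_queue.push(1);
    }
    mark_down_all_pipes();             // after the loop: nothing can race with this
}

The dispatcher then only needs to request the stop and join the accepter
thread before rebinding; the ordering guarantee comes from doing the cleanup
on the same thread that produces the entries.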


Files

msgr_crash.log (1020 KB): OSD log before the crash. Guang Yang, 05/06/2014 08:03 PM

Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #8497: OSDs stats getting DOWN, OUT, when i starts putting data to cluster (Duplicate, 06/02/2014)
