Bug #10080


Pipe::connect() causes an OSD crash when the OSD reconnects to its peer

Added by Wenjun Huang over 9 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Source: Community (user)
Backport: giant, firefly
Regression: No
Severity: 3 - minor

Description

When our cluster load is heavy, the OSD sometimes crashes. The critical log is as follows:

-278> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
-277> 2014-08-20 11:04:28.609783 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). got newly_acked_seq 546 vs out_seq 0
-276> 2014-08-20 11:04:28.609810 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 1 osd_map(727..755 src has 1..755) v3
-275> 2014-08-20 11:04:28.609859 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 2 pg_notify(1.2b(22),2.2c(23) epoch 755) v5

2014-08-20 11:04:28.608141 7f89629bb700 0 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=134 :6816 s=2 pgs=236754 cs=3 l=0 c=0x44318c0).fault, initiating reconnect
2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
2014-08-20 11:04:28.666895 7f89636c8700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f89636c8700 time 2014-08-20 11:04:28.618536
msg/Pipe.cc: 1080: FAILED assert(m)
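
For context, the assertion fires in the reconnect path of Pipe::connect(). A paraphrased sketch of the discard loop around msg/Pipe.cc:1080 (reconstructed from the giant-era source, not an exact quote) looks roughly like this:

    // Sketch of the discard loop in Pipe::connect() (msg/Pipe.cc, giant era).
    // After a reconnect, the peer reports the highest sequence number it has
    // already received (newly_acked_seq); the local side then drops every
    // queued message up to that point.
    while (newly_acked_seq > out_seq) {
      Message *m = _get_next_outgoing();  // pop the next queued message
      assert(m);                          // <-- the FAILED assert(m) in the log
      ldout(msgr->cct, 2) << " discarding previously sent " << m->get_seq()
                          << " " << *m << dendl;
      m->put();
      ++out_seq;
    }

With out_seq at 0 and newly_acked_seq at 546, the loop tries to discard 546 messages, but the outgoing queue holds only a couple (the two "discarding previously sent" lines above), so _get_next_outgoing() eventually returns NULL and the assert fires.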

Looking at the log, we can see that out_seq is 0. Since our cluster has cephx authorization enabled, the source code tells me that out_seq should have been initialized to a random number, so there must be a bug in the source code.
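
For reference, the randomization the report refers to lives in Pipe::randomize_out_seq(); a simplified sketch (again reconstructed from the source of that era, assuming the CEPH_FEATURE_MSG_AUTH feature gate) is:

    // Simplified sketch of Pipe::randomize_out_seq() (msg/Pipe.cc).
    // When the peer supports message signing (cephx), out_seq starts at a
    // random value so the CRC stream is not predictable; otherwise it is 0.
    void Pipe::randomize_out_seq()
    {
      if (connection_state->get_features() & CEPH_FEATURE_MSG_AUTH) {
        uint64_t rand_seq;
        get_random_bytes((char *)&rand_seq, sizeof(rand_seq));
        out_seq = rand_seq & SEQ_MASK;  // keep it within the valid seq range
      } else {
        out_seq = 0;                    // legacy behavior: always start at 0
      }
    }

An out_seq of 0 on a cs=4 reconnect therefore suggests the value was reset somewhere along the reconnect path rather than preserved from its randomized starting point.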

We hit this crash almost every time our cluster load is heavy, so I consider it a critical bug for Ceph.
