Bug #10080


Pipe::connect() causes an OSD crash when the OSD reconnects to its peer

Added by Wenjun Huang over 9 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
% Done: 0%
Source: Community (user)
Backport: giant, firefly
Regression: No
Severity: 3 - minor

Description

When our cluster load is heavy, the OSD sometimes crashes. The critical log is as follows:

-278> 2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
-277> 2014-08-20 11:04:28.609783 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). got newly_acked_seq 546 vs out_seq 0
-276> 2014-08-20 11:04:28.609810 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 1 osd_map(727..755 src has 1..755) v3
-275> 2014-08-20 11:04:28.609859 7f89636c8700 2 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=105 :42657 s=1 pgs=236754 cs=4 l=0 c=0x44318c0). discarding previously sent 2 pg_notify(1.2b(22),2.2c(23) epoch 755) v5

2014-08-20 11:04:28.608141 7f89629bb700 0 -- 10.193.207.117:6816/44281 >> 10.193.207.125:6804/2022817 pipe(0x7ef2280 sd=134 :6816 s=2 pgs=236754 cs=3 l=0 c=0x44318c0).fault, initiating reconnect
2014-08-20 11:04:28.609192 7f89636c8700 10 osd.11 783 OSD::ms_get_authorizer type=osd
2014-08-20 11:04:28.666895 7f89636c8700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7f89636c8700 time 2014-08-20 11:04:28.618536
msg/Pipe.cc: 1080: FAILED assert(m)
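
For context, the assertion fires in the reconnect path of Pipe::connect(). A paraphrased sketch of the discard loop around msg/Pipe.cc:1080 (reconstructed from the giant-era source, not an exact quote) looks roughly like this:

    // Sketch of the discard loop in Pipe::connect() (msg/Pipe.cc, giant era).
    // After a reconnect, the peer reports the highest sequence number it has
    // already received (newly_acked_seq); the local side then drops every
    // queued message up to that point.
    while (newly_acked_seq > out_seq) {
      Message *m = _get_next_outgoing();  // pop the next queued message
      assert(m);                          // <-- the FAILED assert(m) in the log
      ldout(msgr->cct, 2) << " discarding previously sent " << m->get_seq()
                          << " " << *m << dendl;
      m->put();
      ++out_seq;
    }

With out_seq at 0 and newly_acked_seq at 546, the loop tries to discard 546 messages, but the outgoing queue holds only a couple (the two "discarding previously sent" lines above), so _get_next_outgoing() eventually returns NULL and the assert fires.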

Looking at the log, we can see that out_seq is 0. Since our cluster has cephx authorization enabled, the source code tells me that out_seq should have been initialized to a random number, so there must be a bug in the source code.
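
For reference, the randomization the report refers to lives in Pipe::randomize_out_seq(); a simplified sketch (again reconstructed from the source of that era, assuming the CEPH_FEATURE_MSG_AUTH feature gate) is:

    // Simplified sketch of Pipe::randomize_out_seq() (msg/Pipe.cc).
    // When the peer supports message signing (cephx), out_seq starts at a
    // random value so the CRC stream is not predictable; otherwise it is 0.
    void Pipe::randomize_out_seq()
    {
      if (connection_state->get_features() & CEPH_FEATURE_MSG_AUTH) {
        uint64_t rand_seq;
        get_random_bytes((char *)&rand_seq, sizeof(rand_seq));
        out_seq = rand_seq & SEQ_MASK;  // keep it within the valid seq range
      } else {
        out_seq = 0;                    // legacy behavior: always start at 0
      }
    }

An out_seq of 0 on a cs=4 reconnect therefore suggests the value was reset somewhere along the reconnect path rather than preserved from its randomized starting point.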

We hit this crash almost every time our cluster load is heavy, so I consider it a critical bug for Ceph.
