Bug #38524

AsyncConnection: segmentation fault

Added by Patrick Donnelly 4 months ago. Updated 3 months ago.

Status: Pending Backport
Priority: Urgent
Assignee: -
Category: AsyncMessenger
Target version:
Start date:
Due date:
% Done: 0%
Source:
Tags:
Backport: luminous,mimic
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

 ceph version 14.1.0-226-g95e025a (95e025a951c5dc39da24f635d362b7eb88407f65) nautilus (dev)
 1: (()+0xf5d0) [0x7f20098e95d0]
 2: (AsyncConnection::read_until(unsigned int, char*)+0x3f1) [0x7f200bcb67a1]
 3: (AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>)+0x4e) [0x7f200bcb6f5e]
 4: (ProtocolV1::read(CtFun<ProtocolV1, char*, int>*, int, char*)+0x5d) [0x7f200bcc731d]
 5: (ProtocolV1::wait_connect_message_auth()+0x5e) [0x7f200bcd550e]
 6: (ProtocolV1::handle_connect_message_1(char*, int)+0xa5) [0x7f200bce0be5]
 7: (ProtocolV1::read_event()+0x10c) [0x7f200bcc9c2c]
 8: (AsyncConnection::process()+0x424) [0x7f200bcb7ed4]
 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1017) [0x7f200bd09f87]
 10: (()+0x54d495) [0x7f200bd0e495]
 11: (()+0x7dc0ff) [0x7f200bf9d0ff]
 12: (()+0x7dd5) [0x7f20098e1dd5]
 13: (clone()+0x6d) [0x7f2008591ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

From: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/log/ceph-mds.c.log.gz

Core: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/coredump/1551374591.15562.core


Related issues

Copied to Messengers - Backport #38644: luminous: AsyncConnection: segmentation fault (Status: New)
Copied to Messengers - Backport #38645: mimic: AsyncConnection: segmentation fault (Status: New)

History

#1 Updated by Ricardo Dias 4 months ago

There's a strange behavior in the log just before the segfault.

The peer that is trying to connect to this MDS is always sending a bad `protocol_version`:

2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept of host_type 0, policy.lossy=1 policy.server=0 policy.standby=0 policy.resetcheck=0 features 0x800000000
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0

## next reply from peer

2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 33554433

## next reply from peer

2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0

I tried to check the log of the peer (v1:172.21.15.13:0/3427847655) but the logs directory of that machine was empty.
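
The `their proto` values in the log lines above are easier to compare at byte level. The snippet below is only a minimal sketch: it assumes the protocol version travels as a 32-bit little-endian integer (an assumption, not something stated in this ticket), and `describe` is a hypothetical helper name used for illustration.

```python
import struct

# "their proto" values copied from the log lines in this comment
observed = [0, 33554433, 0]
MY_PROTO = 32  # "my proto" from the same log lines


def describe(value):
    # Assumption: the field is a 32-bit little-endian integer on the wire.
    raw = struct.pack("<I", value)
    return f"{value} = 0x{value:08x}, wire bytes {raw.hex(' ')}"


for v in observed:
    status = "match" if v == MY_PROTO else "mismatch"
    print(f"{describe(v)} ({status})")
```

Under that assumption, 33554433 is 0x02000001, and none of the three values equals the local protocol version 32.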

#2 Updated by Patrick Donnelly 4 months ago

The same job failed in another run, so it looks reproducible: /ceph/teuthology-archive/pdonnell-2019-03-01_18:13:24-multimds-wip-pdonnell-testing-20190301.150213-distro-basic-smithi/3654808/teuthology.log

#3 Updated by Sage Weil 4 months ago

  • Project changed from Ceph to RADOS
  • Category deleted (msgr)

#4 Updated by Sage Weil 4 months ago

  • Status changed from New to In Progress

I think this will fix it, but we need to be able to reproduce the crash first in order to test the fix:

https://github.com/ceph/ceph/pull/26803

#5 Updated by Sage Weil 4 months ago

#6 Updated by Sage Weil 3 months ago

  • Status changed from In Progress to Pending Backport
  • Backport set to luminous,mimic

#7 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #38644: luminous: AsyncConnection: segmentation fault added

#8 Updated by Nathan Cutler 3 months ago

  • Copied to Backport #38645: mimic: AsyncConnection: segmentation fault added

#9 Updated by Greg Farnum 3 months ago

  • Project changed from RADOS to Messengers

#10 Updated by Greg Farnum 3 months ago

  • Category set to AsyncMessenger
