Bug #38524
AsyncConnection: segmentation fault
Description
ceph version 14.1.0-226-g95e025a (95e025a951c5dc39da24f635d362b7eb88407f65) nautilus (dev)
 1: (()+0xf5d0) [0x7f20098e95d0]
 2: (AsyncConnection::read_until(unsigned int, char*)+0x3f1) [0x7f200bcb67a1]
 3: (AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>)+0x4e) [0x7f200bcb6f5e]
 4: (ProtocolV1::read(CtFun<ProtocolV1, char*, int>*, int, char*)+0x5d) [0x7f200bcc731d]
 5: (ProtocolV1::wait_connect_message_auth()+0x5e) [0x7f200bcd550e]
 6: (ProtocolV1::handle_connect_message_1(char*, int)+0xa5) [0x7f200bce0be5]
 7: (ProtocolV1::read_event()+0x10c) [0x7f200bcc9c2c]
 8: (AsyncConnection::process()+0x424) [0x7f200bcb7ed4]
 9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1017) [0x7f200bd09f87]
 10: (()+0x54d495) [0x7f200bd0e495]
 11: (()+0x7dc0ff) [0x7f200bf9d0ff]
 12: (()+0x7dd5) [0x7f20098e1dd5]
 13: (clone()+0x6d) [0x7f2008591ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
From: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/log/ceph-mds.c.log.gz
Core: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/coredump/1551374591.15562.core
Updated by Ricardo Dias about 5 years ago
There's a strange behavior in the log just before the segfault.
The peer that is trying to connect to this MDS is always sending a bad `protocol_version`:
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept of host_type 0, policy.lossy=1 policy.server=0 policy.standby=0 policy.resetcheck=0 features 0x800000000
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0
## next reply from peer
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 33554433
## next reply from peer
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0
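To make the odd "their proto" values easier to read, here is a small sketch that prints each one in hex and as raw bytes. It assumes the field is a 32-bit little-endian integer on the wire; that layout is an assumption of this sketch, not something the log confirms, and `describe_proto` is a hypothetical helper name. It does not explain the crash. It only shows that 33554433 is 0x02000001, which looks nothing like the expected value of 32.

```python
import struct

MY_PROTO = 32  # "my proto" from the log lines above


def describe_proto(value):
    """Return (hex string, space-separated wire bytes) for a proto value.

    Assumption: the value travels as a 32-bit little-endian integer.
    """
    raw = struct.pack("<I", value)
    return f"0x{value:08x}", " ".join(f"{b:02x}" for b in raw)


# "their proto" values copied from the log excerpt, in order of arrival.
for their_proto in (0, 33554433, 0):
    hex_form, wire_bytes = describe_proto(their_proto)
    match = "match" if their_proto == MY_PROTO else "mismatch"
    print(f"{their_proto:>10}  {hex_form}  [{wire_bytes}]  {match}")
```

None of the three values matches the local protocol version, so every attempt in the excerpt is a mismatch.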
I tried to check the log of the peer (v1:172.21.15.13:0/3427847655) but the logs directory of that machine was empty.
Updated by Patrick Donnelly about 5 years ago
Same job failed in another run, so it looks reproducible: /ceph/teuthology-archive/pdonnell-2019-03-01_18:13:24-multimds-wip-pdonnell-testing-20190301.150213-distro-basic-smithi/3654808/teuthology.log
Updated by Sage Weil about 5 years ago
- Project changed from Ceph to RADOS
- Category deleted (msgr)
Updated by Sage Weil about 5 years ago
- Status changed from New to In Progress
I think this will fix it, but we need to be able to reproduce the crash first in order to test the fix...
Updated by Sage Weil about 5 years ago
trying to reproduce: http://pulpito.ceph.com/sage-38524-a/
Updated by Sage Weil about 5 years ago
- Status changed from In Progress to Pending Backport
- Backport set to luminous,mimic
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #38644: luminous: AsyncConnection: segmentation fault added
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #38645: mimic: AsyncConnection: segmentation fault added
Updated by Greg Farnum about 5 years ago
- Project changed from RADOS to Messengers
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".