Bug #38524
closed
AsyncConnection: segmentation fault
Added by Patrick Donnelly about 5 years ago.
Updated over 3 years ago.
Description
ceph version 14.1.0-226-g95e025a (95e025a951c5dc39da24f635d362b7eb88407f65) nautilus (dev)
1: (()+0xf5d0) [0x7f20098e95d0]
2: (AsyncConnection::read_until(unsigned int, char*)+0x3f1) [0x7f200bcb67a1]
3: (AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>)+0x4e) [0x7f200bcb6f5e]
4: (ProtocolV1::read(CtFun<ProtocolV1, char*, int>*, int, char*)+0x5d) [0x7f200bcc731d]
5: (ProtocolV1::wait_connect_message_auth()+0x5e) [0x7f200bcd550e]
6: (ProtocolV1::handle_connect_message_1(char*, int)+0xa5) [0x7f200bce0be5]
7: (ProtocolV1::read_event()+0x10c) [0x7f200bcc9c2c]
8: (AsyncConnection::process()+0x424) [0x7f200bcb7ed4]
9: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x1017) [0x7f200bd09f87]
10: (()+0x54d495) [0x7f200bd0e495]
11: (()+0x7dc0ff) [0x7f200bf9d0ff]
12: (()+0x7dd5) [0x7f20098e1dd5]
13: (clone()+0x6d) [0x7f2008591ead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
From: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/log/ceph-mds.c.log.gz
Core: /ceph/teuthology-archive/pdonnell-2019-02-28_08:49:12-multimds-wip-pdonnell-testing-20190228.054239-distro-basic-smithi/3649663/remote/smithi200/coredump/1551374591.15562.core
There's some strange behavior in the log just before the segfault.
The peer that is trying to connect to this MDS keeps sending a bad `protocol_version`:
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept of host_type 0, policy.lossy=1 policy.server=0 policy.standby=0 policy.resetcheck=0 features 0x800000000
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0
## next reply from peer
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 33554433
## next reply from peer
2019-02-28 17:23:11.836 7f20057a8700 10 --1- [v2:172.21.15.200:6826/1397651624,v1:172.21.15.200:6827/1397651624] >> v1:172.21.15.13:0/3427847655 conn(0x557747a95000 0x557746bf4100 :6827 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_message_2 accept my proto 32, their proto 0
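As a side note (my own observation, not taken from the log): the bogus value 33554433 is 0x02000001 in hex, which is nowhere near the expected small version number (my proto 32), and the peer alternating between 0 and 0x02000001 is consistent with the connect-message field being read from uninitialized or misaligned memory. A quick sketch to inspect the value:

```python
# Decompose the bogus peer protocol version from the log.
# The hex and byte views are just an aid for spotting corruption
# patterns; this is my own analysis, not from the original report.
import struct

bogus = 33554433
print(hex(bogus))                      # 0x2000001
print(struct.pack("<I", bogus).hex())  # 01000002 (little-endian byte layout)
```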
I tried to check the log of the peer (v1:172.21.15.13:0/3427847655) but the logs directory of that machine was empty.
The same job failed in another run, so the issue looks reproducible: /ceph/teuthology-archive/pdonnell-2019-03-01_18:13:24-multimds-wip-pdonnell-testing-20190301.150213-distro-basic-smithi/3654808/teuthology.log
- Project changed from Ceph to RADOS
- Category deleted (msgr)
- Status changed from New to In Progress
- Status changed from In Progress to Pending Backport
- Backport set to luminous,mimic
- Copied to Backport #38644: luminous: AsyncConnection: segmentation fault added
- Copied to Backport #38645: mimic: AsyncConnection: segmentation fault added
- Project changed from RADOS to Messengers
- Category set to AsyncMessenger
- Pull request ID set to 26803
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".