Project

General

Profile

Bug #36167

msg: infinite recursion while reading messages

Added by Patrick Donnelly 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
09/24/2018
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

(gdb) bt 20
#0  0x00007f80cb003d97 in ProtocolV1::handle_message_footer(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#1  0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#2  0x00007f80cb001403 in ProtocolV1::read_message_footer() () from /usr/lib/ceph/libceph-common.so.0
#3  0x00007f80cb001628 in ProtocolV1::read_message_data() () from /usr/lib/ceph/libceph-common.so.0
#4  0x00007f80cb001941 in ProtocolV1::read_message_data_prepare() () from /usr/lib/ceph/libceph-common.so.0
#5  0x00007f80cb001e68 in ProtocolV1::read_message_middle() () from /usr/lib/ceph/libceph-common.so.0
#6  0x00007f80cb003483 in ProtocolV1::handle_message_front(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#7  0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#8  0x00007f80cb00201b in ProtocolV1::read_message_front() () from /usr/lib/ceph/libceph-common.so.0
#9  0x00007f80cb002268 in ProtocolV1::throttle_dispatch_queue() () from /usr/lib/ceph/libceph-common.so.0
#10 0x00007f80cb0025ca in ProtocolV1::throttle_bytes() () from /usr/lib/ceph/libceph-common.so.0
#11 0x00007f80cb0029b1 in ProtocolV1::throttle_message() () from /usr/lib/ceph/libceph-common.so.0
#12 0x00007f80cb002fb4 in ProtocolV1::handle_message_header(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#13 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#14 0x00007f80cb000018 in ProtocolV1::handle_message(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#15 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#16 0x00007f80cafff684 in ProtocolV1::wait_message() () from /usr/lib/ceph/libceph-common.so.0
#17 0x00007f80cb004402 in ProtocolV1::handle_message_footer(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#18 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#19 0x00007f80cb001403 in ProtocolV1::read_message_footer() () from /usr/lib/ceph/libceph-common.so.0

(gdb) bt -30
#51961 0x00007f80cb002268 in ProtocolV1::throttle_dispatch_queue() () from /usr/lib/ceph/libceph-common.so.0
#51962 0x00007f80cb0025ca in ProtocolV1::throttle_bytes() () from /usr/lib/ceph/libceph-common.so.0
#51963 0x00007f80cb0029b1 in ProtocolV1::throttle_message() () from /usr/lib/ceph/libceph-common.so.0
#51964 0x00007f80cb002fb4 in ProtocolV1::handle_message_header(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51965 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51966 0x00007f80cb000018 in ProtocolV1::handle_message(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51967 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51968 0x00007f80cafff684 in ProtocolV1::wait_message() () from /usr/lib/ceph/libceph-common.so.0
#51969 0x00007f80cb004402 in ProtocolV1::handle_message_footer(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51970 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51971 0x00007f80cb001403 in ProtocolV1::read_message_footer() () from /usr/lib/ceph/libceph-common.so.0
#51972 0x00007f80cb001628 in ProtocolV1::read_message_data() () from /usr/lib/ceph/libceph-common.so.0
#51973 0x00007f80cb001941 in ProtocolV1::read_message_data_prepare() () from /usr/lib/ceph/libceph-common.so.0
#51974 0x00007f80cb001e68 in ProtocolV1::read_message_middle() () from /usr/lib/ceph/libceph-common.so.0
#51975 0x00007f80cb003483 in ProtocolV1::handle_message_front(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51976 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51977 0x00007f80cb00201b in ProtocolV1::read_message_front() () from /usr/lib/ceph/libceph-common.so.0
#51978 0x00007f80cb002268 in ProtocolV1::throttle_dispatch_queue() () from /usr/lib/ceph/libceph-common.so.0
#51979 0x00007f80cb0025ca in ProtocolV1::throttle_bytes() () from /usr/lib/ceph/libceph-common.so.0
#51980 0x00007f80cb0029b1 in ProtocolV1::throttle_message() () from /usr/lib/ceph/libceph-common.so.0
#51981 0x00007f80cb002fb4 in ProtocolV1::handle_message_header(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51982 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51983 0x00007f80cb000018 in ProtocolV1::handle_message(char*, int) () from /usr/lib/ceph/libceph-common.so.0
#51984 0x00007f80cafef14c in AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) () from /usr/lib/ceph/libceph-common.so.0
#51985 0x00007f80caff0148 in AsyncConnection::process() () from /usr/lib/ceph/libceph-common.so.0
#51986 0x00007f80cb020a75 in EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) () from /usr/lib/ceph/libceph-common.so.0
#51987 0x00007f80cb028fda in ?? () from /usr/lib/ceph/libceph-common.so.0
#51988 0x00007f80ca16a733 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#51989 0x00007f80ca6456db in start_thread (arg=0x7f80c4ef6700) at pthread_create.c:463
#51990 0x00007f80c982688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

MDS of course segfaulted.

Job: /ceph/teuthology-archive/pdonnell-2018-09-23_19:22:20-multimds-wip-pdonnell-testing-20180923.160923-distro-basic-smithi/3062099

Core: /ceph/teuthology-archive/pdonnell-2018-09-23_19:22:20-multimds-wip-pdonnell-testing-20180923.160923-distro-basic-smithi/3062099/remote/smithi082/coredump/1537811426.14800.core

History

#1 Updated by Patrick Donnelly 3 months ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

#2 Updated by Patrick Donnelly 3 months ago

Another: /ceph/teuthology-archive/pdonnell-2018-09-23_19:22:20-multimds-wip-pdonnell-testing-20180923.160923-distro-basic-smithi/3062102/remote/smithi085/coredump/1537811733.35246.core

#3 Updated by Ricardo Dias 3 months ago

  • Status changed from New to In Progress
  • Assignee set to Ricardo Dias

The problem is that there are always messages to read in the socket and the message handling is implemented has a sequence of recursive functions, which after some time will fill up the stack and causing a crash.

The previous implementation used a "while loop" with a switch inside and hence no stack overflow would happen.

I'll create a patch to fix this.

#4 Updated by Patrick Donnelly 3 months ago

Ricardo Dias wrote:

The problem is that there are always messages to read in the socket and the message handling is implemented has a sequence of recursive functions, which after some time will fill up the stack and causing a crash.

I'm looking at the code and don't see a termination condition for the recursive calls. In particular, ProtocolV1::handle_message_footer always calls ProtocolV1::wait_message which always reads data from the connection and calls back into ProtocolV1::handle_message, starting the whole stack over again. What am I missing?

#5 Updated by Ricardo Dias 2 months ago

Patrick Donnelly wrote:

I'm looking at the code and don't see a termination condition for the recursive calls. In particular, ProtocolV1::handle_message_footer always calls ProtocolV1::wait_message which always reads data from the connection and calls back into ProtocolV1::handle_message, starting the whole stack over again. What am I missing?

In every READ call, if the socket does not have enough bytes to read, control flow is broken and a callback is registered by the connection to call when all the bytes requested are available. Also, if the state has changed to something different than OPENED, the recursive cycle will also be broken when ProtocolV1::wait_message executes.

#6 Updated by Ricardo Dias 2 months ago

  • Status changed from In Progress to Need Review

#7 Updated by Ricardo Dias 2 months ago

@Patrick can you test the patch in the PR to see if it fixes the problem?

#8 Updated by Patrick Donnelly 2 months ago

Ricardo Dias wrote:

@Patrick can you test the patch in the PR to see if it fixes the problem?

ACK, will do

#9 Updated by Ilya Dryomov 2 months ago

FWIW this is the tip of the stack trace that I hit:

#0  0x00007fe545afa18e in _IO_vfprintf_internal (s=s@entry=0x7fe542ee01a0, format=format@entry=0x7fe545c3ff20 <fmt> "%u.%u.%u.%u", ap=ap@entry=0x7fe542ee02c8) at vfprintf.c:1267
#1  0x00007fe545b1d1cb in __IO_vsprintf (string=0x7fe542ee0410 "", format=0x7fe545c3ff20 <fmt> "%u.%u.%u.%u", args=args@entry=0x7fe542ee02c8) at iovsprintf.c:42
#2  0x00007fe545b029c7 in __sprintf (s=s@entry=0x7fe542ee0410 "", format=format@entry=0x7fe545c3ff20 <fmt> "%u.%u.%u.%u") at sprintf.c:32
#3  0x00007fe545bd260f in inet_ntop4 (size=1025, dst=0x7fe542ee0ad0 "", src=0x7fe542ee0414 "") at inet_ntop.c:92
#4  __GI_inet_ntop (af=af@entry=2, src=src@entry=0xb58c0dc, dst=dst@entry=0x7fe542ee0ad0 "", size=size@entry=1025) at inet_ntop.c:63
#5  0x00007fe545bce65e in gni_host_inet_numeric (tmpbuf=0x7fe542ee0640, addrlen=<optimized out>, flags=3, hostlen=1025, host=0x7fe542ee0ad0 "", sa=0xb58c0d8) at getnameinfo.c:369
#6  gni_host_inet (addrlen=<optimized out>, flags=3, hostlen=1025, host=0x7fe542ee0ad0 "", sa=0xb58c0d8, tmpbuf=0x7fe542ee0640) at getnameinfo.c:392
#7  gni_host (addrlen=<optimized out>, flags=3, hostlen=1025, host=0x7fe542ee0ad0 "", sa=0xb58c0d8, tmpbuf=0x7fe542ee0640) at getnameinfo.c:425
#8  __GI_getnameinfo (sa=sa@entry=0xb58c0d8, addrlen=<optimized out>, host=host@entry=0x7fe542ee0ad0 "", hostlen=hostlen@entry=1025, serv=serv@entry=0x7fe542ee0ab0 "", servlen=servlen@entry=32, flags=3) at getnameinfo.c:542
#9  0x000000000111e28a in operator<< (out=..., sa=sa@entry=0xb58c0d8) at /build/ceph-14.0.0-3768-g49f4614/src/msg/msg_types.cc:214
#10 0x000000000111e57f in operator<< (out=..., addr=...) at /build/ceph-14.0.0-3768-g49f4614/src/msg/msg_types.cc:178
#11 0x0000000001283b10 in operator<< (av=..., out=...) at /build/ceph-14.0.0-3768-g49f4614/src/msg/msg_types.h:619
#12 DispatchQueue::pre_dispatch (this=this@entry=0xc0aaa00, m=...) at /build/ceph-14.0.0-3768-g49f4614/src/msg/DispatchQueue.cc:41
#13 0x0000000001285d44 in DispatchQueue::fast_dispatch (this=0xc0aaa00, m=...) at /build/ceph-14.0.0-3768-g49f4614/src/msg/DispatchQueue.cc:70
#14 0x00000000012c1065 in DispatchQueue::fast_dispatch (m=0xed323c0, this=<optimized out>) at /build/ceph-14.0.0-3768-g49f4614/src/msg/DispatchQueue.h:204
#15 ProtocolV1::handle_message_footer (this=0xebd7c00, buffer=<optimized out>, r=<optimized out>) at /build/ceph-14.0.0-3768-g49f4614/src/msg/async/Protocol.cc:1000
#16 0x00000000012b616c in std::function<void (char*, long)>::operator()(char*, long) const (__args#1=<optimized out>, __args#0=<optimized out>, this=0x7fe542ee13c0) at /usr/include/c++/7/bits/std_function.h:706
#17 AsyncConnection::read(unsigned int, char*, std::function<void (char*, long)>) (this=this@entry=0xc570300, len=len@entry=21, buffer=0xce6f000 "%\262\337\352", callback=...) at /build/ceph-14.0.0-3768-g49f4614/src/msg/async/AsyncConnection.cc:176

I was looking at it in a small terminal pane, didn't see (didn't bother to check) anything past frame 17 out of 64265 and as a result spent quite some time trying to figure out why getnameinfo() would segfault on a perfectly valid AF_INET sockaddr...

#10 Updated by Sage Weil 2 months ago

  • Status changed from Need Review to Resolved
  • Backport deleted (mimic,luminous)

no backport needed.. this bug is in the protocol refactor for nautilus

Also available in: Atom PDF