Bug #13985

closed

Messenger: large message data content length cause int overflow

Added by huang jun over 8 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

OSD aborted when sending message:
-1103> 2015-12-04 16:12:23.691914 7f9282bfe700  1 -- 172.19.10.1:6801/186389 --> 172.19.10.1:6800/11348 -- osd_op_reply(5959811 100000003ec.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent (125)] v0'0 uv49527 ondisk = 0) v6 -- ?+0 0x7f927a456600 con 0x7f9298825180
-1102> 2015-12-04 16:12:23.691968 7f9282bfe700 20 -- 172.19.10.1:6801/186389 submit_message osd_op_reply(5959811 100000003ec.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent (125)] v0'0 uv49527 ondisk = 0) v6 remote, 172.19.10.1:6800/11348, have pipe.
-1100> 2015-12-04 16:12:23.692073 7f9293df6700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).writer: state = open policy.server=1
-1099> 2015-12-04 16:12:23.692099 7f9293df6700 20 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).writer encoding 1 features 1125899906842623 0x7f927a456600 osd_op_reply(5959811 100000003ec.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent (125)] v0'0 uv49527 ondisk = 0) v6
2015-12-04 16:12:26.840045 7f9293df6700 20 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).writer signed seq # 1): sig = 0
2015-12-04 16:12:27.661121 7f9293df6700 20 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).writer sending 1 0x7f927a456600
2015-12-04 16:12:27.661125 7f9282bfe700 20 -- 172.19.10.1:6801/186389 submit_message osd_op_reply(5959824 100002b8ee5.00000000 [getxattr parent (169)] v0'0 uv30057 ondisk = 0) v6 remote, 172.19.10.1:6800/11348, have pipe.
2015-12-04 16:12:27.661176 7f9293ef7700 2 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).reader couldn't read tag, (0) Success
2015-12-04 16:12:27.661200 7f9293ef7700 2 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).fault (0) Success
2015-12-04 16:12:27.661238 7f9293ef7700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).fault on lossy channel, failing
2015-12-04 16:12:27.661249 7f9293ef7700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=2 pgs=4363 cs=1 l=1 c=0x7f9298825180).stop
2015-12-04 16:12:27.661258 7f9293ef7700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=4 pgs=4363 cs=1 l=1 c=0x7f9298825180).unregister_pipe
2015-12-04 16:12:27.661264 7f9293ef7700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=4 pgs=4363 cs=1 l=1 c=0x7f9298825180).discard_queue
2015-12-04 16:12:27.661268 7f9293ef7700 20 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=4 pgs=4363 cs=1 l=1 c=0x7f9298825180). discard 0x7f927a456c00
2015-12-04 16:12:27.661286 7f9293ef7700 10 -- 172.19.10.1:6801/186389 >> 172.19.10.1:6800/11348 pipe(0x7f9298820000 sd=131 :6801 s=4 pgs=4363 cs=1 l=1 c=0x7f9298825180).reader done
2015-12-04 18:07:53.709777 7f20b2775700 20 -- 172.19.10.1:6841/119941 >> 172.19.10.2:6800/11446 pipe(0x19616000 sd=158 :6841 s=2 pgs=655 cs=1 l=1 c=0x194b0d60).writer sending 1 0x1a009180 blist 2178765965
2015-12-04 18:07:53.709809 7f20b2775700 0 -- 172.19.10.1:6841/119941 >> 172.19.10.2:6800/11446 pipe(0x19616000 sd=158 :6841 s=2 pgs=655 cs=1 l=1 c=0x194b0d60). left -2116201331
2015-12-04 16:12:27.662109 7f9293df6700 -1 msg/simple/Pipe.cc: In function 'int Pipe::write_message(ceph_msg_header&, ceph_msg_footer&, ceph::bufferlist&)' thread 7f9293df6700 time 2015-12-04 16:12:27.661148
msg/simple/Pipe.cc: 2327: FAILED assert(left == 0)

After installing the ceph-debuginfo package, I reproduced the bug; the gdb stack trace is below:
(gdb) bt
#0  0x000000315f60f5db in raise () from /lib64/libpthread.so.0
#1  0x0000000000a48906 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3  <signal handler called>
#4  0x000000315f232925 in raise () from /lib64/libc.so.6
#5  0x000000315f234105 in abort () from /lib64/libc.so.6
#6  0x00000031616bea5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7  0x00000031616bcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#8  0x00000031616bcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9  0x00000031616bcd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x0000000000accfda in ceph::__ceph_assert_fail (assertion=0x7f927fc67980 "\353JaV\273f\331-", file=<value optimized out>,
    line=-729399040, func=0xda59e0 "int Pipe::write_message(ceph_msg_header&, ceph_msg_footer&, ceph::bufferlist&)")
    at common/assert.cc:77
#11 0x0000000000c2142c in Pipe::write_message (this=0x7f9298820000, header=<value optimized out>, footer=..., blist=...)
    at msg/simple/Pipe.cc:2327
#12 0x0000000000c37a35 in Pipe::writer (this=0x7f9298820000) at msg/simple/Pipe.cc:1828
#13 0x0000000000c3ab4d in Pipe::Writer::entry (this=<value optimized out>) at msg/simple/Pipe.h:62
#14 0x000000315f6079d1 in start_thread () from /lib64/libpthread.so.0
#15 0x000000315f2e8b6d in clone () from /lib64/libc.so.6

Then I added some debug output to print the 'left' value and got this:
2015-12-05 11:30:09.298753 7faa42076700 20 -- 172.19.10.1:6841/255898 >> 172.19.10.2:6800/11446 pipe(0x2087f000 sd=162 :6841 s=2 pgs=12117 cs=1 l=1 c=0x186c9020).writer signed seq # 1): sig = 0
2015-12-05 11:30:09.298787 7faa42076700 20 -- 172.19.10.1:6841/255898 >> 172.19.10.2:6800/11446 pipe(0x2087f000 sd=162 :6841 s=2 pgs=12117 cs=1 l=1 c=0x186c9020).writer sending 1 0x3ccc940 blist 271
2015-12-05 11:30:09.298792 7faa42076700 20 -- 172.19.10.1:6841/255898 >> 172.19.10.2:6800/11446 pipe(0x2087f000 sd=162 :6841 s=2 pgs=12117 cs=1 l=1 c=0x186c9020).writer sending 1 0x3ccc940 blist 271
2015-12-05 11:30:10.014508 7faa42076700 20 -- 172.19.10.1:6841/255898 >> 172.19.10.2:6800/11446 pipe(0x2087f000 sd=162 :6841 s=2 pgs=12117 cs=1 l=1 c=0x186c9020).writer sending 1 0x3ccc940 blist 2178765965
2015-12-05 11:30:10.014545 7faa42076700 0 -- 172.19.10.1:6841/255898 >> 172.19.10.2:6800/11446 pipe(0x2087f000 sd=162 :6841 s=2 pgs=12117 cs=1 l=1 c=0x186c9020). left -2116201331
2015-12-05 11:30:10.015764 7faa42076700 -1 msg/simple/Pipe.cc: In function 'int Pipe::write_message(ceph_msg_header&, ceph_msg_footer&, ceph::bufferlist&)' thread 7faa42076700 time 2015-12-05 11:30:10.014567
msg/simple/Pipe.cc: 2330: FAILED assert(left == 0)

Summary:
The message carries 2178765965 bytes of data, but Pipe::write_message() in msg/simple/Pipe.cc declares:

// payload (front+data)
list<bufferptr>::const_iterator pb = blist.buffers().begin();
int b_off = 0; // carry-over buffer offset, if any
int bl_pos = 0; // blist pos
int left = blist.length();

'int' ranges from -2147483648 to 2147483647; 'unsigned int' ranges from 0 to 4294967295.
blist.length() returns an unsigned value, so assigning it to a signed 'int' overflows: 'left' starts out negative, can never be counted down to exactly 0, and assert(left == 0) fails.
I changed it to:
// payload (front+data)
list<bufferptr>::const_iterator pb = blist.buffers().begin();
unsigned b_off = 0; // carry-over buffer offset, if any
unsigned bl_pos = 0; // blist pos
unsigned left = blist.length();

It now works.
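
To make the overflow concrete, here is a minimal standalone sketch (not the Ceph code itself; the variable names are illustrative) showing how a payload length above INT_MAX wraps negative when stored in an int, reproducing the 'left -2116201331' value seen in the log:

#include <cassert>
#include <cstdio>

int main() {
  // Payload size taken from the log above: larger than INT_MAX (2147483647).
  unsigned blist_length = 2178765965u;

  // Original declaration: the unsigned -> int conversion wraps on typical
  // two's-complement targets, giving 2178765965 - 2^32 = -2116201331.
  int left = blist_length;
  std::printf("signed   left = %d\n", left);

  // With a negative counter, a loop guarded by 'left > 0' never runs, so
  // 'left' cannot be counted down to 0 and assert(left == 0) fires.

  // Fixed declaration: keep the byte counter unsigned, as in the patch above.
  unsigned fixed_left = blist_length;
  std::printf("unsigned left = %u\n", fixed_left);
  assert(fixed_left == blist_length);
  return 0;
}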

Actions #1

Updated by Haomai Wang over 8 years ago

  • Category set to msgr
  • Status changed from New to In Progress
  • Source changed from other to Community (dev)
Actions #2

Updated by Haomai Wang over 8 years ago

  • Subject changed from OSD aborted due to large message data content length to Messenger: large message data content length cause int overflow
Actions #3

Updated by Greg Farnum over 8 years ago

Fixing this overflow is fine, but...what are you doing that you have a 2GB message? There's no way that's a safe thing to do to the system as a whole.

Actions #4

Updated by huang jun over 8 years ago

We set up a cluster and created 4 million dirs in the fs, which caused the mds to hang. After we restarted the mds, it sent getomapvals to the osd, with about 2GB of content in that message. We don't know why the omapvals are so large.

Actions #5

Updated by Greg Farnum over 8 years ago

Oh. That would happen if you create extremely large directories without enabling the directory fragmentation. It's disabled by default because it's not as stable as we'd like it to be for a default setting — I'd recommend not creating directories with that many entries right now, but if you need to, you should set "mds bal frag = true" in your MDS config.
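
For reference, a minimal sketch of what that setting could look like in ceph.conf (the option name comes from the comment above; the section placement is an assumption):

[mds]
    mds bal frag = true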

Actions #6

Updated by Sage Weil over 8 years ago

  • Status changed from In Progress to Resolved
Actions #7

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to Messengers
  • Category deleted (msgr)