Bug #16010
[mds] failed to decode message of type 43 v7: buffer::end_of_buffer
Status: closed
Description
When listing a directory with many files (3M+) on CephFS, the MDS hits a bug where it cannot decode an MOSDOpReply. The MDS log with debug_ms = 1 is below.
It can be reproduced by "ls" in my environment. We use ceph-fuse as the client; all components are 10.2.1.
2016-05-23 23:04:24.892328 7f0deedf5700 -1 failed to decode message of type 43 v7: buffer::end_of_buffer
2016-05-23 23:04:24.892361 7f0deedf5700  1 dump:
00000000 14 00 00 00 31 30 30 30 30 30 30 30 30 30 62 2e |....1000000000b.|
00000010 30 30 30 30 30 30 30 30 01 04 00 00 00 00 00 00 |00000000........|
00000020 00 e9 4a 2f fc ff ff ff ff 1d 00 40 01 00 00 00 |..J/.......@....|
00000030 00 00 00 00 00 e1 01 00 00 00 00 00 00 00 00 00 |................|
00000040 00 e7 0c 00 00 03 00 00 00 13 12 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 02 01 00 00 12 |................|
00000070 12 00 00 00 00 00 00 00 00 00 00 00 00 10 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 27 15 a6 80 01 13 02 00 00 00 06 00 00 00 45 |.'.............E|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000b0 00 00 00 00 00 00 00 45 00 00 00 12 00 00 00 00 |.......E........|
000000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000d0 00 00 00 00 00 00 00 e1 01 00 00 00 00 00 00 00 |................|
000000e0
corresponding request is
2016-05-23 23:03:44.596543 7f0dfb25e700 1 -- 10.136.168.54:6800/16714 --> 10.189.11.107:6864/4831 -- osd_op(mds.0.780:62 4.fc2f4ae9 1000000000b.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent] snapc 0=[] RETRY=18 ack+retry+read+known_if_redirected+full_force e3303) v7 -- ?+0 0x7f0e0d68a680 con 0x7f0e0d65b000
Updated by Kefu Chai almost 8 years ago
(gdb) bt
#0  0x00007fffe6828458 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fffe68298da in __GI_abort () at abort.c:89
#2  0x00007fffe713df6d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffe713bf36 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffe713bf81 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffe713c199 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffebe49be0 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffffffacf0, len=258, dest=...) at /var/ceph/ceph/src/common/buffer.cc:1151
#7  0x00007fffebe41e04 in ceph::buffer::list::iterator::copy (this=0x7fffffffacf0, len=258, dest=...) at /var/ceph/ceph/src/common/buffer.cc:1259
#8  0x00000000014f85b2 in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /var/ceph/ceph/src/osd/osd_types.cc:5567
#9  0x000000000124fbb6 in MOSDOpReply::decode_payload (this=0x2bbad00) at /var/ceph/ceph/src/messages/MOSDOpReply.h:230
#10 0x00000000011e5625 in main (argc=7, argv=0x7fffffffe0d8) at /var/ceph/ceph/src/test/encoding/ceph_dencoder.cc:389
(gdb) f 8
#8  0x00000000014f85b2 in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /var/ceph/ceph/src/osd/osd_types.cc:5567
5567	      datap.copy(ops[i].op.payload_len, ops[i].outdata);
(gdb) l
5562	void OSDOp::split_osd_op_vector_out_data(vector<OSDOp>& ops, bufferlist& in)
5563	{
5564	  bufferlist::iterator datap = in.begin();
5565	  for (unsigned i = 0; i < ops.size(); i++) {
5566	    if (ops[i].op.payload_len) {
5567	      datap.copy(ops[i].op.payload_len, ops[i].outdata);
5568	    }
5569	  }
5570	}
5571
(gdb) p ops
$7 = std::vector of length 3, capacity 3 = {{op = {op = {v = 4627}, flags = {v = 0}, {extent = {offset = {v = 0}, length = {v = 0}, truncate_size = {v = 0}, truncate_seq = {v = 0}}, xattr = {name_len = {v = 0}, value_len = {v = 0}, cmp_op = 0 '\000', ////....
(gdb) p ops[0].op.payload_len
$1 = {v = 258}
(gdb) p ops[1].op.payload_len
$2 = {v = 2158368039}
(gdb) p/x ops[1].op.payload_len
$3 = {v = 0x80a61527}
(gdb) p/x ops[2].op.payload_len
$4 = {v = 0x45}
(gdb)
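For context, a defensive variant of split_osd_op_vector_out_data() might validate each payload_len against the bytes actually remaining before copying. The sketch below is illustrative only; the helper name is made up and this is not the shipped fix:

void split_osd_op_vector_out_data_checked(std::vector<OSDOp>& ops,
                                          bufferlist& in)
{
  bufferlist::iterator datap = in.begin();
  uint64_t remaining = in.length();
  for (unsigned i = 0; i < ops.size(); i++) {
    uint64_t len = ops[i].op.payload_len;
    if (len > remaining)
      // surface a bogus length as a decode error instead of an
      // uncaught end_of_buffer deep inside iterator::copy()
      throw buffer::malformed_input("payload_len exceeds out_data length");
    if (len) {
      datap.copy(len, ops[i].outdata);
      remaining -= len;
    }
  }
}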
xiaoxi, could you apply the following patch and post the dump of the "data" field as well, as you did before? thanks!
diff --git a/src/msg/Message.cc b/src/msg/Message.cc
index f84004b..30f82a7 100644
--- a/src/msg/Message.cc
+++ b/src/msg/Message.cc
@@ -770,6 +770,8 @@ Message *decode_message(CephContext *cct, int crcflags,
     ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump: \n";
     m->get_payload().hexdump(*_dout);
     *_dout << dendl;
+    m->get_data().hexdump(*_dout);
+    *_dout << dendl;
     if (cct->_conf->ms_die_on_bad_msg)
       assert(0);
   }
Updated by Xiaoxi Chen almost 8 years ago
Kefu:
I can compile the code now, but it seems the hexdump of the data doesn't work.
diff --git a/src/msg/Message.cc b/src/msg/Message.cc
index f84004b..101cba6 100644
--- a/src/msg/Message.cc
+++ b/src/msg/Message.cc
@@ -758,6 +758,7 @@ Message *decode_message(CephContext *cct, int crcflags,
   m->set_payload(front);
   m->set_middle(middle);
   m->set_data(data);
+  lderr(cct) << "data ptr: " << data << dendl;

   try {
     m->decode_payload();
@@ -767,9 +768,12 @@ Message *decode_message(CephContext *cct, int crcflags,
     lderr(cct) << "failed to decode message of type " << type
 	       << " v" << header.version
 	       << ": " << e.what() << dendl;
-    ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump: \n";
+    ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump payload: \n";
     m->get_payload().hexdump(*_dout);
     *_dout << dendl;
+    ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump data: \n";
+    m->get_data().hexdump(*_dout);
+    *_dout << dendl;
But the data bufferlist is printed out, see below; strangely, it does have 2GB in length.
2016-05-24 20:15:17.873000 7f90436fb700 -1 data ptr: buffer::list(len=2158368366, buffer::ptr(0~2158366720 0x7f906a8a4000 in raw 0x7f906a8a4000 len 2158366720 nref 2), buffer::ptr(0~1646 0x7f9067a44e00 in raw 0x7f9067a44e00 len 1646 nref 2))
2016-05-24 20:15:17.873233 7f90436fb700 -1 failed to decode message of type 43 v7: buffer::end_of_buffer
2016-05-24 20:15:17.873238 7f90436fb700  1 dump payload:
00000000 14 00 00 00 31 30 30 30 30 30 30 30 30 30 62 2e |....1000000000b.|
00000010 30 30 30 30 30 30 30 30 01 04 00 00 00 00 00 00 |00000000........|
00000020 00 e9 4a 2f fc ff ff ff ff 15 00 40 01 00 00 00 |..J/.......@....|
00000030 00 00 00 00 00 e1 01 00 00 00 00 00 00 00 00 00 |................|
00000040 00 f1 0c 00 00 03 00 00 00 13 12 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000060 00 00 00 00 00 00 00 00 00 00 00 02 01 00 00 12 |................|
00000070 12 00 00 00 00 00 00 00 00 00 00 00 00 10 00 00 |................|
00000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 27 15 a6 80 01 13 02 00 00 00 06 00 00 00 45 |.'.............E|
000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000b0 00 00 00 00 00 00 00 45 00 00 00 00 00 00 00 00 |.......E........|
000000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000d0 00 00 00 00 00 00 00 e1 01 00 00 00 00 00 00 00 |................|
000000e0
Updated by Kefu Chai almost 8 years ago
per xiaoxi: "I do get a 2GB hexdump on the MDS side..."

With the patch above we have the hexdump of m->get_data():
09fe6f10 00 00 00 00 00 00 00 00 00 00 00 fe ff ff ff ff |................|
09fe6f20 ff ff ff 00 00 00 00 0c 00 00 00 31 33 33 32 34 |...........13324|
09fe6f30 36 37 5f 68 65 61 64 b2 01 00 00 02 00 00 00 00 |67_head.........|
09fe6f40 00 00 00 49 0d 06 87 01 00 00 84 02 15 00 00 01 |...I............|
09fe6f50 00 00 00 00 00 00 72 aa 3e 57 06 64 c1 11 a4 81 |......r.>W.d....|
09fe6f60 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 |................|
09fe6f70 00 00 00 00 00 00 00 02 02 18 00 00 00 00 00 40 |...............@|
09fe6f80 00 01 00 00 00 00 00 40 00 05 00 00 00 00 00 00 |.......@........|
09fe6f90 00 00 00 00 00 00 00 02 00 00 00 00 00 01 00 00 |................|
09fe6fa0 00 ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 |................|
09fe6fb0 00 00 00 00 00 72 aa 3e 57 36 02 d0 11 72 aa 3e |.....r.>W6...r.>|
09fe6fc0 57 06 64 c1 11 00 00 00 00 00 00 00 00 02 02 20 |W.d............ |
09fe6fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
09fe6ff0 00 00 00 03 02 38 00 00 00 00 00 00 00 00 00 00 |.....8..........|
09fe7000 00 00 00 02 00 00 00 00 00 01 00 00 00 00 00 00 |................|
09fe7010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
09fe7030 00 03 02 38 00 00 00 00 00 00 00 00 00 00 00 00 |...8............|
09fe7040 00 02 00 00 00 00 00 01 00 00 00 00 00 00 00 00 |................|
09fe7050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
09fe7060 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a2 |................|
09fe7070 77 55 00 00 00 00 00 00 00 00 00 00 00 00 00 01 |wU..............|
09fe7080 00 00 00 00 00 00 00 ec 55 55 00 00 00 00 00 00 |........UU......|
09fe7090 00 00 00 00 00 00 00 00 00 00 00 ff ff ff ff ff |................|
09fe70a0 ff ff ff 00 00 00 00 01 01 10 00 00 00 00 00 00 |................|
09fe70b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
09fe70e0 00 fe ff ff ff ff ff ff ff 00 00 00 00 0c 00 00 |................|
09fe70f0 00 31 33 33 32 34 36 38 5f 68 65 61 64 b2 01 00 |.1332468_head...|
09fe7100 00 02 00 00 00 00 00 00 00 49 0d 06 87 01 00 00 |.........I......|
09fe7110 85 02 15 00 00 01 00 00 00 00 00 00 72 aa 3e 57 |............r.>W|
09fe7120 e3 63 e1 11 a4 81 00 00 00 00 00 00 00 00 00 00 |.c..............|
09fe7130 01 00 00 00 00 00 00 00 00 00 00 00 00 02 02 18 |................|
09fe7140 00 00 00 00 00 40 00 01 00 00 00 00 00 40 00 05 |.....@.......@..|
09fe7150 00 00 00 00 00 00 00 00 00 00 00 00 00 02 00 00 |................|
09fe7160 00 00 00 01 00 00 00 ff ff ff ff ff ff ff ff 00 |................|
09fe7170 00 00 00 00 00 00 00 00 00 00 00 72 aa 3e 57 d2 |...........r.>W.|
09fe7180 49 f1 11 72 aa 3e 57 e3 63 e1 11 00 00 00 00 00 |I..r.>W.c.......|
09fe7190 00 00 00 02 02 20 00 00 00 00 00 00 00 00 00 00 |..... ..........|
09fe71a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
09fe71b0 00 00 00 00 00 00 00 00 00 03 02 38 00 00 00 00 |...........8....|
09fe71c0 00 00 00 00 00 00 00 00 00 02 00 00 00 00 00 01 |................|
09fe71d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
per xiaoxi, the values of the encoded omap key/value pairs are the inodes, and the dumped keys look right. The size of the hexdump of m->get_data() matches what gdb printed in the comment before the previous one:
(gdb) p ops[0].op.payload_len
$1 = {v = 258}
(gdb) p ops[1].op.payload_len
$2 = {v = 2158368039}
(gdb) p/x ops[1].op.payload_len
$3 = {v = 0x80a61527}
(gdb) p/x ops[2].op.payload_len
$4 = {v = 0x45}
(gdb)
So I think the received MOSDOpReply is very likely a legitimate one.
Questions:

- Why do we have such a huge data payload? We might want to impose a smaller limit here instead of the upper limit of uint64. Please note that we encode the size of a bufferlist using uint32 (see "encoding.h"), so any number greater than 0xffffffff is not a valid "max_return"; we could have an assert in librados or somewhere (see the truncation sketch after this list). The request comes from:

  CDir::_omap_fetch() {
    // ...
    if (keys.empty()) {
      assert(!c);
      rd.omap_get_vals("", "", (uint64_t)-1, &fin->omap, &fin->ret2);
    } else {
      // ...

- Why does the decode fail? Maybe we can also walk through MOSDOpReply::decode_payload() to see if we have any wrong assumption that could cause the decode error.
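A minimal sketch of the truncation concern in the first question; this is illustrative C++, not Ceph code, and the variable names are made up:

#include <cassert>
#include <cstdint>

int main() {
  // What CDir::_omap_fetch() asks for as max_return:
  uint64_t max_return = (uint64_t)-1;        // 0xffffffffffffffff
  // What a 32-bit wire length field (see encoding.h) can actually carry:
  uint32_t wire_len = (uint32_t)max_return;  // silently truncated
  assert((uint64_t)wire_len != max_return);  // the request cannot round-trip
  return 0;
}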
Updated by Kefu Chai almost 8 years ago
- Project changed from Ceph to CephFS
- Status changed from New to 12
- Assignee changed from Kefu Chai to Zheng Yan
- Source changed from other to Community (dev)
Updated by Greg Farnum almost 8 years ago
So we actually have so many dentries in the folder that >2GB of omap seems reasonable? I could certainly see how that would break things, but if we're asserting out inside the MOSDOpReply message decode then I think we have a bug with RADOS as well.
2158368039 is not right on the 2GB mark (10884391 extra bytes over the limit), so it's not a simple 31-bit (/32-bit signed) thing with the data bufferlist itself; do we have any clue if the message was actually sent out correctly?
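As a quick check of that arithmetic (a standalone sketch, not part of the original comment):

#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t len = 2158368039ULL;   // reported data length
  const uint64_t two_gib = 1ULL << 31;  // 2147483648, the signed 32-bit ceiling
  // Prints 10884391, the overshoot quoted above:
  printf("%llu\n", (unsigned long long)(len - two_gib));
  return 0;
}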
Updated by Xiaoxi Chen almost 8 years ago
Greg Farnum wrote:
So we actually have so many dentries in the folder that >2GB of omap seems reasonable? I could certainly see how that would break things, but if we're asserting out inside the MOSDOpReply message decode then I think we have a bug with RADOS as well.
2158368039 is not right on the 2GB mark (10884391 extra bytes over the limit), so it's not a simple 31-bit (/32-bit signed) thing with the data bufferlist itself; do we have any clue if the message was actually sent out correctly?
We tried to do a hexdump, but it is too big, so only the first 10% was analyzed. But I can try to use fwrite to write the binary out directly.
Also, I am curious: the max size of the "omap" dir across all of my OSDs is ~400MB+, so how can we generate 2GB+ of data from one omap query? Is that because leveldb did some compression?
Updated by Zheng Yan almost 8 years ago
(gdb) bt
#0  ceph::buffer::end_of_buffer::end_of_buffer (this=0xbce18b0) at /home/zhyan/Ceph/ceph/src/include/buffer.h:82
#1  0x0000000001bfc27f in ceph::buffer::list::iterator_impl<false>::advance (this=0x7fffe97eb550, o=-2026078208) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1077
#2  0x0000000001bfc940 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffe97eb550, len=64, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1159
#3  0x0000000001bf3e76 in ceph::buffer::list::iterator::copy (this=0x7fffe97eb550, len=2268888894, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1259
#4  0x0000000001ba6d7a in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /home/zhyan/Ceph/ceph/src/osd/osd_types.cc:5567
#5  0x0000000001aa3096 in MOSDOpReply::decode_payload (this=0xbd13400) at /home/zhyan/Ceph/ceph/src/messages/MOSDOpReply.h:230
#6  0x0000000001a96671 in decode_message (cct=0xbc7e000, crcflags=3, header=..., footer=..., front=..., middle=..., data=...) at /home/zhyan/Ceph/ceph/src/msg/Message.cc:763
#7  0x0000000001d44d7c in Pipe::read_message (this=0xbcf2800, pm=0x7fffe97ebfe0, auth_handler=0xbc8b500) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.cc:2094
#8  0x0000000001d3fa6f in Pipe::reader (this=0xbcf2800) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.cc:1606
#9  0x0000000001d48572 in Pipe::Reader::entry (this=0xbcf2818) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.h:52
#10 0x0000000001c009a9 in Thread::entry_wrapper (this=0xbcf2818) at /home/zhyan/Ceph/ceph/src/common/Thread.cc:87
#11 0x0000000001c008de in Thread::_entry_func (arg=0xbcf2818) at /home/zhyan/Ceph/ceph/src/common/Thread.cc:67
#12 0x00007ffff631f60a in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff4b5059d in clone () from /lib64/libc.so.6
(gdb) f 1
#1  0x0000000001bfc27f in ceph::buffer::list::iterator_impl<false>::advance (this=0x7fffe97eb550, o=-2026078208) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1077
1077	    throw end_of_buffer();
(gdb) p o
$27 = -2026078208
(gdb) f 2
#2  0x0000000001bfc940 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffe97eb550, len=64, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1159
1159	    advance(howmuch);
(gdb) p howmuch
$28 = 2268888830
(gdb)
void buffer::list::iterator_impl<is_const>::advance(int o) can't handle values larger than 2G.
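A minimal sketch of that failure mode, assuming the length reaches advance() as a plain int (the input value is taken from the gdb session above):

#include <cstdio>

int main() {
  // An offset above 2^31, like the howmuch seen in gdb:
  unsigned howmuch = 2268888830u;
  // advance(int o) receives it as a signed int, so it wraps to a large
  // negative value and the iterator throws end_of_buffer:
  int o = (int)howmuch;
  printf("howmuch=%u o=%d\n", howmuch, o);
  return 0;
}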
Updated by Zheng Yan almost 8 years ago
- Project changed from CephFS to Ceph
- Assignee deleted (Zheng Yan)
Updated by Xiaoxi Chen almost 8 years ago
Zheng Yan wrote:
[...]
void buffer::list::iterator_impl<is_const>::advance(int o) can't handle values larger than 2G.
@Zheng, but we still need to fix the mismatch between the max message size (2GB for now, and 4GB if we fix this bug, int -> uint) and the requested max size in CDir ((uint64_t)-1), right?
Updated by Zheng Yan almost 8 years ago
Xiaoxi Chen wrote:
Zheng Yan wrote:
[...]
void buffer::list::iterator_impl<is_const>::advance(int o) can't handle values larger than 2G.
@Zheng, but we still need to fix the mismatch between the max message size (2GB for now, and 4GB if we fix this bug, int -> uint) and the requested max size in CDir ((uint64_t)-1), right?
It's more of an omap_get_vals() issue.
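For illustration, here is a hedged sketch of reading an object's omap in bounded batches through the synchronous librados C++ API, instead of requesting (uint64_t)-1 keys at once. The helper name and the batch size are made up, and the real CDir::_omap_fetch() is asynchronous and structured differently:

#include <rados/librados.hpp>
#include <cstdint>
#include <map>
#include <string>

// Fetch the whole omap of `oid`, at most kBatch keys per request.
int fetch_omap_chunked(librados::IoCtx& ioctx, const std::string& oid,
                       std::map<std::string, librados::bufferlist>* out)
{
  const uint64_t kBatch = 1024;  // illustrative batch size
  std::string last_key;          // start_after for the next request
  while (true) {
    std::map<std::string, librados::bufferlist> chunk;
    int r = ioctx.omap_get_vals(oid, last_key, kBatch, &chunk);
    if (r < 0)
      return r;                  // propagate the error
    if (chunk.empty())
      break;                     // no more keys
    last_key = chunk.rbegin()->first;  // resume after the last key seen
    out->insert(chunk.begin(), chunk.end());
    if (chunk.size() < kBatch)
      break;                     // short batch: we reached the end
  }
  return 0;
}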
Updated by Kefu Chai almost 8 years ago
- Status changed from 12 to 4
Updated by Kefu Chai almost 8 years ago
- Status changed from 4 to Fix Under Review
Updated by Kefu Chai almost 8 years ago
- Status changed from Fix Under Review to Resolved