Project

General

Profile

Actions

Bug #16010

closed

[mds]failed to decode message of type 43 v7: buffer::end_of_buffer

Added by Xiaoxi Chen almost 8 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When listing a directory on CephFS with lots of files(3M+ files), the MDS will hit a bug that cannot decode MOSDOpReply. The log with debug_ms 1 in MDS is below.

It can be reproduced by "ls" in my enviroment, we use ceph-fuse as client, all components are 10.2.1.

2016-05-23 23:04:24.892328 7f0deedf5700 -1 failed to decode message of type 43 v7: buffer::end_of_buffer
2016-05-23 23:04:24.892361 7f0deedf5700  1 dump:
00000000  14 00 00 00 31 30 30 30  30 30 30 30 30 30 62 2e  |....1000000000b.|
00000010  30 30 30 30 30 30 30 30  01 04 00 00 00 00 00 00  |00000000........|
00000020  00 e9 4a 2f fc ff ff ff  ff 1d 00 40 01 00 00 00  |..J/.......@....|
00000030  00 00 00 00 00 e1 01 00  00 00 00 00 00 00 00 00  |................|
00000040  00 e7 0c 00 00 03 00 00  00 13 12 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 02 01 00 00 12  |................|
00000070  12 00 00 00 00 00 00 00  00 00 00 00 00 10 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 27 15 a6 80 01 13 02  00 00 00 06 00 00 00 45  |.'.............E|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 45  00 00 00 12 00 00 00 00  |.......E........|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 e1  01 00 00 00 00 00 00 00  |................|
000000e0

corresponding request is

2016-05-23 23:03:44.596543 7f0dfb25e700  1 -- 10.136.168.54:6800/16714 --> 10.189.11.107:6864/4831 -- osd_op(mds.0.780:62 4.fc2f4ae9 1000000000b.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent] snapc 0=[] RETRY=18 ack+retry+read+known_if_redirected+full_force e3303) v7 -- ?+0 0x7f0e0d68a680 con 0x7f0e0d65b000

Files

payload.bin (224 Bytes) payload.bin the payload dump Kefu Chai, 05/24/2016 01:42 PM
Actions #1

Updated by Kefu Chai almost 8 years ago

Actions #2

Updated by Kefu Chai almost 8 years ago

  • Description updated (diff)
Actions #3

Updated by Kefu Chai almost 8 years ago

(gdb) bt
#0  0x00007fffe6828458 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007fffe68298da in __GI_abort () at abort.c:89
#2  0x00007fffe713df6d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007fffe713bf36 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007fffe713bf81 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007fffe713c199 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007fffebe49be0 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffffffacf0, len=258, dest=...) at /var/ceph/ceph/src/common/buffer.cc:1151
#7  0x00007fffebe41e04 in ceph::buffer::list::iterator::copy (this=0x7fffffffacf0, len=258, dest=...) at /var/ceph/ceph/src/common/buffer.cc:1259
#8  0x00000000014f85b2 in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /var/ceph/ceph/src/osd/osd_types.cc:5567
#9  0x000000000124fbb6 in MOSDOpReply::decode_payload (this=0x2bbad00) at /var/ceph/ceph/src/messages/MOSDOpReply.h:230
#10 0x00000000011e5625 in main (argc=7, argv=0x7fffffffe0d8) at /var/ceph/ceph/src/test/encoding/ceph_dencoder.cc:389
(gdb) f 8
#8  0x00000000014f85b2 in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /var/ceph/ceph/src/osd/osd_types.cc:5567
5567          datap.copy(ops[i].op.payload_len, ops[i].outdata);
(gdb) l
5562    void OSDOp::split_osd_op_vector_out_data(vector<OSDOp>& ops, bufferlist& in)
5563    {
5564      bufferlist::iterator datap = in.begin();
5565      for (unsigned i = 0; i < ops.size(); i++) {
5566        if (ops[i].op.payload_len) {
5567          datap.copy(ops[i].op.payload_len, ops[i].outdata);
5568        }
5569      }
5570    }
5571

### 
(gdb) p ops
$7 = std::vector of length 3, capacity 3 = {{op = {op = {v = 4627}, flags = {v = 0}, {extent = {offset = {v = 0}, length = {v = 0}, truncate_size = {v = 0}, truncate_seq = {v = 0}}, xattr = {name_len = {v = 0}, value_len = {v = 0}, cmp_op = 0 '\000',
////....
(gdb) p ops[0].op.payload_len
$1 = {v = 258}
(gdb) p ops[1].op.payload_len
$2 = {v = 2158368039}
(gdb) p/x ops[1].op.payload_len
$3 = {v = 0x80a61527}
(gdb) p/x ops[2].op.payload_len
$4 = {v = 0x45}
(gdb)

xiaoxi, could you apply following patch and post the dump of the "data" field also as you did? thanks!

diff --git a/src/msg/Message.cc b/src/msg/Message.cc
index f84004b..30f82a7 100644
--- a/src/msg/Message.cc
+++ b/src/msg/Message.cc
@@ -770,6 +770,8 @@ Message *decode_message(CephContext *cct, int crcflags,
       ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump: \n";
       m->get_payload().hexdump(*_dout);
       *_dout << dendl;
+      m->get_data().hexdump(*_dout);
+      *_dout << dendl;
       if (cct->_conf->ms_die_on_bad_msg)
     assert(0);
     }

Actions #4

Updated by Xiaoxi Chen almost 8 years ago

Kefu:
I can compile the code now, but seems the hexdump of data doesn't work.


diff --git a/src/msg/Message.cc b/src/msg/Message.cc
index f84004b..101cba6 100644
--- a/src/msg/Message.cc
+++ b/src/msg/Message.cc
@@ -758,6 +758,7 @@ Message *decode_message(CephContext *cct, int crcflags,
   m->set_payload(front);
   m->set_middle(middle);
   m->set_data(data);
+  lderr(cct) << "data ptr: " << data << dendl;

   try {
     m->decode_payload();
@@ -767,9 +768,12 @@ Message *decode_message(CephContext *cct, int crcflags,
       lderr(cct) << "failed to decode message of type " << type
                 << " v" << header.version
                 << ": " << e.what() << dendl;
-      ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump: \n";
+      ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump payload: \n";
       m->get_payload().hexdump(*_dout);
       *_dout << dendl;
+      ldout(cct, cct->_conf->ms_dump_corrupt_message_level) << "dump data: \n";
+      m->get_data().hexdump(*_dout);
+      *_dout << dendl;


But the data bufferlist is printed out, see below, strangely it do have 2GB in length


2016-05-24 20:15:17.873000 7f90436fb700 -1 data ptr: buffer::list(len=2158368366,
    buffer::ptr(0~2158366720 0x7f906a8a4000 in raw 0x7f906a8a4000 len 2158366720 nref 2),
    buffer::ptr(0~1646 0x7f9067a44e00 in raw 0x7f9067a44e00 len 1646 nref 2)
)
2016-05-24 20:15:17.873233 7f90436fb700 -1 failed to decode message of type 43 v7: buffer::end_of_buffer
2016-05-24 20:15:17.873238 7f90436fb700  1 dump payload:
00000000  14 00 00 00 31 30 30 30  30 30 30 30 30 30 62 2e  |....1000000000b.|
00000010  30 30 30 30 30 30 30 30  01 04 00 00 00 00 00 00  |00000000........|
00000020  00 e9 4a 2f fc ff ff ff  ff 15 00 40 01 00 00 00  |..J/.......@....|
00000030  00 00 00 00 00 e1 01 00  00 00 00 00 00 00 00 00  |................|
00000040  00 f1 0c 00 00 03 00 00  00 13 12 00 00 00 00 00  |................|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 00 00 00  00 00 00 02 01 00 00 12  |................|
00000070  12 00 00 00 00 00 00 00  00 00 00 00 00 10 00 00  |................|
00000080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  00 27 15 a6 80 01 13 02  00 00 00 06 00 00 00 45  |.'.............E|
000000a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  00 00 00 00 00 00 00 45  00 00 00 00 00 00 00 00  |.......E........|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  00 00 00 00 00 00 00 e1  01 00 00 00 00 00 00 00  |................|
000000e0

Actions #5

Updated by Kefu Chai almost 8 years ago

per xiaoxi

I do get 2GB hexdump in mds side....

with the patch above we have the hexdump of the m->get_data().

09fe6f10  00 00 00 00 00 00 00 00  00 00 00 fe ff ff ff ff  |................|
09fe6f20  ff ff ff 00 00 00 00 0c  00 00 00 31 33 33 32 34  |...........13324|
09fe6f30  36 37 5f 68 65 61 64 b2  01 00 00 02 00 00 00 00  |67_head.........|
09fe6f40  00 00 00 49 0d 06 87 01  00 00 84 02 15 00 00 01  |...I............|
09fe6f50  00 00 00 00 00 00 72 aa  3e 57 06 64 c1 11 a4 81  |......r.>W.d....|
09fe6f60  00 00 00 00 00 00 00 00  00 00 01 00 00 00 00 00  |................|
09fe6f70  00 00 00 00 00 00 00 02  02 18 00 00 00 00 00 40  |...............@|
09fe6f80  00 01 00 00 00 00 00 40  00 05 00 00 00 00 00 00  |.......@........|
09fe6f90  00 00 00 00 00 00 00 02  00 00 00 00 00 01 00 00  |................|
09fe6fa0  00 ff ff ff ff ff ff ff  ff 00 00 00 00 00 00 00  |................|
09fe6fb0  00 00 00 00 00 72 aa 3e  57 36 02 d0 11 72 aa 3e  |.....r.>W6...r.>|
09fe6fc0  57 06 64 c1 11 00 00 00  00 00 00 00 00 02 02 20  |W.d............ |
09fe6fd0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
09fe6ff0  00 00 00 03 02 38 00 00  00 00 00 00 00 00 00 00  |.....8..........|
09fe7000  00 00 00 02 00 00 00 00  00 01 00 00 00 00 00 00  |................|
09fe7010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
09fe7030  00 03 02 38 00 00 00 00  00 00 00 00 00 00 00 00  |...8............|
09fe7040  00 02 00 00 00 00 00 01  00 00 00 00 00 00 00 00  |................|
09fe7050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
09fe7060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 a2  |................|
09fe7070  77 55 00 00 00 00 00 00  00 00 00 00 00 00 00 01  |wU..............|
09fe7080  00 00 00 00 00 00 00 ec  55 55 00 00 00 00 00 00  |........UU......|
09fe7090  00 00 00 00 00 00 00 00  00 00 00 ff ff ff ff ff  |................|
09fe70a0  ff ff ff 00 00 00 00 01  01 10 00 00 00 00 00 00  |................|
09fe70b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
09fe70e0  00 fe ff ff ff ff ff ff  ff 00 00 00 00 0c 00 00  |................|
09fe70f0  00 31 33 33 32 34 36 38  5f 68 65 61 64 b2 01 00  |.1332468_head...|
09fe7100  00 02 00 00 00 00 00 00  00 49 0d 06 87 01 00 00  |.........I......|
09fe7110  85 02 15 00 00 01 00 00  00 00 00 00 72 aa 3e 57  |............r.>W|
09fe7120  e3 63 e1 11 a4 81 00 00  00 00 00 00 00 00 00 00  |.c..............|
09fe7130  01 00 00 00 00 00 00 00  00 00 00 00 00 02 02 18  |................|
09fe7140  00 00 00 00 00 40 00 01  00 00 00 00 00 40 00 05  |.....@.......@..|
09fe7150  00 00 00 00 00 00 00 00  00 00 00 00 00 02 00 00  |................|
09fe7160  00 00 00 01 00 00 00 ff  ff ff ff ff ff ff ff 00  |................|
09fe7170  00 00 00 00 00 00 00 00  00 00 00 72 aa 3e 57 d2  |...........r.>W.|
09fe7180  49 f1 11 72 aa 3e 57 e3  63 e1 11 00 00 00 00 00  |I..r.>W.c.......|
09fe7190  00 00 00 02 02 20 00 00  00 00 00 00 00 00 00 00  |..... ..........|
09fe71a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
09fe71b0  00 00 00 00 00 00 00 00  00 03 02 38 00 00 00 00  |...........8....|
09fe71c0  00 00 00 00 00 00 00 00  00 02 00 00 00 00 00 01  |................|
09fe71d0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

per xiaoxi, the values of the encoded kv pair of omap is the inodes. and the dumped keys looks right. the size of the hexdump of m->get_data() matches what gdb printed in the comment before the previous one.

(gdb) p ops[0].op.payload_len
$1 = {v = 258}
(gdb) p ops[1].op.payload_len
$2 = {v = 2158368039}
(gdb) p/x ops[1].op.payload_len
$3 = {v = 0x80a61527}
(gdb) p/x ops[2].op.payload_len
$4 = {v = 0x45}
(gdb)

so i think the received MOSDOpReply is very likely a legitimate one.

questions:
  • why we have such a huge data payload? we might want pose a smaller limit here instead of the upper limit of uint64. please note that we encode the size of a bufferlist using uint32, see "encoding.h". so any number greater than 0xffffffff is not a valid "max_return". we could have an assert in librados or somewhere.
    CDir::_omap_fetch() {
      // ...
      if (keys.empty()) {
        assert(!c);
        rd.omap_get_vals("", "", (uint64_t)-1, &fin->omap, &fin->ret2);
      } else {
      // ...
    
  • why the decode fails. maybe we can also walk thru MOSDOpReply::decode_payload() to see if we have any wrong assumption that could cause the decode error.
Actions #6

Updated by Kefu Chai almost 8 years ago

  • Project changed from Ceph to CephFS
  • Status changed from New to 12
  • Assignee changed from Kefu Chai to Zheng Yan
  • Source changed from other to Community (dev)
Actions #7

Updated by Greg Farnum almost 8 years ago

So we actually have so many dentries in the folder that >2GB of omap seems reasonable? I could certainly see how that would break things, but if we're asserting out inside the MOSDOpReply message decode then I think we have a bug with RADOS as well.
2158368039 is not right on the 2GB mark (10884391 extra bytes over the limit), so it's not a simple 31-bit (/32-bit signed) thing with the data bufferlist itself; do we have any clue if the message was actually sent out correctly?

Actions #8

Updated by Xiaoxi Chen almost 8 years ago

Greg Farnum wrote:

So we actually have so many dentries in the folder that >2GB of omap seems reasonable? I could certainly see how that would break things, but if we're asserting out inside the MOSDOpReply message decode then I think we have a bug with RADOS as well.
2158368039 is not right on the 2GB mark (10884391 extra bytes over the limit), so it's not a simple 31-bit (/32-bit signed) thing with the data bufferlist itself; do we have any clue if the message was actually sent out correctly?

We was tried to do hexdump but that is too big so only first 10% is analysised. But can try to use fwrite to write the binary directly.

Also curious the max size of "omap" dir in all of my OSDs is ~ 400MB+, how can we generate 2GB+ data from one omap query? is that because leveldb did some compression?

Actions #9

Updated by Zheng Yan almost 8 years ago

(gdb) bt
#0  ceph::buffer::end_of_buffer::end_of_buffer (this=0xbce18b0) at /home/zhyan/Ceph/ceph/src/include/buffer.h:82
#1  0x0000000001bfc27f in ceph::buffer::list::iterator_impl<false>::advance (this=0x7fffe97eb550, o=-2026078208) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1077
#2  0x0000000001bfc940 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffe97eb550, len=64, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1159
#3  0x0000000001bf3e76 in ceph::buffer::list::iterator::copy (this=0x7fffe97eb550, len=2268888894, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1259
#4  0x0000000001ba6d7a in OSDOp::split_osd_op_vector_out_data (ops=std::vector of length 3, capacity 3 = {...}, in=...) at /home/zhyan/Ceph/ceph/src/osd/osd_types.cc:5567
#5  0x0000000001aa3096 in MOSDOpReply::decode_payload (this=0xbd13400) at /home/zhyan/Ceph/ceph/src/messages/MOSDOpReply.h:230
#6  0x0000000001a96671 in decode_message (cct=0xbc7e000, crcflags=3, header=..., footer=..., front=..., middle=..., data=...) at /home/zhyan/Ceph/ceph/src/msg/Message.cc:763
#7  0x0000000001d44d7c in Pipe::read_message (this=0xbcf2800, pm=0x7fffe97ebfe0, auth_handler=0xbc8b500) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.cc:2094
#8  0x0000000001d3fa6f in Pipe::reader (this=0xbcf2800) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.cc:1606
#9  0x0000000001d48572 in Pipe::Reader::entry (this=0xbcf2818) at /home/zhyan/Ceph/ceph/src/msg/simple/Pipe.h:52
#10 0x0000000001c009a9 in Thread::entry_wrapper (this=0xbcf2818) at /home/zhyan/Ceph/ceph/src/common/Thread.cc:87
#11 0x0000000001c008de in Thread::_entry_func (arg=0xbcf2818) at /home/zhyan/Ceph/ceph/src/common/Thread.cc:67
#12 0x00007ffff631f60a in start_thread () from /lib64/libpthread.so.0
#13 0x00007ffff4b5059d in clone () from /lib64/libc.so.6
(gdb) f 1
#1  0x0000000001bfc27f in ceph::buffer::list::iterator_impl<false>::advance (this=0x7fffe97eb550, o=-2026078208) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1077
1077        throw end_of_buffer();
(gdb) p o
$27 = -2026078208
(gdb) f 2
#2  0x0000000001bfc940 in ceph::buffer::list::iterator_impl<false>::copy (this=0x7fffe97eb550, len=64, dest=...) at /home/zhyan/Ceph/ceph/src/common/buffer.cc:1159
1159          advance(howmuch);
(gdb) p howmuch
$28 = 2268888830
(gdb) 

void buffer::list::iterator_impl<is_const>::advance(int o) can't handle value that > 2G

Actions #10

Updated by Zheng Yan almost 8 years ago

  • Project changed from CephFS to Ceph
  • Assignee deleted (Zheng Yan)
Actions #11

Updated by Xiaoxi Chen almost 8 years ago

Zheng Yan wrote:

[...]

void buffer::list::iterator_impl<is_const>::advance(int o) can't handle value that > 2G

@Zheng, but we still need to fix the mismatch between max message size ( 2GB for now and 4GB if we fix this bug, int->uint) and the requested max size in cdir(UINT64-1), right?

Actions #12

Updated by Zheng Yan almost 8 years ago

Xiaoxi Chen wrote:

Zheng Yan wrote:

[...]

void buffer::list::iterator_impl<is_const>::advance(int o) can't handle value that > 2G

@Zheng, but we still need to fix the mismatch between max message size ( 2GB for now and 4GB if we fix this bug, int->uint) and the requested max size in cdir(UINT64-1), right?

it's more like omap_get_vals() issue.

Actions #13

Updated by Kefu Chai almost 8 years ago

  • Assignee set to Kefu Chai
Actions #14

Updated by Kefu Chai almost 8 years ago

  • Status changed from 12 to 4
Actions #15

Updated by Kefu Chai almost 8 years ago

  • Status changed from 4 to Fix Under Review
Actions #16

Updated by Kefu Chai almost 8 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF