Bug #24117
closed
cls_bucket_list fails causes cascading osd crashes
Description
A cls_bucket_list operation on one bucket in our cluster caused a cascading failure: each OSD that attempted the operation crashed. The crash was due to an invalid string-to-long-long conversion while parsing an omap key (https://github.com/ceph/ceph/blob/luminous/src/cls/rgw/cls_rgw.cc#L325-L359). Looking at the bucket's omap entries, we found a number of binary strings that don't adhere to the formats expected by the rgw code. The full dump of the keys and values exceeds the attachment limit, but here's a sample (the bucket has only one index object and one S3 object):
myfilename
value (213 bytes) :
00000000 08 03 cf 00 00 00 0a 00 00 00 6d 79 66 69 6c 65 |..........myfile|
00000010 6e 61 6d 65 ab 30 01 00 00 00 00 00 01 05 03 59 |name.0.........Y|
...etc - we are able to decode the entry using ceph-dencoder, and it corresponds to the single object in the bucket
key (299638 bytes):
00000000 ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 00 48 |......JFIF.....H|
00000010 00 48 00 00 ff db 00 43 00 0a 07 07 08 07 06 0a |.H.....C........|
00000020 08 08 08 0b 0a 0a 0b 0e 18 10 0e 0d 0d 0e 1d 15 |................|
00000030 16 11 18 23 1f 25 24 22 1f 22 21 26 2b 37 2f 26 |...#.%$"."!&+7/&|
00000040 29 34 29 21 22 30 41 31 34 39 3b 3e 3e 3e 25 2e |)4)!"0A149;>>>%.|
...etc
value (299751 bytes) :
00000000 08 03 e1 92 04 00 76 92 04 00 ff d8 ff e0 00 10 |......v.........|
00000010 4a 46 49 46 00 01 01 01 00 48 00 48 00 00 ff db |JFIF.....H.H....|
00000020 00 43 00 0a 07 07 08 07 06 0a 08 08 08 0b 0a 0a |.C..............|
00000030 0b 0e 18 10 0e 0d 0d 0e 1d 15 16 11 18 23 1f 25 |.............#.%|
00000040 24 22 1f 22 21 26 2b 37 2f 26 29 34 29 21 22 30 |$"."!&+7/&)4)!"0|
00000050 41 31 34 39 3b 3e 3e 3e 25 2e 44 49 43 3c 48 37 |A149;>>>%.DIC<H7|
00000060 3d 3e 3b ff db 00 43 01 0a 0b 0b 0e 0d 0e 1c 10 |=>;...C.........|
00000070 10 1c 3b 28 22 28 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b |..;("(;;;;;;;;;;|
00000080 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b |;;;;;;;;;;;;;;;;|
00000090 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b 3b |;;;;;;;;;;;;;;;;|
000000a0 3b 3b 3b 3b 3b 3b 3b 3b ff c2 00 11 08 04 2b 06 |;;;;;;;;......+.|
...etc
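The leading bytes of both value dumps can be read by hand. The sketch below is one possible reading, not a confirmed decoding: it assumes each value starts with a one-byte version, a one-byte compat version and a 32-bit little-endian payload length, followed by a 32-bit little-endian length for the first string field (the entry name). The function names are hypothetical. Under that assumption the computed sizes match the 213 and 299751 bytes reported above, and the name length in the second value equals the 299638-byte key, whose first bytes carry a JPEG/JFIF signature.

```python
import struct

# First 20 bytes of each "value" dump, copied from the hexdumps above.
SMALL_VALUE_PREFIX = bytes.fromhex(
    "08 03 cf 00 00 00 0a 00 00 00 6d 79 66 69 6c 65 6e 61 6d 65"
)
LARGE_VALUE_PREFIX = bytes.fromhex(
    "08 03 e1 92 04 00 76 92 04 00 ff d8 ff e0 00 10 4a 46 49 46"
)

# Assumed layout: u8 version, u8 compat, u32 LE payload length, u32 LE name length.
HEADER = "<BBII"
ENVELOPE_SIZE = 6  # version + compat + payload length field


def parse_prefix(buf):
    """Decode the assumed header and return the visible part of the name."""
    struct_v, struct_compat, payload_len, name_len = struct.unpack_from(HEADER, buf, 0)
    name_start = struct.calcsize(HEADER)
    return {
        "struct_v": struct_v,
        "struct_compat": struct_compat,
        "total_len": payload_len + ENVELOPE_SIZE,
        "name_len": name_len,
        "name_prefix": buf[name_start:name_start + name_len],
    }


def looks_like_jfif(data):
    """True if data starts with a JPEG SOI marker and a JFIF tag at offset 6."""
    return data[:2] == b"\xff\xd8" and data[6:10] == b"JFIF"


if __name__ == "__main__":
    for label, prefix in (("small", SMALL_VALUE_PREFIX), ("large", LARGE_VALUE_PREFIX)):
        info = parse_prefix(prefix)
        print(label, info["total_len"], info["name_len"], looks_like_jfif(info["name_prefix"]))
```

If this reading is right, the oversized entry is internally consistent and simply records an entry name that is itself image data, which may help narrow down where the bad key was written.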
Error with logging turned up:
12.2.2/src/cls/rgw/cls_rgw.cc: In function 'void decode_list_index_key(const string&, cls_rgw_obj_key*, uint64_t*)' thread 7f0032bc5700 time 2018-04-26 20:53:57.301220
/build/ceph-12.2.2/src/cls/rgw/cls_rgw.cc: 356: FAILED assert(err.empty())
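To illustrate how an arbitrary binary key could reach a failed string-to-integer conversion, here is a rough Python model of this kind of key parsing. It is only a sketch under assumptions: that a key containing NUL bytes is split on them, that a segment starting with `i` holds an instance and one starting with `v` holds a decimal version. The real logic lives in the linked C++ function and may differ; `decode_index_key` and `parse_version_segment` are hypothetical names. The model collects conversion errors instead of asserting on them.

```python
import re


def parse_version_segment(segment):
    """Strictly parse a base-10 integer; return (value, error_message)."""
    if re.fullmatch(rb"-?[0-9]+", segment) is None:
        return None, "not a decimal integer: %r" % (segment,)
    return int(segment.decode("ascii")), ""


def decode_index_key(key):
    """Hypothetical model of index-key parsing that reports errors instead of asserting."""
    result = {"name": key, "instance": b"", "ver": 0, "errors": []}
    if b"\x00" not in key:
        # Plain key: the whole key is the object name.
        return result
    parts = key.split(b"\x00")
    result["name"] = parts[0]
    for segment in parts[1:]:
        if segment[:1] == b"i":
            result["instance"] = segment[1:]
        elif segment[:1] == b"v":
            value, err = parse_version_segment(segment[1:])
            if err:
                result["errors"].append(err)
            else:
                result["ver"] = value
    return result


if __name__ == "__main__":
    print(decode_index_key(b"myfilename"))
    print(decode_index_key(b"name\x00v123\x00iabc"))
    # Image-like data: a NUL followed by a 'v' byte and non-digit bytes.
    print(decode_index_key(b"\xff\xd8\xff\xe0\x00v\x92\x04junk"))
```

A key made of image data contains many NUL bytes, so under these assumptions any segment that happens to begin with `v` would be handed to the integer parser and rejected.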
Stack trace that we encountered:
0> 2018-04-26 21:13:24.246445 7fe4f60b0700 -1 *** Caught signal (Aborted) **
in thread 7fe4f60b0700 thread_name:tp_osd_tp
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
1: (()+0xa12e99) [0x5599d5d38e99]
2: (()+0x10330) [0x7fe518451330]
3: (gsignal()+0x37) [0x7fe517471c37]
4: (abort()+0x148) [0x7fe517475028]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x5599d5d759a0]
6: (rgw_bucket_list(void*, ceph::buffer::list*, ceph::buffer::list*)+0x1141) [0x7fe508ca08f1]
7: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::list&, ceph::buffer::list&)+0x24) [0x5599d5890024]
8: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x11ab) [0x5599d59a15eb]
9: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x8f) [0x5599d59b221f]
10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x723) [0x5599d59b2f73]
11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30f6) [0x5599d59b77a6]
12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe66) [0x5599d5975696]
13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x5599d58098b6]
14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5599d5a70687]
15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xff5) [0x5599d5836b85]
16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5599d5d7afef]
17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5599d5d7cf40]
18: (()+0x8184) [0x7fe518449184]
19: (clone()+0x6d) [0x7fe517538ffd]
Our investigation is now focusing on leveldb, but any ideas/insights would be appreciated!