Bug #24117


cls_bucket_list failure causes cascading osd crashes

Added by Nick Janus almost 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: High
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Backport: mimic, luminous
Regression: No
Severity: 2 - major
ceph-qa-suite: rgw

Description

A cls_bucket_list operation on a certain bucket in our cluster caused a cascading failure: osds would attempt the operation and then crash. The crash was due to an invalid string-to-long-long conversion while parsing an omap key (https://github.com/ceph/ceph/blob/luminous/src/cls/rgw/cls_rgw.cc#L325-L359). Looking at the bucket's omap entries, we found a number of binary strings that don't adhere to the format expected by the rgw code. The full dump of the keys and values goes over the attachment limit, but here's a sample (the bucket has only one index object and one s3 object):

myfilename
value (213 bytes) :
00000000  08 03 cf 00 00 00 0a 00  00 00 6d 79 66 69 6c 65  |..........myfile|
00000010  6e 61 6d 65 ab 30 01 00  00 00 00 00 01 05 03 59  |name.0.........Y|
...etc - we are able to decode this entry with ceph-dencoder (example command after the dumps below), and it corresponds to the single object in the bucket

key (299638 bytes):
00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 01 00 48  |......JFIF.....H|
00000010  00 48 00 00 ff db 00 43  00 0a 07 07 08 07 06 0a  |.H.....C........|
00000020  08 08 08 0b 0a 0a 0b 0e  18 10 0e 0d 0d 0e 1d 15  |................|
00000030  16 11 18 23 1f 25 24 22  1f 22 21 26 2b 37 2f 26  |...#.%$"."!&+7/&|
00000040  29 34 29 21 22 30 41 31  34 39 3b 3e 3e 3e 25 2e  |)4)!"0A149;>>>%.|
...etc

value (299751 bytes) :
00000000  08 03 e1 92 04 00 76 92  04 00 ff d8 ff e0 00 10  |......v.........|
00000010  4a 46 49 46 00 01 01 01  00 48 00 48 00 00 ff db  |JFIF.....H.H....|
00000020  00 43 00 0a 07 07 08 07  06 0a 08 08 08 0b 0a 0a  |.C..............|
00000030  0b 0e 18 10 0e 0d 0d 0e  1d 15 16 11 18 23 1f 25  |.............#.%|
00000040  24 22 1f 22 21 26 2b 37  2f 26 29 34 29 21 22 30  |$"."!&+7/&)4)!"0|
00000050  41 31 34 39 3b 3e 3e 3e  25 2e 44 49 43 3c 48 37  |A149;>>>%.DIC<H7|
00000060  3d 3e 3b ff db 00 43 01  0a 0b 0b 0e 0d 0e 1c 10  |=>;...C.........|
00000070  10 1c 3b 28 22 28 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |..;("(;;;;;;;;;;|
00000080  3b 3b 3b 3b 3b 3b 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |;;;;;;;;;;;;;;;;|
00000090  3b 3b 3b 3b 3b 3b 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |;;;;;;;;;;;;;;;;|
000000a0  3b 3b 3b 3b 3b 3b 3b 3b  ff c2 00 11 08 04 2b 06  |;;;;;;;;......+.|
...etc
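
For reference, the well-formed value above was decodable along these lines (a sketch, assuming the raw value bytes have been saved to a file named entry.bin; rgw_bucket_dir_entry is the bucket index entry type):

ceph-dencoder type rgw_bucket_dir_entry import entry.bin decode dump_json

The binary key and the 299751-byte value were not decodable this way, which is consistent with them being foreign (JPEG) data rather than encoded index entries.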

Error with logging turned up:

/build/ceph-12.2.2/src/cls/rgw/cls_rgw.cc: In function 'void decode_list_index_key(const string&, cls_rgw_obj_key*, uint64_t*)' thread 7f0032bc5700 time 2018-04-26 20:53:57.301220
/build/ceph-12.2.2/src/cls/rgw/cls_rgw.cc: 356: FAILED assert(err.empty())
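
For context, here is a condensed sketch of the parsing path that trips this assert (simplified from the luminous cls_rgw.cc linked above, not the verbatim source; std::strtoull stands in for Ceph's strict_strtoll, and plain std types for cls_rgw_obj_key). strlen() stops at the first embedded NUL, so a binary key like the JPEG bytes above gets mistaken for a versioned-object entry, its version field fails to parse, and the assert aborts the OSD:

#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>

// Sketch of decode_list_index_key() from src/cls/rgw/cls_rgw.cc.
// Bucket index omap keys for versioned objects embed NUL separators:
//   <name>\0i<instance>\0v<decimal version>
// A key with no embedded NUL is treated as a plain object name.
static void decode_list_index_key_sketch(const std::string& index_key,
                                         std::string* name,
                                         std::string* instance,
                                         uint64_t* ver)
{
  size_t len = std::strlen(index_key.c_str());  // stops at the first NUL
  instance->clear();
  *ver = 0;

  if (len == index_key.size()) {
    *name = index_key;  // ordinary, non-versioned entry: done
    return;
  }

  // Embedded NUL: the remaining NUL-separated fields are expected to
  // look like "i<instance>" and/or "v<decimal version>".
  *name = index_key.substr(0, len);
  size_t pos = len + 1;
  while (pos < index_key.size()) {
    size_t end = index_key.find('\0', pos);
    if (end == std::string::npos)
      end = index_key.size();
    std::string field = index_key.substr(pos, end - pos);
    if (!field.empty() && field[0] == 'i') {
      *instance = field.substr(1);
    } else if (!field.empty() && field[0] == 'v') {
      // The real code calls strict_strtoll() and asserts err.empty();
      // on arbitrary binary bytes the conversion fails, the assert
      // fires, and the OSD aborts -- the crash reported here.
      char* endp = nullptr;
      *ver = std::strtoull(field.c_str() + 1, &endp, 10);
      assert(endp && *endp == '\0');  // analogue of FAILED assert(err.empty())
    }
    pos = end + 1;
  }
}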

Stack trace that we encountered:

     0> 2018-04-26 21:13:24.246445 7fe4f60b0700 -1 *** Caught signal (Aborted) **
 in thread 7fe4f60b0700 thread_name:tp_osd_tp

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa12e99) [0x5599d5d38e99]
 2: (()+0x10330) [0x7fe518451330]
 3: (gsignal()+0x37) [0x7fe517471c37]
 4: (abort()+0x148) [0x7fe517475028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x5599d5d759a0]
 6: (rgw_bucket_list(void*, ceph::buffer::list*, ceph::buffer::list*)+0x1141) [0x7fe508ca08f1]
 7: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::list&, ceph::buffer::list&)+0x24) [0x5599d5890024]
 8: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x11ab) [0x5599d59a15eb]
 9: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x8f) [0x5599d59b221f]
 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x723) [0x5599d59b2f73]
 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30f6) [0x5599d59b77a6]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe66) [0x5599d5975696]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x5599d58098b6]
 14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5599d5a70687]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xff5) [0x5599d5836b85]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5599d5d7afef]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5599d5d7cf40]
 18: (()+0x8184) [0x7fe518449184]
 19: (clone()+0x6d) [0x7fe517538ffd]

Our investigation is now focusing on leveldb, but any ideas/insights would be appreciated!


Related issues (2): 0 open, 2 closed

Copied to rgw - Backport #24630: luminous: cls_bucket_list failure causes cascading osd crashes (Resolved, Nathan Cutler)
Copied to rgw - Backport #24631: mimic: cls_bucket_list failure causes cascading osd crashes (Resolved, Nathan Cutler)
