Bug #24117

cls_bucket_list fails causes cascading osd crashes

Added by Nick Janus 7 months ago. Updated 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version: -
Start date: 05/08/2018
Due date:
% Done: 0%
Source: Community (user)
Tags:
Backport: mimic, luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rgw
Pull request ID:

Description

A cls_bucket_list operation on one bucket in our cluster caused a cascading failure: each OSD would attempt the operation and then crash. The crash was due to an invalid string-to-long-long conversion while parsing an omap key (https://github.com/ceph/ceph/blob/luminous/src/cls/rgw/cls_rgw.cc#L325-L359). Looking at the bucket's omap entries, we found a number of binary strings that don't adhere to the formats expected by the rgw code. The dump of the keys and values exceeds the attachment limit, but here's a sample (the bucket has only one index object and one s3 object):

myfilename
value (213 bytes) :
00000000  08 03 cf 00 00 00 0a 00  00 00 6d 79 66 69 6c 65  |..........myfile|
00000010  6e 61 6d 65 ab 30 01 00  00 00 00 00 01 05 03 59  |name.0.........Y|
...etc - we are able to decode the entry using ceph-dencoder, and it corresponds to the single object in the bucket

key (299638 bytes):
00000000  ff d8 ff e0 00 10 4a 46  49 46 00 01 01 01 00 48  |......JFIF.....H|
00000010  00 48 00 00 ff db 00 43  00 0a 07 07 08 07 06 0a  |.H.....C........|
00000020  08 08 08 0b 0a 0a 0b 0e  18 10 0e 0d 0d 0e 1d 15  |................|
00000030  16 11 18 23 1f 25 24 22  1f 22 21 26 2b 37 2f 26  |...#.%$"."!&+7/&|
00000040  29 34 29 21 22 30 41 31  34 39 3b 3e 3e 3e 25 2e  |)4)!"0A149;>>>%.|
...etc

value (299751 bytes) :
00000000  08 03 e1 92 04 00 76 92  04 00 ff d8 ff e0 00 10  |......v.........|
00000010  4a 46 49 46 00 01 01 01  00 48 00 48 00 00 ff db  |JFIF.....H.H....|
00000020  00 43 00 0a 07 07 08 07  06 0a 08 08 08 0b 0a 0a  |.C..............|
00000030  0b 0e 18 10 0e 0d 0d 0e  1d 15 16 11 18 23 1f 25  |.............#.%|
00000040  24 22 1f 22 21 26 2b 37  2f 26 29 34 29 21 22 30  |$"."!&+7/&)4)!"0|
00000050  41 31 34 39 3b 3e 3e 3e  25 2e 44 49 43 3c 48 37  |A149;>>>%.DIC<H7|
00000060  3d 3e 3b ff db 00 43 01  0a 0b 0b 0e 0d 0e 1c 10  |=>;...C.........|
00000070  10 1c 3b 28 22 28 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |..;("(;;;;;;;;;;|
00000080  3b 3b 3b 3b 3b 3b 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |;;;;;;;;;;;;;;;;|
00000090  3b 3b 3b 3b 3b 3b 3b 3b  3b 3b 3b 3b 3b 3b 3b 3b  |;;;;;;;;;;;;;;;;|
000000a0  3b 3b 3b 3b 3b 3b 3b 3b  ff c2 00 11 08 04 2b 06  |;;;;;;;;......+.|
...etc
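
One observation from the dumps above, offered as a hedged reading rather than a confirmed diagnosis: in the valid entry, the four bytes at offset 6 of the value (0a 00 00 00) read as little-endian 10, the length of "myfilename". In the oversized entry the same offset holds 76 92 04 00, which reads as 299638, the size of the binary key, and both the key and the embedded name start with ff d8 ff, the byte pattern that opens a JPEG file. The sketch below shows how such a check could be scripted over dumped entries. The helper names and the assumption that offset 6 holds a length prefix are illustrative and are not taken from the rgw code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: read a little-endian 32-bit value at byte offset `off`.
// Returns false if the buffer is too short.
static bool read_le32(const std::string& buf, std::size_t off, std::uint32_t* out) {
  if (buf.size() < 4 || off > buf.size() - 4) {
    return false;
  }
  std::uint32_t v = 0;
  for (int i = 3; i >= 0; --i) {
    v = (v << 8) | static_cast<unsigned char>(buf[off + i]);
  }
  *out = v;
  return true;
}

// Hypothetical helper: true if the buffer begins with the JPEG
// start-of-image marker (ff d8) followed by the first byte of another marker (ff).
static bool starts_with_jpeg_soi(const std::string& buf) {
  return buf.size() >= 3 &&
         static_cast<unsigned char>(buf[0]) == 0xff &&
         static_cast<unsigned char>(buf[1]) == 0xd8 &&
         static_cast<unsigned char>(buf[2]) == 0xff;
}

// Assumption for illustration only: the name length sits at offset 6 of the
// encoded value, as the two samples in this report suggest.
static bool value_name_len_matches_key(const std::string& key, const std::string& value) {
  std::uint32_t len = 0;
  return read_le32(value, 6, &len) && len == key.size();
}
```

If the length check holds for the oversized entry, that would suggest the entry was encoded normally with the binary data as its name, rather than being damaged after it was written.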

Error with logging turned up:

12.2.2/src/cls/rgw/cls_rgw.cc: In function 'void decode_list_index_key(const string&, cls_rgw_obj_key*, uint64_t*)' thread 7f0032bc5700 time 2018-04-26 20:53:57.301220
/build/ceph-12.2.2/src/cls/rgw/cls_rgw.cc: 356: FAILED assert(err.empty())

Stack trace that we encountered:

     0> 2018-04-26 21:13:24.246445 7fe4f60b0700 -1 *** Caught signal (Aborted) **
 in thread 7fe4f60b0700 thread_name:tp_osd_tp

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0xa12e99) [0x5599d5d38e99]
 2: (()+0x10330) [0x7fe518451330]
 3: (gsignal()+0x37) [0x7fe517471c37]
 4: (abort()+0x148) [0x7fe517475028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x280) [0x5599d5d759a0]
 6: (rgw_bucket_list(void*, ceph::buffer::list*, ceph::buffer::list*)+0x1141) [0x7fe508ca08f1]
 7: (ClassHandler::ClassMethod::exec(void*, ceph::buffer::list&, ceph::buffer::list&)+0x24) [0x5599d5890024]
 8: (PrimaryLogPG::do_osd_ops(PrimaryLogPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x11ab) [0x5599d59a15eb]
 9: (PrimaryLogPG::prepare_transaction(PrimaryLogPG::OpContext*)+0x8f) [0x5599d59b221f]
 10: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x723) [0x5599d59b2f73]
 11: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x30f6) [0x5599d59b77a6]
 12: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xe66) [0x5599d5975696]
 13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3e6) [0x5599d58098b6]
 14: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x47) [0x5599d5a70687]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xff5) [0x5599d5836b85]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x83f) [0x5599d5d7afef]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5599d5d7cf40]
 18: (()+0x8184) [0x7fe518449184]
 19: (clone()+0x6d) [0x7fe517538ffd]

Our investigation is now focusing on leveldb, but any ideas/insights would be appreciated!


Related issues

Copied to rgw - Backport #24630: luminous: cls_bucket_list fails causes cascading osd crashes Resolved
Copied to rgw - Backport #24631: mimic: cls_bucket_list fails causes cascading osd crashes Resolved

History

#1 Updated by Greg Farnum 7 months ago

  • Project changed from Ceph to rgw

#2 Updated by Orit Wasserman 7 months ago

  • Priority changed from Normal to High

#3 Updated by Orit Wasserman 7 months ago

  • Assignee set to Orit Wasserman

#4 Updated by Nick Janus 7 months ago

Since we got tripped up by this a couple more times, we ended up deleting the key. This seems to have restored access to the bucket and prevented further instability. Unfortunately, we're not really any closer to figuring out how the key was written/corrupted in the first place.

I'm very new to the Ceph code base, but let me know if I can contribute towards improving the error handling and logging here. It would be lovely if the list_bucket op simply failed and logged the error instead of crashing on an assert.

#5 Updated by Yehuda Sadeh 6 months ago

  • Assignee changed from Orit Wasserman to Yehuda Sadeh

Need to remove assertions from the objclass code.
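
As a rough illustration of that direction (a standalone sketch, not the code from the eventual fix), the key decoder could return an error code instead of asserting when the version field fails to parse. Everything below is a simplified stand-in: `strict_parse_ll`, `ObjKey`, and `decode_list_index_key_checked` are hypothetical names, the key layout (name, then NUL-separated fields prefixed with 'i' or 'v') is assumed for the example, `-EINVAL` is an arbitrary choice of error code, and `std::cerr` stands in for whatever logging the objclass code would use.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical strict parser: the whole string must be a base-10 integer.
// Problems are reported through *err rather than by asserting.
static long long strict_parse_ll(const std::string& s, std::string* err) {
  err->clear();
  if (s.empty()) {
    *err = "empty string";
    return 0;
  }
  errno = 0;
  char* end = nullptr;
  long long v = std::strtoll(s.c_str(), &end, 10);
  if (errno == ERANGE) {
    *err = "value out of range";
    return 0;
  }
  // Reject trailing garbage, including embedded NUL bytes.
  if (end == s.c_str() || static_cast<std::size_t>(end - s.c_str()) != s.size()) {
    *err = "not a valid integer";
    return 0;
  }
  return v;
}

struct ObjKey {
  std::string name;
  std::string instance;
};

// Returns 0 on success, -EINVAL if the key does not match the assumed layout.
static int decode_list_index_key_checked(const std::string& index_key,
                                         ObjKey* key, std::uint64_t* ver) {
  key->name.clear();
  key->instance.clear();
  if (index_key.empty()) {
    return 0;
  }
  std::size_t pos = index_key.find('\0');
  if (pos == std::string::npos) {
    return -EINVAL;  // no fields after the name
  }
  key->name = index_key.substr(0, pos);
  while (pos != std::string::npos) {
    const std::size_t start = pos + 1;
    const std::size_t next = index_key.find('\0', start);
    const std::string field = index_key.substr(
        start, next == std::string::npos ? std::string::npos : next - start);
    pos = next;
    if (field.empty()) {
      continue;
    }
    if (field[0] == 'i') {
      key->instance = field.substr(1);
    } else if (field[0] == 'v') {
      std::string err;
      const long long v = strict_parse_ll(field.substr(1), &err);
      if (!err.empty()) {
        // Log and fail the operation instead of aborting the process.
        std::cerr << "bad version field in index key: " << err << std::endl;
        return -EINVAL;
      }
      *ver = static_cast<std::uint64_t>(v);
    }
  }
  return 0;
}
```

With this shape, a caller such as the bucket listing op could propagate the negative return value to the client, so a single malformed key fails one request rather than taking down every OSD that tries to serve it.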

#6 Updated by Yehuda Sadeh 6 months ago

  • Status changed from New to Triaged

#7 Updated by Yehuda Sadeh 6 months ago

  • Backport set to mimic, luminous

This should fix the crash issue:
https://github.com/ceph/ceph/pull/22440

#8 Updated by Yehuda Sadeh 6 months ago

  • Status changed from Triaged to Need Review

#9 Updated by Yehuda Sadeh 6 months ago

  • Status changed from Need Review to Testing

#10 Updated by Casey Bodley 6 months ago

  • Status changed from Testing to Pending Backport

#11 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #24630: luminous: cls_bucket_list fails causes cascading osd crashes added

#12 Updated by Nathan Cutler 6 months ago

  • Copied to Backport #24631: mimic: cls_bucket_list fails causes cascading osd crashes added

#13 Updated by Nathan Cutler 2 months ago

  • Status changed from Pending Backport to Resolved
