Bug #55131

open

radosgw crashes at RGWIndexCompletionManager::create_completion

Added by Yuanguo Huo about 2 years ago. Updated over 1 year ago.

Status: Pending Backport
Priority: Normal
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: quincy,pacific,octopus
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: upgrade/mimic-p2p
Pull request ID:
Crash signature (v1): Mutex::lock
Crash signature (v2): RGWIndexCompletionManager::create_completion


Description

I have a cluster with several radosgw instances. After running for some time (about 2 months), the radosgw instances crash one by one, and the stack trace looks like this:

"(()+0xf100) [0x7fef0a7e9100]",
"(Mutex::lock(bool)+0x9) [0x7fef0dbf0929]",
"(RGWIndexCompletionManager::create_completion(rgw_obj const&, RGWModifyOp, std::string&, rgw_bucket_entry_ver&, cls_rgw_obj_key const&, rgw_bucket_dir_entry_meta&, std::list<cls_rgw_obj_key, std::allocator<cls_rgw_obj_key> >, bool, unsigned short, std::set<std::string, std::less<std::string>, std::allocator<std::string> >, complete_op_data**)+0x4b5) [0x5606ceaf6bd5]",
"(RGWRados::cls_obj_complete_op(RGWRados::BucketShard&, rgw_obj const&, RGWModifyOp, std::string&, long, unsigned long, rgw_bucket_dir_entry&, RGWObjCategory, std::list<cls_rgw_obj_key, std::allocator<cls_rgw_obj_key> >, unsigned short, std::set<std::string, std::less<std::string>, std::allocator<std::string> >)+0x2b6) [0x5606ceaf6f36]",
"(RGWRados::cls_obj_complete_add(RGWRados::BucketShard&, rgw_obj const&, std::string&, long, unsigned long, rgw_bucket_dir_entry&, RGWObjCategory, std::list<cls_rgw_obj_key, std::allocator<cls_rgw_obj_key> >, unsigned short, std::set<std::string, std::less<std::string>, std::allocator<std::string> >)+0x2a) [0x5606ceaf701a]",
"(RGWRados::Bucket::UpdateIndex::complete(long, unsigned long, unsigned long, unsigned long, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >&, std::string const&, std::string const&, std::string const&, ceph::buffer::v14_2_0::list*, RGWObjCategory, std::list<cls_rgw_obj_key, std::allocator<cls_rgw_obj_key> >, std::string const, bool)+0x324) [0x5606ceb073d4]",
"(RGWRados::Object::Write::_do_write_meta(unsigned long, unsigned long, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, bool, bool, void*)+0xdc7) [0x5606ceb1a237]",
"(RGWRados::Object::Write::write_meta(unsigned long, unsigned long, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&)+0x25a) [0x5606ceb1b4ba]",
"(rgw::putobj::AtomicObjectProcessor::complete(unsigned long, std::string const&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::string, ceph::buffer::v14_2_0::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::v14_2_0::list> > >&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, char const, char const*, std::string const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >, bool)+0x250) [0x5606ceae2400]",
"(RGWPutObj::execute()+0x35d8) [0x5606ceab6348]",
"(rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x915) [0x5606ce82bf05]",
"(process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)+0x1d6c) [0x5606ce82e30c]",
"(RGWCivetWebFrontend::process(mg_connection*)+0x38e) [0x5606ce77209e]",
"(()+0x3a5fee) [0x5606ce7fcfee]",
"(()+0x3a7c8f) [0x5606ce7fec8f]",
"(()+0x3a8138) [0x5606ce7ff138]",
"(()+0x7dc5) [0x7fef0a7e1dc5]",
"(clone()+0x6d) [0x7fef09cedced]"

Related issues 3 (1 open, 2 closed)

Copied to rgw - Backport #55501: octopus: radosgw crashes at RGWIndexCompletionManager::create_completion (Rejected)
Copied to rgw - Backport #55502: quincy: radosgw crashes at RGWIndexCompletionManager::create_completion (In Progress, Konstantin Shalygin)
Copied to rgw - Backport #55503: pacific: radosgw crashes at RGWIndexCompletionManager::create_completion (Rejected, J. Eric Ivancich)
Actions #1

Updated by Yuanguo Huo about 2 years ago

I am working on this issue

Actions #2

Updated by Yuanguo Huo about 2 years ago

Root Cause: cur_shard is of type std::atomic<int>, which can overflow and cause next_shard() to return a negative value. As a result, in RGWIndexCompletionManager::create_completion(), shard_id (a negative value) is out of bounds for the locks array. In other words, arbitrary memory is used as a lock, so pthread_mutex_lock() fails, which trips the Ceph assertion.
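
To illustrate the root cause, here is a minimal sketch of how a signed atomic counter can yield a negative shard index once it wraps. The struct names, constructors, and the body of next_shard() are assumptions made for illustration and are not copied from the Ceph source; only the names cur_shard, num_shards, and next_shard() are borrowed from the analysis above. The unsigned variant shows one possible way to keep the index in range and is not necessarily what the pull request does.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <cstdint>

// Hypothetical stand-in for the counter described in the analysis.
// Not the actual Ceph implementation.
struct SignedShardCounter {
  std::atomic<int> cur_shard;
  int num_shards;

  SignedShardCounter(int n, int start) : cur_shard(start), num_shards(n) {}

  // Returns counter % num_shards, then advances the counter.
  // Incrementing std::atomic<int> wraps from INT_MAX to INT_MIN, and in
  // C++ the remainder of a negative dividend is zero or negative.
  int next_shard() {
    int result = cur_shard % num_shards;
    cur_shard++;
    return result;
  }
};

// Assumed remedy for illustration: an unsigned counter wraps to 0,
// so the remainder always stays within [0, num_shards).
struct UnsignedShardCounter {
  std::atomic<std::uint32_t> cur_shard;
  std::uint32_t num_shards;

  UnsignedShardCounter(std::uint32_t n, std::uint32_t start)
      : cur_shard(start), num_shards(n) {}

  std::uint32_t next_shard() {
    std::uint32_t result = cur_shard % num_shards;
    cur_shard++;
    return result;
  }
};
```

Starting SignedShardCounter at INT_MAX with 8 shards, the third call to next_shard() returns -7, which would index before the start of an 8-element lock array. The unsigned variant started at UINT32_MAX keeps returning values between 0 and 7.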

Actions #3

Updated by Casey Bodley about 2 years ago

  • Status changed from New to In Progress
Actions #4

Updated by J. Eric Ivancich about 2 years ago

  • Assignee set to J. Eric Ivancich
Actions #5

Updated by J. Eric Ivancich about 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 45882
Actions #6

Updated by J. Eric Ivancich almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to quincy,pacific,octopus
Actions #7

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55501: octopus: radosgw crashes at RGWIndexCompletionManager::create_completion added
Actions #8

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55502: quincy: radosgw crashes at RGWIndexCompletionManager::create_completion added
Actions #9

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #55503: pacific: radosgw crashes at RGWIndexCompletionManager::create_completion added
Actions #10

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed