Bug #55512

closed

rgw: crash in RGWSyncLogTrimThread::process()

Added by Soumya Koduri almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Target version: -
% Done: 0%
Regression: No
Severity: 2 - major

Description

In a multisite environment, the following crash is observed when restarting radosgw servers:

(gdb) bt
#0 0x00007ffff751fa41 in RGWRESTConn::RGWRESTConn (this=0x555556c40000,
cct=<optimized out>, store=0x555556aeeb00,
_remote_id="da140565-8396-4939-befd-25bd553a646e", remote_endpoints=...,
_api_name=std::optional<std::string> = {...}, _host_style=PathStyle)
at ../src/rgw/rgw_rest_conn.cc:25
#1 0x00007ffff77eff91 in (anonymous namespace)::make_peer_connections<std::map<std::__cxx11::basic_string<char>, RGWZoneGroup> > (zonegroups=..., store=0x555556aeeb00)
at /usr/include/c++/10/bits/basic_string.h:907
#2 (anonymous namespace)::MasterTrimEnv::MasterTrimEnv (this=<optimized out>,
dpp=<optimized out>, store=0x555556aeeb00, http=<optimized out>,
num_shards=<optimized out>) at ../src/rgw/rgw_trim_mdlog.cc:231
#3 0x00007ffff77f5cd6 in MetaMasterTrimPollCR::MetaMasterTrimPollCR (interval=...,
num_shards=64, http=0x555556bb68b8, store=0x555556aeeb00, dpp=0x555556bb67c8,
this=0x555555ea0800) at ../src/rgw/rgw_trim_mdlog.cc:660
#4 create_meta_log_trim_cr (dpp=dpp@entry=0x555556bb67c8, store=0x555556aeeb00,
http=http@entry=0x555556bb68b8, num_shards=<optimized out>, interval=...)
at ../src/rgw/rgw_trim_mdlog.cc:723
#5 0x00007ffff7503e68 in RGWSyncLogTrimThread::process (this=0x555556bb6780,
dpp=0x555555e92be8) at ../src/common/config_proxy.h:120
#6 0x00007ffff74bc558 in RGWRadosThread::Worker::entry (this=0x555555e92bb0)
at ../src/rgw/rgw_rados.cc:355
#7 0x00007ffff572a432 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff52709d3 in clone () from /lib64/libc.so.6
(gdb) p (rgw::sal::RadosStore*)store
$3 = (rgw::sal::RadosStore *) 0x555556aeeb00
(gdb) p *$
$4 = {<rgw::sal::Store> = {
_vptr.Store = 0x7ffff7f6e870 <vtable for rgw::sal::RadosStore+16>},
rados = 0x555555dd6400, user_ctl = 0x0, luarocks_path = "",
zone = std::unique_ptr<rgw::sal::RadosZone> = {get() = 0x0}}
(gdb)

This appears to be a regression introduced by https://github.com/ceph/ceph/pull/45623/.
#1

Updated by Soumya Koduri almost 2 years ago

RGWSyncLogTrimThread, which is created and started as a worker thread as part of RGWRados::initialize() -> init_complete(), accesses store->get_zone():

if ((*rados).set_use_cache(use_cache)
            .set_use_datacache(false)
            .set_use_gc(use_gc)
            .set_run_gc_thread(use_gc_thread)
            .set_run_lc_thread(use_lc_thread)
            .set_run_quota_threads(quota_threads)
            .set_run_sync_thread(run_sync_thread)
            .set_run_reshard_thread(run_reshard_thread)
            .initialize(cct, dpp) < 0) {  // init_complete() starts RGWSyncLogTrimThread here
  delete store;
  return nullptr;
}
// store->zone is only set by this call, after the worker threads are already running
if (store->initialize(cct, dpp) < 0) {
  delete store;
  return nullptr;
}

^^^
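However, store->zone is only initialized in the later step, store->initialize(), so the trim thread can run while the zone is still null (the gdb dump of the store shows zone = {get() = 0x0}). The ordering of these two calls may need to be changed.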

#2

Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Resolved
  • Pull request ID set to 46115