Bug #54130

closed

OpsLogRados::log segfaults in rgw/multisite suite

Added by Casey Bodley about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

100%

Source:
Tags:
opslog
Backport:
octopus pacific quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2022-01-29T00:31:31.693 INFO:tasks.rgw_multisite_tests:running rgw multisite tests on '/home/teuthworker/src/github.com_ceph_ceph-c_7c1eb0cd5f8e9dc658d4fb8519b7c8bccf11fbd6/qa/../src/test/rgw/rgw_multi' with args=['tests.py']
2022-01-29T00:31:31.700 INFO:rgw_multi.tests:create bucket zone=a1 name=swdzrg-1
2022-01-29T00:31:35.011 INFO:rgw_multi.tests:create bucket zone=a2 name=swdzrg-2
*** Caught signal (Segmentation fault) **
 in thread 7fa8e16e2700 thread_name:radosgw
 ceph version 17.0.0-10459-g7c1eb0cd (7c1eb0cd5f8e9dc658d4fb8519b7c8bccf11fbd6) quincy (dev)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7faa8867d3c0]
 2: (OpsLogRados::log(req_state*, rgw_log_entry&)+0x123) [0x7faa88b7d9a3]
 3: (OpsLogManifold::log(req_state*, rgw_log_entry&)+0x3e) [0x7faa88b7ab9e]
 4: (rgw_log_op(RGWREST*, req_state*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, OpsLogSink*)+0xd62) [0x7faa88b7e942]
 5: (process_request(rgw::sal::Store*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSink*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*, std::shared_ptr<RateLimiter>, int*)+0x1433) [0x7faa88b9c853]
 6: /lib/libradosgw.so.2(+0x473384) [0x7faa88afe384]
 7: /lib/libradosgw.so.2(+0x47469a) [0x7faa88aff69a]
 8: /lib/libradosgw.so.2(+0x47481c) [0x7faa88aff81c]
 9: make_fcontext()

from a recent master baseline, two rgw/multisite jobs failed this way:
http://qa-proxy.ceph.com/teuthology/yuriw-2022-01-28_16:15:58-rgw-wip-master-1.27.22-distro-default-smithi/6646932/teuthology.log
http://qa-proxy.ceph.com/teuthology/yuriw-2022-01-28_16:15:58-rgw-wip-master-1.27.22-distro-default-smithi/6646954/teuthology.log


Related issues 3 (0 open, 3 closed)

Copied to rgw - Backport #54162: quincy: OpsLogRados::log segfaults in rgw/multisite suite (Resolved)
Copied to rgw - Backport #54536: octopus: OpsLogRados::log segfaults in rgw/multisite suite (Rejected)
Copied to rgw - Backport #54537: pacific: OpsLogRados::log segfaults in rgw/multisite suite (Resolved, Cory Snyder)
Actions #1

Updated by Cory Snyder about 2 years ago

Seems to be caused by the Store being de-allocated when a realm is reloaded. OpsLogRados is not reinitialized to use the new Store created by the reload, so it keeps logging through a pointer to the freed Store. We can fix this by using the same pattern that is used for the UsageLogger: make the OpsLogRados instance a static variable within rgw_log.cc and create init/finalize methods to manage its lifecycle. The realm reloader can then call these methods to refresh the logger when it reloads.
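
A minimal sketch of the init/finalize pattern described above, assuming simplified stand-in types. FakeStore, FakeOpsLog, ops_log_init, ops_log_finalize, ops_log_write and reload are hypothetical names for illustration only; they are not the real rgw_log.cc or RGWRealmReloader interfaces. The point is only the ordering: the logger is torn down before the old store is freed and is rebuilt against the new store afterwards.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the Store that a realm reload frees and recreates.
struct FakeStore {
  std::vector<std::string> entries;
};

// Hypothetical stand-in for an ops-log sink that writes through a raw Store pointer.
class FakeOpsLog {
  FakeStore* store;
public:
  explicit FakeOpsLog(FakeStore* s) : store(s) {}
  void log(const std::string& entry) { store->entries.push_back(entry); }
};

// File-scope logger instance, managed only through init/finalize.
static std::unique_ptr<FakeOpsLog> ops_log;

void ops_log_init(FakeStore* store) {
  ops_log = std::make_unique<FakeOpsLog>(store);
}

void ops_log_finalize() {
  ops_log.reset();
}

// Returns false when no logger is active instead of touching a stale pointer.
bool ops_log_write(const std::string& entry) {
  if (!ops_log) return false;
  ops_log->log(entry);
  return true;
}

// Simulated reload: finalize the logger, free the old store,
// then re-initialize the logger against the replacement store.
std::unique_ptr<FakeStore> reload(std::unique_ptr<FakeStore> old_store) {
  ops_log_finalize();
  old_store.reset();
  auto new_store = std::make_unique<FakeStore>();
  ops_log_init(new_store.get());
  return new_store;
}
```

In this sketch the reload path is the only place that has to remember the finalize/init pair, which is the "special case" in the reloader that the pattern introduces.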

Actions #2

Updated by Casey Bodley about 2 years ago

thanks for taking a look! another option to consider is letting RGWRados own OpsLogRados and handle its init/shutdown. that way, OpsLogRados never has a dangling pointer and RGWRealmReloader doesn't need another special case
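
A rough sketch of the ownership alternative, again with hypothetical stand-in types (OwningStore and OwnedOpsLog are illustrative names, not the real RGWRados or OpsLogRados classes). Because the store constructs and destroys its own logger, the logger cannot outlive the store it writes to, and replacing the store replaces the logger with it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

class OwningStore;

// Hypothetical logger that holds a reference to the store that owns it.
class OwnedOpsLog {
  OwningStore& store;
public:
  explicit OwnedOpsLog(OwningStore& s) : store(s) {}
  void log(const std::string& entry);
};

// Hypothetical store that creates its logger on construction and
// destroys it on destruction, so the two lifetimes are tied together.
class OwningStore {
  std::vector<std::string> entries;
  std::unique_ptr<OwnedOpsLog> ops_log;  // declared last, so destroyed first
public:
  OwningStore() : ops_log(std::make_unique<OwnedOpsLog>(*this)) {}
  // Non-copyable: a copy would carry a logger bound to the wrong store.
  OwningStore(const OwningStore&) = delete;
  OwningStore& operator=(const OwningStore&) = delete;

  void append(const std::string& entry) { entries.push_back(entry); }
  OwnedOpsLog& get_ops_log() { return *ops_log; }
  std::size_t entry_count() const { return entries.size(); }
};

void OwnedOpsLog::log(const std::string& entry) { store.append(entry); }
```

With this shape, callers fetch the logger from whichever store is current instead of caching a separate pointer, so a reload needs no logger-specific step.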

Actions #3

Updated by Casey Bodley about 2 years ago

  • Status changed from New to Fix Under Review
  • Backport set to quincy
  • Pull request ID set to 44893
Actions #4

Updated by Casey Bodley about 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #5

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54162: quincy: OpsLogRados::log segfaults in rgw/multisite suite added
Actions #6

Updated by Casey Bodley about 2 years ago

  • Backport changed from quincy to octopus pacific quincy
Actions #7

Updated by Casey Bodley about 2 years ago

it looks like the OpsLogManifold stuff was backported further than quincy, and i'm seeing the crashes in pacific testing too

Actions #8

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54536: octopus: OpsLogRados::log segfaults in rgw/multisite suite added
Actions #9

Updated by Backport Bot about 2 years ago

  • Copied to Backport #54537: pacific: OpsLogRados::log segfaults in rgw/multisite suite added
Actions #10

Updated by Cory Snyder about 2 years ago

Note that I closed the Octopus backport tracker since the offending ops log changes were never backported to that release.

Actions #11

Updated by Backport Bot over 1 year ago

  • Tags changed from opslog to opslog backport_processed
Actions #12

Updated by Konstantin Shalygin over 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
  • Tags changed from opslog backport_processed to opslog