Bug #63378

open

rgw/multisite: Segmentation fault during full sync

Added by Shilpa MJ 6 months ago. Updated 21 days ago.

Status:
New
Priority:
Urgent
Assignee:
Target version:
-
% Done:
0%

Source:
Tags:
multisite
Backport:
reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/smanjara-2023-10-30_20:18:36-rgw:multisite-wip-shilpa-rgw-test-multisite-distro-default-smithi/7441423/teuthology.log

2023-10-30T22:07:41.493+0000 7f4899a5a640 20 rgw rados thread: cr:s=0x55e0b6295900:op=0x55e0b6478000:28RGWDataFullSyncSingleEntryCR: operate()
2023-10-30T22:07:41.494+0000 7f4899a5a640 -1 *** Caught signal (Segmentation fault) **
in thread 7f4899a5a640 thread_name:data-sync

ceph version 18.0.0-6880-g8b1cc681 (8b1cc681d09f809ade48e839fde79ae1b6bd1850) reef (dev)
1: /lib64/libc.so.6(+0x54db0) [0x7f48c2454db0]
2: radosgw(+0xc8a07d) [0x55e0aefe807d]
3: radosgw(+0x38ad82) [0x55e0ae6e8d82]
4: radosgw(+0x836fa9) [0x55e0aeb94fa9]
5: radosgw(+0x9d14c7) [0x55e0aed2f4c7]
6: (RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x125) [0x55e0ae90f405]
7: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x2b6) [0x55e0ae910c76]
8: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0xad) [0x55e0ae911c2d]
9: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x4dc) [0x55e0aed3c02c]
10: radosgw(+0x781f08) [0x55e0aeadff08]
11: (RGWRadosThread::Worker::entry()+0xb3) [0x55e0aeae2413]
12: /lib64/libc.so.6(+0x9f802) [0x7f48c249f802]
13: /lib64/libc.so.6(+0x3f450) [0x7f48c243f450]
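
The radosgw(+0x...) frames above are not symbolized. Below is a minimal symbolization sketch, assuming a radosgw binary with debug symbols for the same 18.0.0-6880-g8b1cc681 build is available locally; the binary path is an assumption:

import subprocess

# Hedged helper, not part of this report: resolve the raw radosgw(+0xOFFSET)
# frames via binutils addr2line (-C demangles, -f prints the function name).
# Assumes debug symbols for the exact build are available to addr2line.
binary = "/usr/bin/radosgw"  # assumption: point this at the crashing build
offsets = ["0xc8a07d", "0x38ad82", "0x836fa9", "0x9d14c7", "0x781f08"]  # frames 2-5, 10

for off in offsets:
    out = subprocess.run(["addr2line", "-C", "-f", "-e", binary, off],
                         capture_output=True, text=True, check=True)
    func, loc = out.stdout.strip().splitlines()[:2]
    print(f"{off}: {func} at {loc}")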
Actions #1

Updated by Casey Bodley 6 months ago

  • Priority changed from Normal to Urgent
  • Tags set to multisite
Actions #2

Updated by Casey Bodley 6 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 54278
Actions #3

Updated by Steven Goodliff 6 months ago

Hi,

I think I see the same on our test 18.2.0 dev cluster. If there is any info you need, let us know.

*** Caught signal (Segmentation fault) **
 in thread 7faf74ace700 thread_name:data-sync

 ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fafcb3e9cf0]
 2: (RGWCoroutinesStack::_schedule()+0xe) [0x55c2ce5782ae]
 3: (RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0xdc5) [0x55c2ce57afd5]
 4: (RGWCoroutinesManager::run(DoutPrefixProvider const*, RGWCoroutine*)+0x91) [0x55c2ce57b721]
 5: (RGWRemoteDataLog::run_sync(DoutPrefixProvider const*, int)+0x1e2) [0x55c2ceaff352]
 6: (RGWDataSyncProcessorThread::process(DoutPrefixProvider const*)+0x58) [0x55c2ce846d18]
 7: (RGWRadosThread::Worker::entry()+0xb3) [0x55c2ce80e003]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7fafcb3df1ca]
 9: clone()
Actions #4

Updated by Casey Bodley 6 months ago

  • Backport set to reef
Actions #5

Updated by Steven Goodliff 6 months ago

Hi,

Is this likely to get into the 18.2.1 release? https://tracker.ceph.com/versions/675

Actions #6

Updated by Casey Bodley 5 months ago

  • Status changed from Fix Under Review to In Progress
  • Pull request ID deleted (54278)
Actions #7

Updated by Shilpa MJ 4 months ago

This crash seems to come from the 'data_sync_init' test cases in
/ceph/qa/tasks/rgw_multi/tests.py.

The crash doesn't reproduce locally, but it reproduces consistently in teuthology runs.

Actions #8

Updated by Shilpa MJ 3 months ago

The crash reproduces only in 3-zone or two-zonegroup configurations.
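
For local reproduction attempts, a hedged sketch of driving the in-tree multisite harness (src/test/rgw/test_multi.py, which wraps qa/tasks/rgw_multi/tests.py) with a 3-zone configuration, filtered to the data_sync_init cases; the config field names and the pytest invocation are assumptions, not confirmed by this tracker (check src/test/rgw/test_multi.md in the checkout):

import subprocess
import textwrap

# Hedged repro sketch; harness details are assumptions. Writes a 3-zone
# test_multi.conf and runs only test names matching 'data_sync_init'.
conf = textwrap.dedent("""\
    [DEFAULT]
    num_zonegroups = 1
    num_zones = 3
    gateways_per_zone = 1
""")
with open("test_multi.conf", "w") as f:  # read by test_multi.py (assumption)
    f.write(conf)

# -k is pytest's keyword filter; whether test_multi.py runs under pytest is an
# assumption -- the suite has historically been driven with nose as well.
subprocess.run(["pytest", "-k", "data_sync_init", "test_multi.py"], check=True)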

Actions #9

Updated by Casey Bodley 21 days ago

  • Status changed from In Progress to New