Bug #55566

closed

[RGW-MS][DBR]:rgw crash observed in thread_name:sync-log-trim on the secondary, when 20M objects uploaded for test-bucket-1

Added by Vidushi Mishra almost 2 years ago. Updated almost 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: multisite-reshard
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

RGW crashed in thread_name:sync-log-trim on the secondary zone while 20M objects were being uploaded to test-bucket-1.

Backtrace snippet:

[ceph: root@magna121 /]# ceph crash info 2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e
{
"assert_condition": "cursor",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc",
"assert_func": "virtual int PurgePeriodLogsCR::operate(const DoutPrefixProvider*)",
"assert_line": 94,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc: In function 'virtual int PurgePeriodLogsCR::operate(const DoutPrefixProvider*)' thread 7f067702c700 time 2022-05-05T13:59:21.519320+0000\n/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc: 94: FAILED ceph_assert(cursor)\n",
"assert_thread_name": "sync-log-trim",
"backtrace": [
"/lib64/libpthread.so.0(+0x12ce0) [0x7f06a3d7ece0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f06a46f8452]",
"/usr/lib64/ceph/libceph-common.so.2(+0x283615) [0x7f06a46f8615]",
"(PurgePeriodLogsCR::operate(DoutPrefixProvider const*)+0xcb6) [0x7f06a7071ff6]",
"(RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x15c) [0x7f06a6ccf4fc]",
"(RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x296) [0x7f06a6cd0336]",
"(RGWSyncLogTrimThread::process(DoutPrefixProvider const*)+0x23d) [0x7f06a6db5dad]",
"(RGWRadosThread::Worker::entry()+0x13a) [0x7f06a6d6029a]",
"/lib64/libpthread.so.0(+0x81cf) [0x7f06a3d741cf]",
"clone()"
],
"ceph_version": "17.0.0-10783-ge38464a1",
"crash_id": "2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e",
"entity_name": "client.rgw.usa.extensa033.uoxrfx",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "radosgw",
"stack_sig": "d59436d026384362dac809a48bf6b87371a937c4ca16641c4533dd4751976dc9",
"timestamp": "2022-05-05T13:59:21.524744Z",
"utsname_hostname": "extensa033",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-348.20.1.el8_5.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Mar 8 12:56:54 EST 2022"
}
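The fields that matter for triage in the report above are the assert condition, the source file and line, and the thread name. A minimal sketch of pulling them out of the JSON that `ceph crash info` prints is shown here. The function name `summarize_crash` and the shortened `assert_file` path in the sample are illustrative assumptions, not part of the original report.

```python
import json
import posixpath


def summarize_crash(report_json: str) -> dict:
    """Reduce a `ceph crash info` JSON document to the fields used for triage."""
    report = json.loads(report_json)
    return {
        "crash_id": report.get("crash_id"),
        "entity": report.get("entity_name"),
        "thread": report.get("assert_thread_name"),
        "assert": report.get("assert_condition"),
        # Keep only the file name; the build path prefix is noise for triage.
        "source": posixpath.basename(report.get("assert_file", "")),
        "line": report.get("assert_line"),
        "version": report.get("ceph_version"),
    }


# Sample input: a trimmed copy of the report above. The assert_file path is
# shortened here for readability (assumption for illustration only).
sample = json.dumps({
    "assert_condition": "cursor",
    "assert_file": "src/rgw/rgw_trim_mdlog.cc",
    "assert_line": 94,
    "assert_thread_name": "sync-log-trim",
    "ceph_version": "17.0.0-10783-ge38464a1",
    "crash_id": "2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e",
    "entity_name": "client.rgw.usa.extensa033.uoxrfx",
})

summary = summarize_crash(sample)
print(f"{summary['entity']}: ceph_assert({summary['assert']}) at "
      f"{summary['source']}:{summary['line']} in thread {summary['thread']}")
```

Comparing this reduced summary across crash IDs is a quick way to check whether several crashes hit the same assert.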

1. Ceph version:
ceph version 17.0.0-12013-gf373d079

2. Steps:

a. Create a bucket 'test-bucket-1' on the primary and trigger a 20M-object bi-directional workload (10M objects from each site).

3. Result:

RGW crashes are observed in the sync-log-trim thread on the secondary.

4. Multisite configuration used:

  1. Multisite configuration = no LB for ms-sync
  2. Total RGW daemons per zone = 14
  3. Sync endpoints = 4 RGWs (not serving client IO)
  4. Client IO via = HAProxy (client IO for 10 RGWs)
  5. Object size = small (1-10 KB)
  6. Object PUT = bi-directional on the same bucket
  7. Objects per bucket = 20M (10M from each site simultaneously)
  8. Cluster utilization < 40% of total cluster size
  9. IO tool = COSBench
  10. Configs set on all RGWs:
    rgw_data_notify_interval_msec=0
    debug_ms=0
    debug_rgw=5
    debug_rgw_sync=20
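
The report does not say how the options in item 10 were applied. As a sketch only, assuming they are set through the centralized config database with `ceph config set` and that `client.rgw` is the target section (both assumptions), the command lines could be generated like this. The helper name `config_set_commands` is hypothetical.

```python
# Option names and values are taken from item 10 of the configuration list.
SETTINGS = {
    "rgw_data_notify_interval_msec": "0",
    "debug_ms": "0",
    "debug_rgw": "5",
    "debug_rgw_sync": "20",
}


def config_set_commands(settings: dict, who: str = "client.rgw") -> list:
    """Build one `ceph config set <who> <option> <value>` line per option.

    `who` defaults to "client.rgw", which is an assumption; the report only
    says the options were set on all RGWs.
    """
    return [f"ceph config set {who} {name} {value}"
            for name, value in settings.items()]


commands = config_set_commands(SETTINGS)
for cmd in commands:
    print(cmd)
```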
Actions #1

Updated by Vidushi Mishra almost 2 years ago

[ceph: root@magna121 /]# ceph crash ls
ID ENTITY NEW
2022-05-05T13:39:21.503250Z_a70de8c0-5065-48a8-9611-fa3a14fb766d client.rgw.usa.extensa031.ntwajy *
2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e client.rgw.usa.extensa033.uoxrfx *
[ceph: root@magna121 /]#
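
To see which daemons crashed without reading the listing by eye, the `ceph crash ls` output above can be split into rows. This is a minimal sketch that assumes the whitespace-separated three-column layout shown in this comment (ID, ENTITY, NEW); the function name `parse_crash_ls` is hypothetical.

```python
def parse_crash_ls(text: str) -> list:
    """Parse pasted `ceph crash ls` output into a list of row dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the column header, and shell prompt lines.
        if len(parts) < 2 or parts[0] == "ID" or parts[0].startswith("["):
            continue
        rows.append({
            "id": parts[0],
            "entity": parts[1],
            # A trailing "*" marks a crash that has not been archived yet.
            "new": len(parts) > 2 and parts[2] == "*",
        })
    return rows


listing = """\
[ceph: root@magna121 /]# ceph crash ls
ID ENTITY NEW
2022-05-05T13:39:21.503250Z_a70de8c0-5065-48a8-9611-fa3a14fb766d client.rgw.usa.extensa031.ntwajy *
2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e client.rgw.usa.extensa033.uoxrfx *
[ceph: root@magna121 /]#
"""

crashes = parse_crash_ls(listing)
for crash in crashes:
    print(crash["entity"], crash["id"])
```

Both entries belong to RGW daemons in the `usa` zone, on hosts extensa031 and extensa033, about 20 minutes apart.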

Actions #2

Updated by Mark Kogan almost 2 years ago

  • Status changed from New to Duplicate
Actions #3

Updated by Mark Kogan almost 2 years ago

Duplicate of (and fixed in) https://tracker.ceph.com/issues/40341 -- multisite: failed assert(cursor) in mdlog trimming
