Bug #55566
[RGW-MS][DBR]: rgw crash observed in thread_name:sync-log-trim on the secondary, when 20M objects uploaded for test-bucket-1
Status: closed
Description
rgw crashed in thread_name:sync-log-trim on the secondary while 20M objects were being uploaded to test-bucket-1.
A snippet of the backtrace:
=========================
[ceph: root@magna121 /]# ceph crash info 2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e
{
"assert_condition": "cursor",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc",
"assert_func": "virtual int PurgePeriodLogsCR::operate(const DoutPrefixProvider*)",
"assert_line": 94,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc: In function 'virtual int PurgePeriodLogsCR::operate(const DoutPrefixProvider*)' thread 7f067702c700 time 2022-05-05T13:59:21.519320+0000\n/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-10783-ge38464a1/rpm/el8/BUILD/ceph-17.0.0-10783-ge38464a1/src/rgw/rgw_trim_mdlog.cc: 94: FAILED ceph_assert(cursor)\n",
"assert_thread_name": "sync-log-trim",
"backtrace": [
"/lib64/libpthread.so.0(+0x12ce0) [0x7f06a3d7ece0]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f06a46f8452]",
"/usr/lib64/ceph/libceph-common.so.2(+0x283615) [0x7f06a46f8615]",
"(PurgePeriodLogsCR::operate(DoutPrefixProvider const*)+0xcb6) [0x7f06a7071ff6]",
"(RGWCoroutinesStack::operate(DoutPrefixProvider const*, RGWCoroutinesEnv*)+0x15c) [0x7f06a6ccf4fc]",
"(RGWCoroutinesManager::run(DoutPrefixProvider const*, std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x296) [0x7f06a6cd0336]",
"(RGWSyncLogTrimThread::process(DoutPrefixProvider const*)+0x23d) [0x7f06a6db5dad]",
"(RGWRadosThread::Worker::entry()+0x13a) [0x7f06a6d6029a]",
"/lib64/libpthread.so.0(+0x81cf) [0x7f06a3d741cf]",
"clone()"
],
"ceph_version": "17.0.0-10783-ge38464a1",
"crash_id": "2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e",
"entity_name": "client.rgw.usa.extensa033.uoxrfx",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "8",
"os_version_id": "8",
"process_name": "radosgw",
"stack_sig": "d59436d026384362dac809a48bf6b87371a937c4ca16641c4533dd4751976dc9",
"timestamp": "2022-05-05T13:59:21.524744Z",
"utsname_hostname": "extensa033",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-348.20.1.el8_5.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Tue Mar 8 12:56:54 EST 2022"
}
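For triage, the fields that matter in the report above are the failed assertion, the function and line, the thread, and the daemon. The following is only a minimal sketch of pulling those out of the JSON printed by `ceph crash info`; the helper name `summarize_crash` and the shortened `assert_file` path in the sample are illustrative assumptions, not part of Ceph.

```python
import json

# Trimmed copy of the report above. The assert_file path is shortened
# here ("...") for readability; the real report carries the full build path.
SAMPLE_REPORT = """
{
  "assert_condition": "cursor",
  "assert_file": ".../src/rgw/rgw_trim_mdlog.cc",
  "assert_func": "virtual int PurgePeriodLogsCR::operate(const DoutPrefixProvider*)",
  "assert_line": 94,
  "assert_thread_name": "sync-log-trim",
  "entity_name": "client.rgw.usa.extensa033.uoxrfx"
}
"""


def summarize_crash(report_json: str) -> str:
    """Return a one-line summary of a crash report (hypothetical helper)."""
    report = json.loads(report_json)
    # Use defaults so a report without assert_* keys still yields a summary.
    assert_file = report.get("assert_file", "")
    # Keep only the path below the source tree root, if one is present.
    src = assert_file.rsplit("/src/", 1)[-1] if assert_file else "?"
    entity = report.get("entity_name", "?")
    thread = report.get("assert_thread_name", "?")
    cond = report.get("assert_condition", "?")
    func = report.get("assert_func", "?")
    line = report.get("assert_line", "?")
    return f"{entity} [{thread}] ceph_assert({cond}) failed in {func} at {src}:{line}"


if __name__ == "__main__":
    print(summarize_crash(SAMPLE_REPORT))
```

In practice the input would be the raw output of `ceph crash info <crash_id>` rather than the embedded sample.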
1. ceph version :
ceph version 17.0.0-12013-gf373d079
2. Steps:
a. Create a bucket 'test-bucket-1' on the primary and trigger a 20M-object bi-directional workload (10M objects from each site).
3. Result:
rgw crashes observed
4. Multisite configuration used:
- Multisite Configuration = No LB for ms-sync
- Total RGW daemons per zone = 14
- sync endpoints = 4 rgws (not in client IO)
- IO via = HAProxy (client IO for 10 rgws)
- Object size = small (1-10 KB)
- Object PUT = bi-directional on same bucket
- objects per bucket = 20M (10M from each site simultaneously)
- Cluster utilization < 40% total cluster size
- IO tool = Cosbench
- Configs set on all rgws:
  rgw_data_notify_interval_msec = 0
  debug_ms = 0
  debug_rgw = 5
  debug_rgw_sync = 20
Updated by Vidushi Mishra almost 2 years ago
[ceph: root@magna121 /]# ceph crash ls
ID                                                                ENTITY                            NEW
2022-05-05T13:39:21.503250Z_a70de8c0-5065-48a8-9611-fa3a14fb766d  client.rgw.usa.extensa031.ntwajy  *
2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e  client.rgw.usa.extensa033.uoxrfx  *
[ceph: root@magna121 /]#
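The listing above can also be read programmatically, for example to see which daemons crashed and how far apart the crashes were. This is a rough sketch that assumes the whitespace-separated layout shown in this ticket (ID, ENTITY, optional `*` in the NEW column) and that each crash ID begins with a UTC timestamp followed by `_`; `parse_crash_ls` is a hypothetical helper name.

```python
from datetime import datetime

# Copy of the `ceph crash ls` output pasted in this ticket.
SAMPLE_LS = """\
ID ENTITY NEW
2022-05-05T13:39:21.503250Z_a70de8c0-5065-48a8-9611-fa3a14fb766d client.rgw.usa.extensa031.ntwajy *
2022-05-05T13:59:21.524744Z_56f4546b-7d47-4bc2-95a5-7d76dd2d081e client.rgw.usa.extensa033.uoxrfx *
"""


def parse_crash_ls(text: str) -> list:
    """Parse crash listing rows into dicts (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and shell prompt lines.
        if len(parts) < 2 or parts[0] == "ID" or line.startswith("["):
            continue
        # Assumption: the crash ID is "<UTC timestamp>_<uuid>".
        stamp = parts[0].split("_", 1)[0]
        rows.append({
            "id": parts[0],
            "entity": parts[1],
            "new": len(parts) > 2 and parts[2] == "*",
            "time": datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ"),
        })
    return rows


if __name__ == "__main__":
    crashes = parse_crash_ls(SAMPLE_LS)
    for crash in crashes:
        print(crash["time"], crash["entity"], "new" if crash["new"] else "")
    gap = crashes[1]["time"] - crashes[0]["time"]
    print("seconds between crashes:", gap.total_seconds())
```

For the two entries in this ticket, the sketch shows crashes on two different rgw daemons (extensa031 and extensa033) about 20 minutes apart.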
Updated by Mark Kogan almost 2 years ago
Duplicate of (and fixed in) https://tracker.ceph.com/issues/40341 -- multisite: failed assert(cursor) in mdlog trimming