Bug #49666

open

RGW crash due to PerfCounters::inc assert_condition during multisite syncing

Added by Li Mingqiang about 3 years ago. Updated 8 months ago.

Status: Pending Backport
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific quincy
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):

0bc60ae023b522915e327c8b597473a08dcebacd4919ab95d324734af6beb5f9
1a9a29cab818bf3ddb73afdd5fd0e12722532f58e2c30b6cd41009ee5dff8bd8
2bf1b3e02038e06d50abb448410d2c59001d10861a18e5c7cf1f3e8c1926b924
522618d0d09f6b8be5a4359dc5a3fd1a6a0fdc91222dbfafaa0fb64fbb451f4d
71c45779d7a35eb1c64c0b0fc55117d7dfe56010108a4e4558caa8b1fb50b130
7d6ca6057edf55e9e3dea0fd7cdcd6e4f11f13c4a5d00a883206d07a1e5fdae0
96fa452c3a27b0d721d4bcb9ea8bcde48f991b6458114a70aa5f815230a8c5b4
38def02c08847ca40126dcb976325e4ac3f145ce853aba51ac5f9fc21fc3ed23
fe60b48bad2cba6f3a9fa97c51ff29e211121819027b6d56b542ce49db14d06c


Description

ceph crash info 2021-03-04T07:48:01.822498Z_df599769-9947-476c-8ece-11f450d8c09f

{
"assert_condition": "idx > m_lower_bound",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.9/rpm/el8/BUILD/ceph-15.2.9/src/common/perf_counters.cc",
"assert_func": "void ceph::common::PerfCounters::inc(int, uint64_t)",
"assert_line": 164,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.9/rpm/el8/BUILD/ceph-15.2.9/src/common/perf_counters.cc: In function 'void ceph::common::PerfCounters::inc(int, uint64_t)' thread 7f77ad275700 time 2021-03-04T07:48:01.817568+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.9/rpm/el8/BUILD/ceph-15.2.9/src/common/perf_counters.cc: 164: FAILED ceph_assert(idx > m_lower_bound)\n",
"assert_thread_name": "rados_async",
"backtrace": [
"(()+0x12b20) [0x7f77c9290b20]",
"(gsignal()+0x10f) [0x7f77c78d57ff]",
"(abort()+0x127) [0x7f77c78bfc35]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f77c9e4382b]",
"(()+0x27a9f4) [0x7f77c9e439f4]",
"(()+0x465c3f) [0x7f77ca02ec3f]",
"(RGWAsyncFetchRemoteObj::_send_request()+0x3bc) [0x7f77d42b09cc]",
"(RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x24) [0x7f77d42ab114]",
"(RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0x11) [0x7f77d42b2a31]",
"(ThreadPool::worker(ThreadPool::WorkThread*)+0xe64) [0x7f77c9f30004]",
"(ThreadPool::WorkThread::entry()+0x15) [0x7f77c9f30865]",
"(()+0x814a) [0x7f77c928614a]",
"(clone()+0x43) [0x7f77c799af23]"
],
"ceph_version": "15.2.9",
"crash_id": "2021-03-04T07:48:01.822498Z_df599769-9947-476c-8ece-11f450d8c09f",
"entity_name": "client.rgw.realm_test.zone_first.ov-dapobject-02-3.jsjcum",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "radosgw",
"stack_sig": "0bc60ae023b522915e327c8b597473a08dcebacd4919ab95d324734af6beb5f9",
"timestamp": "2021-03-04T07:48:01.822498Z",
"utsname_hostname": "ov-dapobject-02-3",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-193.6.3.el8_2.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Jun 10 11:09:32 UTC 2020"
}


Files

radosgw_crashes.dump (14.8 KB) Christian Rohmann, 10/11/2021 08:30 AM

Related issues 4 (1 open, 3 closed)

Has duplicate rgw - Bug #56832: crash: ceph::common::PerfCounters::inc(int, unsigned long) (Resolved)
Has duplicate rgw - Bug #51919: crash: ceph::common::PerfCounters::inc(int, unsigned long) (in RGWAsyncFetchRemoteObj::_send_request()) (Duplicate)
Copied to rgw - Backport #57635: pacific: RGW crash due to PerfCounters::inc assert_condition during multisite syncing (Resolved, Konstantin Shalygin)
Copied to rgw - Backport #57636: quincy: RGW crash due to PerfCounters::inc assert_condition during multisite syncing (In Progress, Konstantin Shalygin)
Actions #1

Updated by Telemetry Bot almost 3 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.13, v15.2.8 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=96fa452c3a27b0d721d4bcb9ea8bcde48f991b6458114a70aa5f815230a8c5b4

Assert condition: idx > m_lower_bound
Assert function: void ceph::common::PerfCounters::inc(int, uint64_t)
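
For readers unfamiliar with the check that fires here, the following is a minimal sketch of an index-range assertion of the same shape as `idx > m_lower_bound`. It is not the Ceph implementation: the class `MiniCounters` and its members are hypothetical names made up for illustration, and the sketch assumes that valid counter indices lie strictly between a lower and an upper bound.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a perf-counter set (not Ceph's class).
// Assumption: valid indices satisfy lower_bound < idx < upper_bound.
class MiniCounters {
public:
  MiniCounters(int lower_bound, int upper_bound)
    : lower_(lower_bound),
      upper_(upper_bound),
      data_(static_cast<std::size_t>(upper_bound - lower_bound - 1),
            std::uint64_t{0}) {}

  // True when idx refers to a counter owned by this set.
  bool in_range(int idx) const { return idx > lower_ && idx < upper_; }

  // Increment counter idx by amt; aborts if idx is out of range.
  void inc(int idx, std::uint64_t amt = 1) {
    assert(idx > lower_);  // same shape as the condition reported in this crash
    assert(idx < upper_);
    data_[static_cast<std::size_t>(idx - lower_ - 1)] += amt;
  }

  // Read counter idx; aborts if idx is out of range.
  std::uint64_t get(int idx) const {
    assert(in_range(idx));
    return data_[static_cast<std::size_t>(idx - lower_ - 1)];
  }

private:
  int lower_;
  int upper_;
  std::vector<std::uint64_t> data_;
};
```

With `MiniCounters c(100, 104)`, indices 101 through 103 are valid. A call such as `c.inc(100)` would trip the first assertion, which is the kind of failure the crash reports above describe: the index handed to `inc` does not belong to the counter set it was called on.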

Sanitized backtrace:

    RGWAsyncFetchRemoteObj::_send_request()
    RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)
    RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)
    ThreadPool::worker(ThreadPool::WorkThread*)
    ThreadPool::WorkThread::entry()
    clone()
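
When comparing crash dumps from different hosts by hand, it can help to reduce raw frames to bare symbols, since offsets and addresses differ between builds. The sketch below is one way to do that, assuming frames have the form `(SYMBOL+0xOFFSET) [0xADDRESS]` as in the dumps on this page. The function names `frame_symbol` and `symbols_only` are hypothetical, and this is not the telemetry bot's logic: it only strips offsets, addresses, and unsymbolized frames, and does not drop the `gsignal`, `abort`, or assert-handler frames that are absent from the sanitized backtrace above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Reduce a raw frame such as "(clone()+0x43) [0x7f77c799af23]" to "clone()".
// Returns an empty string for frames that do not match the assumed format
// or that carry no symbol, e.g. "(()+0x12b20) [0x7f77c9290b20]".
std::string frame_symbol(const std::string& frame) {
  if (frame.empty() || frame.front() != '(') {
    return "";
  }
  // The offset is introduced by the last "+0x" in the frame.
  const std::size_t plus = frame.rfind("+0x");
  if (plus == std::string::npos) {
    return "";
  }
  std::string symbol = frame.substr(1, plus - 1);
  if (symbol == "()") {
    return "";  // unsymbolized frame
  }
  return symbol;
}

// Keep only the frames that resolve to a symbol, in their original order.
std::vector<std::string> symbols_only(const std::vector<std::string>& frames) {
  std::vector<std::string> out;
  for (const auto& f : frames) {
    std::string s = frame_symbol(f);
    if (!s.empty()) {
      out.push_back(s);
    }
  }
  return out;
}
```

Two dumps that yield the same symbol list after this reduction are likely the same crash path, even when their `stack_sig` values differ.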

Crash dump sample:
{
    "assert_condition": "idx > m_lower_bound",
    "assert_file": "common/perf_counters.cc",
    "assert_func": "void ceph::common::PerfCounters::inc(int, uint64_t)",
    "assert_line": 164,
    "assert_msg": "common/perf_counters.cc: In function 'void ceph::common::PerfCounters::inc(int, uint64_t)' thread 7fdc2ca57700 time 2021-07-02T12:44:41.725322+0200\ncommon/perf_counters.cc: 164: FAILED ceph_assert(idx > m_lower_bound)",
    "assert_thread_name": "rados_async",
    "backtrace": [
        "(()+0x12b30) [0x7fdc44025b30]",
        "(gsignal()+0x10f) [0x7fdc4266337f]",
        "(abort()+0x127) [0x7fdc4264ddb5]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7fdc44bd8d61]",
        "(()+0x27af2a) [0x7fdc44bd8f2a]",
        "(()+0x46724f) [0x7fdc44dc524f]",
        "(RGWAsyncFetchRemoteObj::_send_request()+0x3bc) [0x7fdc4f04859c]",
        "(RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x24) [0x7fdc4f042a74]",
        "(RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0x11) [0x7fdc4f04a5a1]",
        "(ThreadPool::worker(ThreadPool::WorkThread*)+0xe64) [0x7fdc44cc5d14]",
        "(ThreadPool::WorkThread::entry()+0x15) [0x7fdc44cc6575]",
        "(()+0x815a) [0x7fdc4401b15a]",
        "(clone()+0x43) [0x7fdc42728dd3]" 
    ],
    "ceph_version": "15.2.13",
    "crash_id": "2021-07-02T10:44:41.729757Z_3f04b2b4-5234-4d26-bb53-87bc28ad73ae",
    "entity_name": "client.9f5ba328f57e893aca80108d5e05c226d0071626",
    "os_id": "ol",
    "os_name": "Oracle Linux Server",
    "os_version": "8.4",
    "os_version_id": "8.4",
    "process_name": "radosgw",
    "stack_sig": "1a9a29cab818bf3ddb73afdd5fd0e12722532f58e2c30b6cd41009ee5dff8bd8",
    "timestamp": "2021-07-02T10:44:41.729757Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.17-2102.202.5.el8uek.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#2 SMP Sat May 22 16:16:03 PDT 2021" 
}

Actions #2

Updated by Christian Rohmann over 2 years ago

While setting up multisite on a formerly single-site RADOSGW cluster, we observed multiple RADOSGW crashes as well.
See the attached dumps of those crashes.

Actions #3

Updated by Christian Rohmann over 2 years ago

The issue appeared again around the time the machine was rebooted:

# ceph crash info 2022-02-01T08:29:35.173777Z_1d4fc1eb-9f33-416f-a36d-1d335baaff27

{
    "backtrace": [
        "(()+0x46210) [0x7f3522309210]",
        "(ceph::common::PerfCounters::inc(int, unsigned long)+0x7) [0x7f351972c8b7]",
        "(RGWAsyncFetchRemoteObj::_send_request()+0x574) [0x7f3522cd6724]",
        "(RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x25) [0x7f3522cd0865]",
        "(RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0x11) [0x7f3522cd8b61]",
        "(ThreadPool::worker(ThreadPool::WorkThread*)+0x5bb) [0x7f351961d1bb]",
        "(ThreadPool::WorkThread::entry()+0x15) [0x7f351961e285]",
        "(()+0x9609) [0x7f3519179609]",
        "(clone()+0x43) [0x7f35223e5293]" 
    ],
    "ceph_version": "15.2.15",
    "crash_id": "2022-02-01T08:29:35.173777Z_1d4fc1eb-9f33-416f-a36d-1d335baaff27",
    "entity_name": "client.rgw.redacted",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.3 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "radosgw",
    "stack_sig": "2bf1b3e02038e06d50abb448410d2c59001d10861a18e5c7cf1f3e8c1926b924",
    "timestamp": "2022-02-01T08:29:35.173777Z",
    "utsname_hostname": "REDACTED",
    "utsname_machine": "x86_64",
    "utsname_release": "5.13.0-28-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022" 
}

Actions #4

Updated by Christian Rohmann over 2 years ago

Christian Rohmann wrote:

The issue appeared again around the time the machine was rebooted

[...]

Most likely while the RADOSGW was being stopped.

Actions #5

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v15.2.15 added
Actions #6

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #7

Updated by Casey Bodley over 1 year ago

  • Has duplicate Bug #56832: crash: ceph::common::PerfCounters::inc(int, unsigned long) added
Actions #8

Updated by Casey Bodley over 1 year ago

  • Has duplicate Bug #51919: crash: ceph::common::PerfCounters::inc(int, unsigned long) (in RGWAsyncFetchRemoteObj::_send_request()) added
Actions #9

Updated by J. Eric Ivancich over 1 year ago

  • Status changed from New to Resolved
  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
Actions #10

Updated by J. Eric Ivancich over 1 year ago

  • Pull request ID set to 48021
Actions #11

Updated by Soumya Koduri over 1 year ago

  • Backport set to pacific quincy
Actions #12

Updated by Casey Bodley over 1 year ago

  • Status changed from Resolved to Pending Backport
Actions #13

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57635: pacific: RGW crash due to PerfCounters::inc assert_condition during multisite syncing added
Actions #14

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57636: quincy: RGW crash due to PerfCounters::inc assert_condition during multisite syncing added
Actions #15

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #16

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.9, v17.2.5 added
Actions #17

Updated by Konstantin Shalygin 8 months ago

  • Assignee set to Soumya Koduri
  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)