Project

General

Profile

Actions

Bug #54397

open

RadosGW Crash on Start

Added by Christian Kugler about 2 years ago. Updated 12 months ago.

Status:
Triaged
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
startup services
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):

252a97ceebcd30e1396dbdf2f6872a71c40bcfa4a58e684a1ddcb07a1248237c
5ba32bd2dae20e6377ebba271c5bfaaef8d2f7fdce2467c88d25c2bf64fd7fca
89af7ee16dcaec4566fbb4c4c6c79d25d9f81901435eb76eacb3c7ee8e92064f


Description

Some of our gateways crash if restartet via systemctl restart ceph-radosgw@rgw07.service. After trying several times they started again. Restarting it again can work but not always.
I found a similar sounding crash: https://tracker.ceph.com/issues/52744

2022-02-24T10:13:36.378+0100 7f8ff0951700 -1 received  signal: Terminated from /sbin/init  (PID: 1) UID: 0
2022-02-24T10:13:36.378+0100 7f8ff0951700  1 handle_sigterm
2022-02-24T10:13:36.378+0100 7f8ff0951700  1 handle_sigterm set alarm for 120
2022-02-24T10:13:36.378+0100 7f8ff452c540 -1 shutting down
2022-02-24T10:15:06.514+0100 7ff2c58a1540  0 deferred set uid:gid to 64045:64045 (ceph:ceph)
2022-02-24T10:15:06.514+0100 7ff2c58a1540  0 ceph version 15.2.15 (efc1e7ef13525f02d58d97ccd62fec325323c618) octopus (stable), process radosgw, pid 990334
2022-02-24T10:15:06.514+0100 7ff2c58a1540  0 framework: beast
2022-02-24T10:15:06.514+0100 7ff2c58a1540  0 framework conf key: port, val: 7480
2022-02-24T10:15:06.514+0100 7ff2c58a1540  1 radosgw_Main not setting numa affinity
2022-02-24T10:15:16.754+0100 7ff2c58a1540  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):394 Notify failed on object .log:meta.history: (110) Connection timed out
2022-02-24T10:15:16.754+0100 7ff2c58a1540  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):397 Backtrace: :  ceph version 15.2.15 (efc1e7ef13525f02d58d97ccd62fec325323c618) octopus (stable)
 1: (RGWSI_Notify::distribute(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWCacheNotifyInfo const&, optional_yield)+0xa7) [0x7ff2d0846fc7]
 2: (RGWSI_SysObj_Cache::distribute_cache(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw_raw_obj const&, ObjectCacheInfo&, int, optional_yield)+0x378) [0x7ff2d08536f8]
 3: (RGWSI_SysObj_Cache::write(rgw_raw_obj const&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >&, bool, ceph::buffer::v15_2_0::list const&, RGWObjVersionTracker*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield)+0x39c) [0x7ff2d0855bec]
 4: (RGWSI_SysObj::Obj::WOp::write(ceph::buffer::v15_2_0::list&, optional_yield)+0x3a) [0x7ff2d03c18fa]
 5: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&, bool, RGWObjVersionTracker*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, optional_yield, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x11f) [0x7ff2d07770df]
 6: (rgw_put_system_obj(RGWSysObjectCtx&, rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&, bool, RGWObjVersionTracker*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > >*)+0x19) [0x7ff2d0777169]
 7: (RGWSI_MDLog::write_history(RGWMetadataLogHistory const&, RGWObjVersionTracker*, bool)+0x139) [0x7ff2d03bace9]
 8: (RGWSI_MDLog::init_oldest_log_period()+0x6c5) [0x7ff2d03bd825]
 9: (RGWSI_MDLog::do_start()+0xec) [0x7ff2d03bdedc]
 10: (RGWServiceInstance::start()+0x1a) [0x7ff2d03e5c7a]
 11: (RGWServices_Def::init(ceph::common::CephContext*, bool, bool, bool)+0xab3) [0x7ff2d03e76f3]
 12: (RGWServices::do_init(ceph::common::CephContext*, bool, bool, bool)+0x1a) [0x7ff2d03e915a]
 13: (RGWRados::initialize()+0xed) [0x7ff2d06daf8d]
 14: (RGWStoreManager::init_storage_provider(ceph::common::CephContext*, bool, bool, bool, bool, bool, bool)+0x9f) [0x7ff2d076ea7f]
 15: (radosgw_Main(int, char const**)+0xdde) [0x7ff2d0389cee]
 16: (__libc_start_main()+0xea) [0x7ff2cfc9dd0a]
 17: (_start()+0x2a) [0x55aab100509a]

2022-02-24T10:15:16.754+0100 7ff2c58a1540  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):413 Invalidating obj=.log:meta.history tries=0
2022-02-24T10:15:26.758+0100 7ff2c58a1540  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 invalidation attempt 0 failed: (110) Connection timed out
2022-02-24T10:15:26.758+0100 7ff2c58a1540  1 int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):413 Invalidating obj=.log:meta.history tries=1

Files

rgw logs.rtf (141 KB) rgw logs.rtf Peter Razumovsky, 02/14/2023 03:15 PM

Related issues 8 (1 open7 closed)

Related to rgw - Bug #51911: crash: RGWSI_Notify::unwatchNew

Actions
Has duplicate rgw - Bug #54739: crash: RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long)Duplicate

Actions
Has duplicate rgw - Bug #54722: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0)Duplicate

Actions
Has duplicate rgw - Bug #54719: crash: RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long)Duplicate

Actions
Has duplicate rgw - Bug #54699: crash: RGWSI_Notify::finalize_watch()Duplicate

Actions
Has duplicate rgw - Bug #54694: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0)Duplicate

Actions
Has duplicate rgw - Bug #54683: crash: tcmalloc::allocate_full_cpp_throw_oom(unsigned long)Duplicate

Actions
Has duplicate rgw - Bug #54637: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0)Duplicate

Actions
Actions #1

Updated by Christian Kugler about 2 years ago

I tried civetweb as well as beast

Actions #2

Updated by Casey Bodley about 2 years ago

rgw's initialization failed because of this error:

int RGWSI_Notify::robust_notify(RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):394 Notify failed on object .log:meta.history: (110) Connection timed out

it shouldn't crash here, but the error would prevent a successful startup anyway

Actions #3

Updated by Casey Bodley about 2 years ago

  • Related to Bug #51911: crash: RGWSI_Notify::unwatch added
Actions #4

Updated by Casey Bodley about 2 years ago

  • Status changed from New to Triaged
  • Assignee set to Mark Kogan
  • Tags set to startup services
Actions #5

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54739: crash: RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long) added
Actions #6

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54722: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0) added
Actions #7

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54719: crash: RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long) added
Actions #8

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54699: crash: RGWSI_Notify::finalize_watch() added
Actions #9

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54694: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0) added
Actions #10

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54683: crash: tcmalloc::allocate_full_cpp_throw_oom(unsigned long) added
Actions #11

Updated by Casey Bodley about 2 years ago

  • Has duplicate Bug #54637: crash: virtual ceph::common::RefCountedObject::~RefCountedObject(): assert(nref == 0) added
Actions #12

Updated by Tony Fontenay over 1 year ago

Hello,

I encounter the same problem in Pacific v16.2.10. Is there a workaround to get the container started.

2022-12-19T10:48:51.825+0000 7fc2e7da05c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object fr-ws.rgw.log:meta.history: (110) Connection timed out
2022-12-19T10:48:51.825+0000 7fc2e7da05c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=fr-ws.rgw.log:meta.history tries=0
2022-12-19T10:49:01.829+0000 7fc2e7da05c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 0 failed: (110) Connection timed out
2022-12-19T10:49:01.829+0000 7fc2e7da05c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=fr-ws.rgw.log:meta.history tries=1
2022-12-19T10:53:40.905+0000 7fc2d3242700 -1 Initialization timeout, failed to initialize
Actions #13

Updated by Tony Fontenay over 1 year ago

Tony Fontenay wrote:

Hello,

I encounter the same problem in Pacific v16.2.10. Is there a workaround to get the container started.

[...]

Ultimately, After several attempts, the container was able to start

Actions #14

Updated by Peter Razumovsky about 1 year ago

Faced the same issue on v16.2.11:

debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific (stable), process radosgw, pid 1
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 framework: beast
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 framework conf key: port, val: 80
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 framework conf key: ssl_port, val: 8443
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
debug 2023-02-14T15:10:41.558+0000 7f3f28ed85c0  1 radosgw_Main not setting numa affinity
debug 2023-02-14T15:10:51.973+0000 7f3f28ed85c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):402 Notify failed on object openstack-store.rgw.log:meta.history: (110) Connection timed out
debug 2023-02-14T15:10:51.973+0000 7f3f28ed85c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=openstack-store.rgw.log:meta.history tries=0
debug 2023-02-14T15:11:01.977+0000 7f3f28ed85c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):423 invalidation attempt 0 failed: (110) Connection timed out
debug 2023-02-14T15:11:01.977+0000 7f3f28ed85c0  1 rgw main: int RGWSI_Notify::robust_notify(const DoutPrefixProvider*, RGWSI_RADOS::Obj&, const RGWCacheNotifyInfo&, optional_yield):418 Invalidating obj=openstack-store.rgw.log:meta.history tries=1

after restart logs are in the attached file.

Actions #15

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.6, v16.2.7, v17.2.5, v17.2.6 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=0f3fccb24792d2a29286127ebdae5b502ccfa9dd0e6c15beae112744c4a88fa9

Sanitized backtrace:

    RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long)
    RGWWatcher::reinit()
    Context::complete(int)
    Finisher::finisher_thread_entry()

Crash dump sample:
{
    "archived": "2023-04-26 15:27:58.174335",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f8f9b1f8cf0]",
        "/lib64/librados.so.2(+0xa523a) [0x7f8f9ea5523a]",
        "/lib64/librados.so.2(+0xaf938) [0x7f8f9ea5f938]",
        "(RGWSI_Notify::unwatch(RGWSI_RADOS::Obj&, unsigned long)+0x30) [0x7f8f9dfbf520]",
        "(RGWWatcher::reinit()+0x3a) [0x7f8f9dfc653a]",
        "(Context::complete(int)+0xd) [0x7f8f9d86963d]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f8f9bbf956d]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f8f9b1ee1ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-04-26T15:20:28.684265Z_8cf2caba-3178-4615-9916-7df375f669ab",
    "entity_name": "client.2d8b922bc596d7a7534981e8107ac04735617f84",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "radosgw",
    "stack_sig": "252a97ceebcd30e1396dbdf2f6872a71c40bcfa4a58e684a1ddcb07a1248237c",
    "timestamp": "2023-04-26T15:20:28.684265Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.17.0-1020-oem",
    "utsname_sysname": "Linux",
    "utsname_version": "#21-Ubuntu SMP PREEMPT Fri Oct 14 09:33:24 UTC 2022" 
}

Actions

Also available in: Atom PDF