Project

General

Profile

Bug #49754

osd/OSD.cc: ceph_abort_msg("abort() called") during OSD::shutdown()

Added by Neha Ojha about 3 years ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-03-11T08:33:46.648 INFO:tasks.ceph.osd.7.gibba044.stderr:2021-03-11T08:33:46.642+0000 14c71700 -1 osd.7 574 *** Got signal Terminated ***
2021-03-11T08:33:46.859 INFO:tasks.ceph.osd.7.gibba044.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c71700 time 2021-03-11T08:33:46.844465+0000
2021-03-11T08:33:46.860 INFO:tasks.ceph.osd.7.gibba044.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")
2021-03-11T08:33:46.874 INFO:tasks.ceph.osd.7.gibba044.stderr: ceph version 17.0.0-1720-g42c01f0c (42c01f0c3932f7a9157797013e5db2e03d9c1fd9) quincy (dev)
2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x670168]
2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 2: (OSD::shutdown()+0x131e) [0x785e5e]
2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 3: (OSD::handle_signal(int)+0x124) [0x786254]
2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 4: (SignalHandler::entry()+0x5f8) [0xdadd58]
2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 5: (Thread::_entry_func(void*)+0xd) [0xde7a1d]
2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 6: /lib64/libpthread.so.0(+0x82de) [0xca2c2de]
2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 7: clone()

/a/nojha-2021-03-11_01:02:46-rados-wip-cc-weight-distro-basic-gibba/5954664

History

#1 Updated by Kefu Chai about 3 years ago

    -8> 2021-03-11T08:33:46.842+0000 14c71700 -1 osd.7 574 pgid 213.c has ref count of 8
    -7> 2021-03-11T08:33:46.872+0000 14c71700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/o
sd/OSD.cc: In function 'int OSD::shutdown()' thread 14c71700 time 2021-03-11T08:33:46.844465+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")

 ceph version 17.0.0-1720-g42c01f0c (42c01f0c3932f7a9157797013e5db2e03d9c1fd9) quincy (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x670168]
 2: (OSD::shutdown()+0x131e) [0x785e5e]
 3: (OSD::handle_signal(int)+0x124) [0x786254]
 4: (SignalHandler::entry()+0x5f8) [0xdadd58]
 5: (Thread::_entry_func(void*)+0xd) [0xde7a1d]
 6: /lib64/libpthread.so.0(+0x82de) [0xca2c2de]
 7: clone()

    -6> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.1e epoch_sent: 565 epoch_requested: 565 CheckReadable
    -5> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(6) _enqueue OpSchedulerItem(213.1e PGPeeringEvent(epoch_sent: 565 epoch_requested: 565 CheckReadable) prio 255 cost 10 e565)
    -4> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.9 epoch_sent: 555 epoch_requested: 555 RenewLease
    -3> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(1) _enqueue OpSchedulerItem(213.9 PGPeeringEvent(epoch_sent: 555 epoch_requested: 555 RenewLease) prio 255 cost 10 e555)
    -2> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.c epoch_sent: 555 epoch_requested: 555 RenewLease
    -1> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(4) _enqueue OpSchedulerItem(213.c PGPeeringEvent(epoch_sent: 555 epoch_requested: 555 RenewLease) prio 255 cost 10 e555)
     0> 2021-03-11T08:33:46.902+0000 14c71700 -1 *** Caught signal (Aborted) **

pg refcount leaked.

#2 Updated by Laura Flores almost 2 years ago

/a/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi061/crash/2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c/meta

{
    "crash_id": "2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c",
    "timestamp": "2022-04-26T01:06:35.101032Z",
    "process_name": "memcheck-amd64-",
    "entity_name": "osd.0",
    "ceph_version": "16.2.7-927-gaa0c7084",
    "utsname_hostname": "smithi061",
    "utsname_sysname": "Linux",
    "utsname_release": "4.18.0-383.el8.x86_64",
    "utsname_version": "#1 SMP Wed Apr 20 15:38:08 UTC 2022",
    "utsname_machine": "x86_64",
    "os_name": "CentOS Stream",
    "os_id": "centos",
    "os_version_id": "8",
    "os_version": "8",
    "assert_condition": "abort",
    "assert_func": "int OSD::shutdown()",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc",
    "assert_line": 4361,
    "assert_thread_name": "signal_handler",
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14a8a700 time 2022-04-26T01:06:34.999582+0000\n/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc: 4361: ceph_abort_msg(\"abort() called\")\n",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0xc843ce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x68811b]",
        "(OSD::shutdown()+0x131e) [0x7a040e]",
        "(OSD::handle_signal(int)+0x124) [0x7a0804]",
        "(SignalHandler::entry()+0x5f8) [0xddec58]",
        "/lib64/libpthread.so.0(+0x81cf) [0xc8391cf]",
        "clone()" 
    ]
}

#3 Updated by Laura Flores almost 2 years ago

  • Backport set to pacific

#4 Updated by Neha Ojha almost 2 years ago

Laura Flores wrote:

/a/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi061/crash/2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c/meta

[...]

https://tracker.ceph.com/issues/52124 was seen before this /ceph/teuthology-archive/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi142/log/valgrind

#5 Updated by Laura Flores over 1 year ago

/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939660

#6 Updated by Radoslaw Zarzynski over 1 year ago

int OSD::shutdown()
{
    // ...
    utime_t  start_time_umount = ceph_clock_now();
    store->prepare_for_fast_shutdown();
    std::lock_guard lock(osd_lock);
    // TBD: assert in allocator that nothing is being add
    store->umount();

    utime_t end_time = ceph_clock_now();
    if (cct->_conf->osd_fast_shutdown_timeout) {
      ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout);
    }
    dout(0) <<"Fast Shutdown duration total     :" << end_time              - start_time_func       << " seconds" << dendl;
    dout(0) <<"Fast Shutdown duration osd_drain :" << start_time_umount     - start_time_osd_drain  << " seconds" << dendl;
    dout(0) <<"Fast Shutdown duration umount    :" << end_time              - start_time_umount     << " seconds" << dendl;
    dout(0) <<"Fast Shutdown duration timer     :" << start_time_osd_drain  - start_time_timer      << " seconds" << dendl;

#7 Updated by Radoslaw Zarzynski over 1 year ago

Notes from the scrub:

1. It looks this happens mostly (only?) on pacific.
2. In at least of two replications Valgrind was clearly involved. Taking into that the assert validates walltime, this might be a test issue.

Also available in: Atom PDF