Bug #49754
osd/OSD.cc: ceph_abort_msg("abort() called") during OSD::shutdown()
0%
Description
2021-03-11T08:33:46.648 INFO:tasks.ceph.osd.7.gibba044.stderr:2021-03-11T08:33:46.642+0000 14c71700 -1 osd.7 574 *** Got signal Terminated *** 2021-03-11T08:33:46.859 INFO:tasks.ceph.osd.7.gibba044.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c71700 time 2021-03-11T08:33:46.844465+0000 2021-03-11T08:33:46.860 INFO:tasks.ceph.osd.7.gibba044.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called") 2021-03-11T08:33:46.874 INFO:tasks.ceph.osd.7.gibba044.stderr: ceph version 17.0.0-1720-g42c01f0c (42c01f0c3932f7a9157797013e5db2e03d9c1fd9) quincy (dev) 2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x670168] 2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 2: (OSD::shutdown()+0x131e) [0x785e5e] 2021-03-11T08:33:46.875 INFO:tasks.ceph.osd.7.gibba044.stderr: 3: (OSD::handle_signal(int)+0x124) [0x786254] 2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 4: (SignalHandler::entry()+0x5f8) [0xdadd58] 2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 5: (Thread::_entry_func(void*)+0xd) [0xde7a1d] 2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 6: /lib64/libpthread.so.0(+0x82de) [0xca2c2de] 2021-03-11T08:33:46.876 INFO:tasks.ceph.osd.7.gibba044.stderr: 7: clone()
/a/nojha-2021-03-11_01:02:46-rados-wip-cc-weight-distro-basic-gibba/5954664
History
#1 Updated by Kefu Chai about 3 years ago
-8> 2021-03-11T08:33:46.842+0000 14c71700 -1 osd.7 574 pgid 213.c has ref count of 8 -7> 2021-03-11T08:33:46.872+0000 14c71700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/o sd/OSD.cc: In function 'int OSD::shutdown()' thread 14c71700 time 2021-03-11T08:33:46.844465+0000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-1720-g42c01f0c/rpm/el8/BUILD/ceph-17.0.0-1720-g42c01f0c/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called") ceph version 17.0.0-1720-g42c01f0c (42c01f0c3932f7a9157797013e5db2e03d9c1fd9) quincy (dev) 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x670168] 2: (OSD::shutdown()+0x131e) [0x785e5e] 3: (OSD::handle_signal(int)+0x124) [0x786254] 4: (SignalHandler::entry()+0x5f8) [0xdadd58] 5: (Thread::_entry_func(void*)+0xd) [0xde7a1d] 6: /lib64/libpthread.so.0(+0x82de) [0xca2c2de] 7: clone() -6> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.1e epoch_sent: 565 epoch_requested: 565 CheckReadable -5> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(6) _enqueue OpSchedulerItem(213.1e PGPeeringEvent(epoch_sent: 565 epoch_requested: 565 CheckReadable) prio 255 cost 10 e565) -4> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.9 epoch_sent: 555 epoch_requested: 555 RenewLease -3> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(1) _enqueue OpSchedulerItem(213.9 PGPeeringEvent(epoch_sent: 555 epoch_requested: 555 RenewLease) prio 255 cost 10 e555) -2> 2021-03-11T08:33:46.894+0000 359d3700 15 osd.7 574 enqueue_peering_evt 213.c epoch_sent: 555 epoch_requested: 555 RenewLease -1> 2021-03-11T08:33:46.894+0000 359d3700 20 osd.7 op_wq(4) _enqueue OpSchedulerItem(213.c PGPeeringEvent(epoch_sent: 555 epoch_requested: 555 RenewLease) prio 255 cost 10 e555) 0> 2021-03-11T08:33:46.902+0000 14c71700 -1 *** Caught signal (Aborted) **
pg refcount leaked.
#2 Updated by Laura Flores almost 2 years ago
/a/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi061/crash/2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c/meta
{
"crash_id": "2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c",
"timestamp": "2022-04-26T01:06:35.101032Z",
"process_name": "memcheck-amd64-",
"entity_name": "osd.0",
"ceph_version": "16.2.7-927-gaa0c7084",
"utsname_hostname": "smithi061",
"utsname_sysname": "Linux",
"utsname_release": "4.18.0-383.el8.x86_64",
"utsname_version": "#1 SMP Wed Apr 20 15:38:08 UTC 2022",
"utsname_machine": "x86_64",
"os_name": "CentOS Stream",
"os_id": "centos",
"os_version_id": "8",
"os_version": "8",
"assert_condition": "abort",
"assert_func": "int OSD::shutdown()",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc",
"assert_line": 4361,
"assert_thread_name": "signal_handler",
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14a8a700 time 2022-04-26T01:06:34.999582+0000\n/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7-927-gaa0c7084/rpm/el8/BUILD/ceph-16.2.7-927-gaa0c7084/src/osd/OSD.cc: 4361: ceph_abort_msg(\"abort() called\")\n",
"backtrace": [
"/lib64/libpthread.so.0(+0x12ce0) [0xc843ce0]",
"gsignal()",
"abort()",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x68811b]",
"(OSD::shutdown()+0x131e) [0x7a040e]",
"(OSD::handle_signal(int)+0x124) [0x7a0804]",
"(SignalHandler::entry()+0x5f8) [0xddec58]",
"/lib64/libpthread.so.0(+0x81cf) [0xc8391cf]",
"clone()"
]
}
#3 Updated by Laura Flores almost 2 years ago
- Backport set to pacific
#4 Updated by Neha Ojha almost 2 years ago
Laura Flores wrote:
/a/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi061/crash/2022-04-26T01:06:35.101032Z_c80ccf92-f148-4bbf-8305-9d93042e095c/meta
[...]
https://tracker.ceph.com/issues/52124 was seen before this /ceph/teuthology-archive/yuriw-2022-04-26_00:11:14-rados-wip-55324-pacific-backport-distro-default-smithi/6805265/remote/smithi142/log/valgrind
#5 Updated by Laura Flores over 1 year ago
/a/yuriw-2022-07-19_23:25:12-rados-wip-yuri2-testing-2022-07-15-0755-pacific-distro-default-smithi/6939660
#6 Updated by Radoslaw Zarzynski over 1 year ago
int OSD::shutdown() { // ... utime_t start_time_umount = ceph_clock_now(); store->prepare_for_fast_shutdown(); std::lock_guard lock(osd_lock); // TBD: assert in allocator that nothing is being add store->umount(); utime_t end_time = ceph_clock_now(); if (cct->_conf->osd_fast_shutdown_timeout) { ceph_assert(end_time - start_time_func < cct->_conf->osd_fast_shutdown_timeout); } dout(0) <<"Fast Shutdown duration total :" << end_time - start_time_func << " seconds" << dendl; dout(0) <<"Fast Shutdown duration osd_drain :" << start_time_umount - start_time_osd_drain << " seconds" << dendl; dout(0) <<"Fast Shutdown duration umount :" << end_time - start_time_umount << " seconds" << dendl; dout(0) <<"Fast Shutdown duration timer :" << start_time_osd_drain - start_time_timer << " seconds" << dendl;
#7 Updated by Radoslaw Zarzynski over 1 year ago
Notes from the scrub:
1. It looks this happens mostly (only?) on pacific.
2. In at least of two replications Valgrind was clearly involved. Taking into that the assert validates walltime, this might be a test issue.