Bug #49237: segv in AsyncConnection::_stop()

Added by Sage Weil about 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-02-10T04:43:28.384 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]: *** Caught signal (Segmentation fault) **
2021-02-10T04:43:28.384 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  in thread 7fca0e015700 thread_name:msgr-worker-0
2021-02-10T04:43:28.384 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  ceph version 17.0.0-681-gc1ea6241 (c1ea624123d412aff8b9d1430e36cb45fcab76b8) quincy (dev)
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7fca12004b20]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x555eec63c48c]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  3: (AsyncConnection::_stop()+0xab) [0x555eec63663b]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  4: (ProtocolV2::stop()+0x8f) [0x555eec66171f]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x555eec676e62]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x555eec6786ff]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  7: (ProtocolV2::handle_frame_payload()+0x20b) [0x555eec678d0b]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x555eec678f90]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x555eec679185]
2021-02-10T04:43:28.385 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x555eec679232]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x555eec67a381]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x555eec6625bc]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  13: (AsyncConnection::process()+0x789) [0x555eec6396d9]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x555eec489e37]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  15: /usr/bin/ceph-osd(+0xe8a95c) [0x555eec48d95c]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7fca11654ba3]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  17: /lib64/libpthread.so.0(+0x814a) [0x7fca11ffa14a]
2021-02-10T04:43:28.386 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  18: clone()
2021-02-10T04:43:28.392 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]: debug 2021-02-10T04:43:28.279+0000 7fca0e015700 -1 *** Caught signal (Segmentation fault) **
2021-02-10T04:43:28.392 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:  in thread 7fca0e015700 thread_name:msgr-worker-0

/a/sage-2021-02-09_22:53:38-rados:cephadm:thrash-wip-sage2-testing-2021-02-09-1332-distro-basic-smithi/5872150
/a/sage-2021-02-09_22:53:38-rados:cephadm:thrash-wip-sage2-testing-2021-02-09-1332-distro-basic-smithi/5872147

I also see reference to this bug in #44354
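
The top frames (std::_Rb_tree::find called from AsyncConnection::_stop()) point at a lookup in a shared std::set of connections. As a minimal sketch of that failure mode (illustrative names, not Ceph's actual code): a find() on a std::set that another thread mutates without a common lock can chase freed tree nodes and segfault.

#include <mutex>
#include <set>

struct Conn {};

std::set<Conn*> conns;   // shared connection registry (illustrative)
std::mutex conns_lock;   // lock that must guard every access to 'conns'

void unregister_conn_racy(Conn* c) {
  // BUG: walks the red-black tree with no lock held; a concurrent
  // erase()/insert() in another thread can free or rebalance the nodes
  // under us -> a segfault like the one in frame 2 of the backtrace.
  conns.find(c);
}

void unregister_conn_safe(Conn* c) {
  std::lock_guard<std::mutex> l(conns_lock);  // serialize all tree accesses
  auto it = conns.find(c);
  if (it != conns.end())
    conns.erase(it);
}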


Files

crash.log (908 KB) - Adrian Dabuleanu, 04/06/2021 01:16 PM

Related issues 5 (0 open, 5 closed)

Related to RADOS - Bug #49259: test_rados_api tests timeout with cephadm (plus extremely large OSD logs) (Resolved, Brad Hubbard)

Has duplicate RADOS - Bug #52176: crash: std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnecti (Duplicate)

Has duplicate RADOS - Bug #51527: Ceph osd crashed due to segfault (Resolved, Radoslaw Zarzynski)

Copied to Messengers - Backport #50482: octopus: segv in AsyncConnection::_stop() (Resolved)
Copied to Messengers - Backport #50483: pacific: segv in AsyncConnection::_stop() (Resolved)
Actions #1

Updated by Neha Ojha about 3 years ago

/a/yuriw-2021-02-09_22:48:58-rados-wip-yuri8-testing-2021-02-08-0950-distro-basic-smithi/5872137

rados/cephadm/with-work/{distro/ubuntu_18.04 fixed-2 mode/root mon_election/classic msgr/async start tasks/rados_api_tests}

Actions #2

Updated by Sage Weil about 3 years ago

/a/sage-2021-02-10_23:47:44-rados:cephadm:thrash-wip-sage2-testing-2021-02-10-1604-distro-basic-smithi/5873968
/a/sage-2021-02-10_23:47:44-rados:cephadm:thrash-wip-sage2-testing-2021-02-10-1604-distro-basic-smithi/5873971

seems to correspond to the async-v2only facet, e.g.

rados:cephadm:thrash/{0-distro/centos_8.0 1-start 2-thrash 3-tasks/snaps-few-objects fixed-2 msgr/async-v2only root}

Actions #3

Updated by Neha Ojha about 3 years ago

similar?

rados:/thrash-old-clients/{0-size-min-size-overrides/3-size-2-min-size 1-install/nautilus-v1only backoff/peering ceph clusters/{openstack three-plus-one} d-balancer/on distro$/{ubuntu_18.04} mon_election/connectivity msgr-failures/few rados thrashers/default thrashosds-health workloads/snaps-few-objects}

2021-02-11T03:00:17.182 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:16 smithi093 bash[24854]: *** Caught signal (Segmentation fault) **
2021-02-11T03:00:17.182 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:16 smithi093 bash[24854]:  in thread 7f86a8152700 thread_name:msgr-worker-1
2021-02-11T03:00:17.234 INFO:journalctl@ceph.mon.c.smithi186.stdout:Feb 11 03:00:16 smithi186 bash[12476]: cluster 2021-02-11T03:00:15.729696+0000 mgr.y (mgr.14140) 3189 : cluster [DBG] pgmap v4943: 47 pgs: 47 active+clean; 613 MiB data, 2.0 GiB used, 1.0 TiB / 1.0 TiB avail
2021-02-11T03:00:18.721 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  ceph version 17.0.0-703-gb4d9cc45 (b4d9cc45d6ff1ea5382954dece424128b478d6f7) quincy (dev)
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  1: /lib64/libpthread.so.0(+0x12b20) [0x7f86ac942b20]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5650e6de97cc]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  3: (AsyncConnection::_stop()+0xab) [0x5650e6de397b]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  4: (ProtocolV1::stop()+0x150) [0x5650e6e015f0]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  5: (ProtocolV1::replace(boost::intrusive_ptr<AsyncConnection> const&, ceph_msg_connect_reply&, ceph::buffer::v15_2_0::list&)+0x157) [0x5650e6e024a7]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  6: (ProtocolV1::handle_connect_message_2()+0x2936) [0x5650e6e05766]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  7: (ProtocolV1::handle_connect_message_auth(char*, int)+0x148) [0x5650e6e06f88]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  8: /usr/bin/ceph-osd(+0x10389bd) [0x5650e6de99bd]
2021-02-11T03:00:22.840 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  9: (AsyncConnection::process()+0x789) [0x5650e6de6a19]
2021-02-11T03:00:22.841 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  10: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5650e6c35d97]
2021-02-11T03:00:22.841 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  11: /usr/bin/ceph-osd(+0xe888bc) [0x5650e6c398bc]
2021-02-11T03:00:22.841 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f86abf92ba3]
2021-02-11T03:00:22.841 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  13: /lib64/libpthread.so.0(+0x814a) [0x7f86ac93814a]
2021-02-11T03:00:22.841 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:18 smithi093 bash[24854]:  14: clone()

/a/nojha-2021-02-10_18:54:18-rados:-master-distro-basic-smithi/5873606

Actions #4

Updated by Sage Weil about 3 years ago

  • Status changed from New to Need More Info

https://github.com/ceph/ceph/pull/39482 reverts the cephadm container init change that triggered this regression.

Clearly something funny is going on so this should be investigated more carefully before re-merging the init change...

Actions #5

Updated by Sage Weil about 3 years ago

  • Related to Bug #49259: test_rados_api tests timeout with cephadm (plus extremely large OSD logs) added
Actions #6

Updated by alexandre derumier about 3 years ago

Hi, I have been seeing similar random OSD crashes for some months on octopus (I'm sure I have triggered it on 15.2.4 - 15.2.8).

root@ceph5-9:~# ceph crash info 2021-02-18T07:18:15.223807Z_5bbe94fe-466b-4de8-9037-3a0872916174 {
"backtrace": [
"(()+0x12730) [0x7fd381273730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5642d711d394]",
"(AsyncConnection::_stop()+0xa7) [0x5642d71179d7]",
"(ProtocolV2::stop()+0x8b) [0x5642d713f41b]",
"(ProtocolV2::_fault()+0x6b) [0x5642d713f59b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x5642d71555e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x5642d7140114]",
"(AsyncConnection::process()+0x79c) [0x5642d711a82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x5642d6f7e91d]",
"(()+0x11f41cb) [0x5642d6f841cb]",
"(()+0xbbb2f) [0x7fd381138b2f]",
"(()+0x7fa3) [0x7fd381268fa3]",
"(clone()+0x3f) [0x7fd380e164cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-02-18T07:18:15.223807Z_5bbe94fe-466b-4de8-9037-3a0872916174",
"entity_name": "osd.14",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-02-18T07:18:15.223807Z",
"utsname_hostname": "ceph5-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}

root@ceph5-9:~# ceph crash info 2021-02-19T08:43:19.626268Z_ad9492f6-ba47-4cfc-b4c0-0e311376140e {
"backtrace": [
"(()+0x12730) [0x7fc180fe6730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5618e04fe394]",
"(AsyncConnection::_stop()+0xa7) [0x5618e04f89d7]",
"(ProtocolV2::stop()+0x8b) [0x5618e052041b]",
"(ProtocolV2::_fault()+0x6b) [0x5618e052059b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x5618e05365e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x5618e0521114]",
"(AsyncConnection::process()+0x79c) [0x5618e04fb82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x5618e035f91d]",
"(()+0x11f41cb) [0x5618e03651cb]",
"(()+0xbbb2f) [0x7fc180eabb2f]",
"(()+0x7fa3) [0x7fc180fdbfa3]",
"(clone()+0x3f) [0x7fc180b894cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-02-19T08:43:19.626268Z_ad9492f6-ba47-4cfc-b4c0-0e311376140e",
"entity_name": "osd.60",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-02-19T08:43:19.626268Z",
"utsname_hostname": "ceph5-9",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}

root@ceph5-9:~# ceph crash info 2021-01-18T02:38:03.143317Z_dbc2f10d-26ae-4162-96da-78407c16d507 {
"backtrace": [
"(()+0x12730) [0x7f58610b1730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x55cee8fc1394]",
"(AsyncConnection::_stop()+0xa7) [0x55cee8fbb9d7]",
"(ProtocolV2::stop()+0x8b) [0x55cee8fe341b]",
"(ProtocolV2::_fault()+0x6b) [0x55cee8fe359b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x55cee8ff95e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55cee8fe4114]",
"(AsyncConnection::process()+0x79c) [0x55cee8fbe82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x55cee8e2291d]",
"(()+0x11f41cb) [0x55cee8e281cb]",
"(()+0xbbb2f) [0x7f5860f76b2f]",
"(()+0x7fa3) [0x7f58610a6fa3]",
"(clone()+0x3f) [0x7f5860c544cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-01-18T02:38:03.143317Z_dbc2f10d-26ae-4162-96da-78407c16d507",
"entity_name": "osd.6",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-01-18T02:38:03.143317Z",
"utsname_hostname": "ceph5-2.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-6-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20)"
}

root@ceph5-9:~# ceph crash info 2021-01-10T10:45:39.605761Z_0870ac8f-5d76-4146-8f55-f412f0188944 {
"archived": "2021-01-11 09:00:44.916944",
"backtrace": [
"(()+0x12730) [0x7f7ec7f8d730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x55c55fd18394]",
"(AsyncConnection::_stop()+0xa7) [0x55c55fd129d7]",
"(ProtocolV2::stop()+0x8b) [0x55c55fd3a41b]",
"(ProtocolV2::_fault()+0x6b) [0x55c55fd3a59b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x55c55fd505e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55c55fd3b114]",
"(AsyncConnection::process()+0x79c) [0x55c55fd1582c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x55c55fb7991d]",
"(()+0x11f41cb) [0x55c55fb7f1cb]",
"(()+0xbbb2f) [0x7f7ec7e52b2f]",
"(()+0x7fa3) [0x7f7ec7f82fa3]",
"(clone()+0x3f) [0x7f7ec7b304cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-01-10T10:45:39.605761Z_0870ac8f-5d76-4146-8f55-f412f0188944",
"entity_name": "osd.57",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-01-10T10:45:39.605761Z",
"utsname_hostname": "ceph5-9",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}

This is on bare metal, no container, Debian 10.

Actions #7

Updated by Sage Weil about 3 years ago

This is reliably triggered by rados/cephadm/thrash on centos/rhel nodes (ubuntu seems fine, strangely) when --init is passed to podman. It is unclear why adding a container init process makes this bug surface in qa...

Actions #8

Updated by Sage Weil about 3 years ago

  • Status changed from Need More Info to Fix Under Review
  • Pull request ID set to 39739
Actions #9

Updated by Sage Weil about 3 years ago

  • Status changed from Fix Under Review to Need More Info
  • Pull request ID deleted (39739)
Actions #10

Updated by Sage Weil about 3 years ago

  • Priority changed from Urgent to High

Sage Weil wrote:

This is reliably triggered by rados/cephadm/thrash on centos/rhel nodes (ubuntu seems fine, strangely) when --init is passed to podman. It is unclear why adding a container init process makes this bug surface in qa...

The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (The PID is used for the messenger nonce; when the daemon is PID 1, a random value is used for the nonce instead, which is why the setup with no container init worked properly.)

I was mostly triggering busy reconnect loops when trying to reproduce, not the segv. So this msgr issue is still a real bug, but probably not one we're likely to hit easily.
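
For illustration only (the helper name pick_nonce is assumed, not Ceph's exact code), the nonce behavior described above boils down to something like this: the messenger address is ip:port/nonce, and the nonce is derived from the daemon's PID unless the daemon is PID 1.

#include <unistd.h>
#include <cstdint>
#include <random>

// Hedged sketch of the nonce selection described in the comment above.
uint64_t pick_nonce() {
  pid_t pid = getpid();
  if (pid == 1) {
    // No container init: the daemon is pid 1, which carries no entropy,
    // so a random nonce is used instead -- addresses stay unique.
    std::random_device rd;
    return ((uint64_t)rd() << 32) | rd();
  }
  // With a container init, the daemon lands on a fixed low pid (always 7
  // in the qa runs above), so every OSD advertises the same nonce and
  // addresses can collide.
  return (uint64_t)pid;
}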

Actions #11

Updated by Adrian Dabuleanu about 3 years ago

I have encountered the same issue on my production Ceph cluster, with multiple OSDs crashing. I am running Ceph 15.2.8 orchestrated by Rook 1.5.4 on top of k8s 1.20.1. I have attached the debug log. Is there a workaround to get past this issue?

Actions #12

Updated by Adrian Dabuleanu about 3 years ago

We rebooted the physical servers two days ago and the OSDs seemed to be fine. But today they started crashing again with the same error, though at a smaller scale:

2021-04-07T12:41:57.826500Z_c1b71737-f3aa-4ff2-b2f3-6cfd7eefc006  osd.43   *   
2021-04-07T12:42:58.478345Z_4b9e501b-fd28-4e34-9de9-3876d5cbfb47  osd.35   *   
2021-04-07T12:43:09.249149Z_4a4eaecd-77d0-4c8e-a4cd-bcae601cff3b  osd.30   *   
2021-04-07T12:43:22.754376Z_d087e282-3bfc-454e-91cf-e1f876913c47  osd.26   *   
2021-04-07T12:44:04.102748Z_32429126-d8a3-42b3-b715-923331ab6baa  osd.38   *   
2021-04-07T16:49:47.763038Z_4181a436-7875-45ad-9f79-fe087420fa92  osd.43   *   
2021-04-07T16:50:43.682655Z_319b3b52-f1c3-4403-9110-bedf38af33c6  osd.45   *   
2021-04-07T16:55:38.242908Z_49339a85-6f27-4086-b407-239fbd4b6989  osd.35   *   
2021-04-07T17:01:01.102548Z_4f084c78-a92d-4435-a25f-393fc67fd555  osd.41   *   
2021-04-07T17:01:10.771620Z_778426ca-9d81-4a3b-b836-6af3c3b639b7  osd.26   *   
2021-04-07T17:02:00.399641Z_9b9ca6e5-7bb4-4307-a7a2-53435f1828ef  osd.32   *   
2021-04-07T17:06:09.158782Z_0d4f1891-a7df-4bbe-9508-f7a7506c660d  osd.43   *   
2021-04-07T17:06:56.888059Z_6d40a03b-c874-48d2-98c4-18ebfdc02137  osd.37   *   
2021-04-07T19:32:16.772362Z_5d342b0b-7831-4ad2-ba04-06e44ba995af  osd.43   *   
2021-04-07T19:34:20.696933Z_12e57f20-91c8-451f-94ab-25a3082b1a12  osd.38   *   
2021-04-07T19:35:27.978435Z_a6b61afa-4cdf-4a1f-82af-8d454c914424  osd.35   * 

I want to understand what is causing this. Mr. Sage Weil, can you please give more details on this comment? I want to understand how this maps to my 3-node k8s cluster.

The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (The PID is used for the messenger nonce; when the daemon is PID 1, a random value is used for the nonce instead, which is why the setup with no container init worked properly.)

Thanks,
Adrian

Actions #13

Updated by Sage Weil about 3 years ago

Adrian Dabuleanu wrote:

We rebooted the physical servers two days ago and the OSDs seemed to be fine. But today they started crashing again with the same error, though at a smaller scale.
[...]

I want to understand what is causing this. Mr. Sage Weil, can you please give more details on this comment? I want to understand how this maps to my 3-node k8s cluster.

The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (The PID is used for the messenger nonce; when the daemon is PID 1, a random value is used for the nonce instead, which is why the setup with no container init worked properly.)

Thanks,
Adrian

Can you share the output from 'ceph osd dump'? I'm curious whether the ports are randomized or not (and whether this has the same cause as the issue I saw).

Actions #14

Updated by Adrian Dabuleanu about 3 years ago

Here is the output

epoch 49053
fsid dbb096d6-d67d-4319-a41b-e113a181c414
created 2021-01-09T13:41:43.929133+0000
modified 2021-04-08T13:44:13.584097+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 79
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release octopus
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 49035 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6980 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 25638 lfor 0/4968/4966 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6980 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 34566 lfor 0/5026/5024 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 48906 lfor 0/48906/48904 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'hdd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 15386 lfor 0/15386/15384 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 49
osd.0 up   in  weight 1 up_from 33984 up_thru 48703 down_at 33971 last_clean_interval [30559,33965) [v2:10.40.136.19:6805/489468713,v1:10.40.136.19:6808/489468713] [v2:10.40.137.19:6806/489468713,v1:10.40.137.19:6808/489468713] exists,up d4df9d97-70a8-458a-9e8d-ebfb682d4265
osd.1 up   in  weight 1 up_from 34099 up_thru 48891 down_at 34082 last_clean_interval [30819,34078) [v2:10.40.136.30:6832/367310845,v1:10.40.136.30:6833/367310845] [v2:10.40.137.30:6832/367310845,v1:10.40.137.30:6833/367310845] exists,up 63ddfd79-7da9-4a29-aada-6e6de76c75f2
osd.2 up   in  weight 1 up_from 34064 up_thru 49006 down_at 34015 last_clean_interval [31075,34009) [v2:10.40.136.31:6824/2689213217,v1:10.40.136.31:6825/2689213217] [v2:10.40.137.31:6824/2689213217,v1:10.40.137.31:6825/2689213217] exists,up fbf8598b-f166-4d61-9a81-58b9e480e123
osd.3 up   in  weight 0.950012 up_from 33985 up_thru 48889 down_at 33966 last_clean_interval [33054,33965) [v2:10.40.136.19:6800/1657213001,v1:10.40.136.19:6801/1657213001] [v2:10.40.137.19:6800/1657213001,v1:10.40.137.19:6801/1657213001] exists,up d4960f2b-3eba-4163-8cf5-52d95cb08a9d
osd.4 up   in  weight 1 up_from 34098 up_thru 49023 down_at 34075 last_clean_interval [34056,34074) [v2:10.40.136.30:6801/4154683830,v1:10.40.136.30:6804/4154683830] [v2:10.40.137.30:6802/4154683830,v1:10.40.137.30:6803/4154683830] exists,up cf7854d2-f86f-47d0-9b98-47fc4b50053c
osd.5 up   in  weight 1 up_from 34064 up_thru 49011 down_at 34016 last_clean_interval [30843,34009) [v2:10.40.136.31:6826/348024490,v1:10.40.136.31:6827/348024490] [v2:10.40.137.31:6826/348024490,v1:10.40.137.31:6827/348024490] exists,up 0ba18c7b-9ff7-4a46-9c72-5ec62151ad97
osd.6 up   in  weight 1 up_from 33984 up_thru 48252 down_at 33969 last_clean_interval [29881,33965) [v2:10.40.136.19:6806/3441212294,v1:10.40.136.19:6810/3441212294] [v2:10.40.137.19:6807/3441212294,v1:10.40.137.19:6809/3441212294] exists,up e7e98f66-c126-4b62-b010-931d92b7ce6d
osd.7 up   in  weight 1 up_from 34098 up_thru 48906 down_at 34079 last_clean_interval [30692,34074) [v2:10.40.136.30:6802/308382208,v1:10.40.136.30:6806/308382208] [v2:10.40.137.30:6804/308382208,v1:10.40.137.30:6805/308382208] exists,up 5881d174-1a31-427c-ab24-46971ae54e7f
osd.8 up   in  weight 1 up_from 34064 up_thru 49012 down_at 34017 last_clean_interval [30707,34010) [v2:10.40.136.31:6848/2916991801,v1:10.40.136.31:6849/2916991801] [v2:10.40.137.31:6848/2916991801,v1:10.40.137.31:6849/2916991801] exists,up 5bb89940-a98f-40c3-b168-b9bc3e98f1b3
osd.9 up   in  weight 0.950012 up_from 33985 up_thru 49015 down_at 33968 last_clean_interval [29881,33965) [v2:10.40.136.19:6841/1189783200,v1:10.40.136.19:6844/1189783200] [v2:10.40.137.19:6841/1189783200,v1:10.40.137.19:6844/1189783200] exists,up 3fe3fd66-93b0-440f-993b-445453415fbd
osd.10 up   in  weight 1 up_from 34098 up_thru 49018 down_at 34080 last_clean_interval [30831,34076) [v2:10.40.136.30:6816/1545087195,v1:10.40.136.30:6817/1545087195] [v2:10.40.137.30:6816/1545087195,v1:10.40.137.30:6817/1545087195] exists,up ea0c0156-424a-4d54-8a35-3a0219ce1cef
osd.11 up   in  weight 1 up_from 34064 up_thru 49007 down_at 34014 last_clean_interval [30764,34008) [v2:10.40.136.31:6840/208244957,v1:10.40.136.31:6841/208244957] [v2:10.40.137.31:6840/208244957,v1:10.40.137.31:6841/208244957] exists,up bf730401-ee2b-4f15-88fb-149fa804ff12
osd.12 up   in  weight 1 up_from 33984 up_thru 48604 down_at 33970 last_clean_interval [30703,33966) [v2:10.40.136.19:6817/3440415969,v1:10.40.136.19:6819/3440415969] [v2:10.40.137.19:6817/3440415969,v1:10.40.137.19:6819/3440415969] exists,up 04008079-0230-4b09-be44-28c6b9fbfcd2
osd.13 up   in  weight 1 up_from 34099 up_thru 48911 down_at 34080 last_clean_interval [30711,34076) [v2:10.40.136.30:6836/1069316143,v1:10.40.136.30:6837/1069316143] [v2:10.40.137.30:6836/1069316143,v1:10.40.137.30:6837/1069316143] exists,up 03177899-807e-4e4f-9391-dfeaaa93b1ac
osd.14 up   in  weight 1 up_from 34064 up_thru 49008 down_at 34015 last_clean_interval [30139,34008) [v2:10.40.136.31:6800/1901072215,v1:10.40.136.31:6803/1901072215] [v2:10.40.137.31:6802/1901072215,v1:10.40.137.31:6803/1901072215] exists,up dda69eef-c998-4e32-bb03-9828066611a7
osd.15 up   in  weight 1 up_from 33986 up_thru 49016 down_at 33969 last_clean_interval [30823,33965) [v2:10.40.136.19:6852/1228464414,v1:10.40.136.19:6854/1228464414] [v2:10.40.137.19:6852/1228464414,v1:10.40.137.19:6854/1228464414] exists,up cf163e4d-73cf-44d4-989b-abac839f7f4f
osd.16 up   in  weight 1 up_from 34098 up_thru 48594 down_at 34080 last_clean_interval [27903,34075) [v2:10.40.136.30:6820/346991284,v1:10.40.136.30:6822/346991284] [v2:10.40.137.30:6820/346991284,v1:10.40.137.30:6822/346991284] exists,up 2cb45b04-bca0-40c1-a850-92b8b843c364
osd.17 up   in  weight 0.950012 up_from 34063 up_thru 48855 down_at 34015 last_clean_interval [30218,34008) [v2:10.40.136.31:6834/797694268,v1:10.40.136.31:6837/797694268] [v2:10.40.137.31:6835/797694268,v1:10.40.137.31:6837/797694268] exists,up 3b4ef426-e56b-47ea-a25f-5a281f6b0e4d
osd.18 up   in  weight 1 up_from 34098 up_thru 48145 down_at 34077 last_clean_interval [30215,34074) [v2:10.40.136.30:6805/736025612,v1:10.40.136.30:6808/736025612] [v2:10.40.137.30:6806/736025612,v1:10.40.137.30:6808/736025612] exists,up 2b53548a-d282-40ca-a1af-ffa1c8c5549b
osd.19 up   in  weight 0.599991 up_from 33984 up_thru 48898 down_at 33970 last_clean_interval [27907,33965) [v2:10.40.136.19:6835/955730844,v1:10.40.136.19:6837/955730844] [v2:10.40.137.19:6835/955730844,v1:10.40.137.19:6837/955730844] exists,up 131b4f47-0a7e-4fc3-a1ec-3fd6b0749ad7
osd.20 up   in  weight 1 up_from 34064 up_thru 48913 down_at 34012 last_clean_interval [30827,34008) [v2:10.40.136.31:6816/1081193690,v1:10.40.136.31:6817/1081193690] [v2:10.40.137.31:6816/1081193690,v1:10.40.137.31:6817/1081193690] exists,up ee19dac1-71d6-4e89-a75d-67ff61a9a873
osd.21 up   in  weight 1 up_from 34099 up_thru 48487 down_at 34077 last_clean_interval [30904,34074) [v2:10.40.136.30:6800/2151805991,v1:10.40.136.30:6803/2151805991] [v2:10.40.137.30:6800/2151805991,v1:10.40.137.30:6801/2151805991] exists,up 557353a4-d41c-427a-9602-b5df1c05f27b
osd.22 up   in  weight 1 up_from 33984 up_thru 49024 down_at 33971 last_clean_interval [30153,33965) [v2:10.40.136.19:6802/1448300047,v1:10.40.136.19:6803/1448300047] [v2:10.40.137.19:6802/1448300047,v1:10.40.137.19:6803/1448300047] exists,up 98fbdb05-a31e-4f65-a841-46a9dcc4acab
osd.23 up   in  weight 1 up_from 34064 up_thru 49020 down_at 34014 last_clean_interval [29790,34008) [v2:10.40.136.31:6801/2069623313,v1:10.40.136.31:6802/2069623313] [v2:10.40.137.31:6800/2069623313,v1:10.40.137.31:6801/2069623313] exists,up 133f7a8d-20d3-43d8-93fc-a41ea9c93a8b
osd.24 up   in  weight 1 up_from 34069 up_thru 48985 down_at 34013 last_clean_interval [33937,34010) [v2:10.40.136.31:6851/111935890,v1:10.40.136.31:6853/111935890] [v2:10.40.137.31:6851/111935890,v1:10.40.137.31:6853/111935890] exists,up 13ba14e8-ffa8-4328-a257-108dc8feccfd
osd.25 up   in  weight 1 up_from 33994 up_thru 48980 down_at 33969 last_clean_interval [33194,33965) [v2:10.40.136.19:6816/1578177584,v1:10.40.136.19:6818/1578177584] [v2:10.40.137.19:6816/1578177584,v1:10.40.137.19:6818/1578177584] exists,up 4eba66a4-15e5-49a2-8454-20340926cae1
osd.26 up   in  weight 1 up_from 44684 up_thru 48928 down_at 44682 last_clean_interval [36381,44681) [v2:10.40.136.30:6856/2667036643,v1:10.40.136.30:6857/2667036643] [v2:10.40.137.30:6856/2667036643,v1:10.40.137.30:6857/2667036643] exists,up 9ede428c-6c86-4955-bc6f-6f906364b7f5
osd.27 up   in  weight 1 up_from 34067 up_thru 48990 down_at 34011 last_clean_interval [30647,34008) [v2:10.40.136.31:6804/3247864981,v1:10.40.136.31:6806/3247864981] [v2:10.40.137.31:6804/3247864981,v1:10.40.137.31:6806/3247864981] exists,up 7967a10d-e3e9-4b90-9cbc-ed8e1fb34127
osd.28 up   in  weight 1 up_from 34002 up_thru 48976 down_at 33966 last_clean_interval [33568,33965) [v2:10.40.136.19:6848/1211909359,v1:10.40.136.19:6849/1211909359] [v2:10.40.137.19:6848/1211909359,v1:10.40.137.19:6849/1211909359] exists,up 5a18ef00-426f-4149-a8ac-6b83bc0057df
osd.29 up   in  weight 1 up_from 34106 up_thru 48940 down_at 34078 last_clean_interval [27917,34074) [v2:10.40.136.30:6821/191184276,v1:10.40.136.30:6823/191184276] [v2:10.40.137.30:6821/191184276,v1:10.40.137.30:6823/191184276] exists,up aa5d15ed-901b-4795-9891-c6806b6f8920
osd.30 up   in  weight 1 up_from 36374 up_thru 48962 down_at 36368 last_clean_interval [34067,36366) [v2:10.40.136.31:6842/2131009470,v1:10.40.136.31:6843/2131009470] [v2:10.40.137.31:6842/2131009470,v1:10.40.137.31:6843/2131009470] exists,up 610dcf6d-7651-497f-b279-a6d254a18c6b
osd.31 up   in  weight 1 up_from 33997 up_thru 48962 down_at 33966 last_clean_interval [30730,33965) [v2:10.40.136.19:6824/3965926459,v1:10.40.136.19:6825/3965926459] [v2:10.40.137.19:6824/3965926459,v1:10.40.137.19:6825/3965926459] exists,up 96711ba4-9b14-42ba-bfef-5b5e41c787ec
osd.32 up   in  weight 1 up_from 44704 up_thru 48932 down_at 44697 last_clean_interval [34114,44696) [v2:10.40.136.30:6860/98181250,v1:10.40.136.30:6861/98181250] [v2:10.40.137.30:6860/98181250,v1:10.40.137.30:6861/98181250] exists,up 8602ea9d-859d-4bfa-9a5a-d48626f3e2ce
osd.33 up   in  weight 1 up_from 34068 up_thru 48992 down_at 34010 last_clean_interval [33882,34008) [v2:10.40.136.31:6860/75616253,v1:10.40.136.31:6861/75616253] [v2:10.40.137.31:6860/75616253,v1:10.40.137.31:6861/75616253] exists,up b14294bc-f6fd-431c-82b4-504787d155c3
osd.34 up   in  weight 1 up_from 33998 up_thru 48994 down_at 33966 last_clean_interval [33268,33965) [v2:10.40.136.19:6853/4213686847,v1:10.40.136.19:6855/4213686847] [v2:10.40.137.19:6853/4213686847,v1:10.40.137.19:6855/4213686847] exists,up 20b1fc79-1699-4d0c-aa08-b20980830d5b
osd.35 up   in  weight 1 up_from 48980 up_thru 49000 down_at 48976 last_clean_interval [44535,48974) [v2:10.40.136.30:6828/796973594,v1:10.40.136.30:6829/796973594] [v2:10.40.137.30:6828/796973594,v1:10.40.137.30:6829/796973594] exists,up b315a268-f54e-44c7-bdf9-11a001c1c3c7
osd.36 up   in  weight 1 up_from 34068 up_thru 48976 down_at 34009 last_clean_interval [33745,34008) [v2:10.40.136.31:6812/357593772,v1:10.40.136.31:6813/357593772] [v2:10.40.137.31:6812/357593772,v1:10.40.137.31:6813/357593772] exists,up f63274bf-1ded-47bc-bf82-6ceddf20b97a
osd.37 up   in  weight 1 up_from 44846 up_thru 48999 down_at 44844 last_clean_interval [34000,44843) [v2:10.40.136.19:6834/2202899341,v1:10.40.136.19:6836/2202899341] [v2:10.40.137.19:6834/2202899341,v1:10.40.137.19:6836/2202899341] exists,up 8455b9d3-6e96-491a-89e6-75f2a9c3f6ba
osd.38 up   in  weight 1 up_from 48962 up_thru 48978 down_at 48958 last_clean_interval [36400,48957) [v2:10.40.136.30:6838/2897105189,v1:10.40.136.30:6840/2897105189] [v2:10.40.137.30:6838/2897105189,v1:10.40.137.30:6840/2897105189] exists,up 6171592f-727e-44aa-84ce-755010f7a6e3
osd.39 up   in  weight 1 up_from 34069 up_thru 48976 down_at 34012 last_clean_interval [28274,34008) [v2:10.40.136.31:6820/425484265,v1:10.40.136.31:6821/425484265] [v2:10.40.137.31:6820/425484265,v1:10.40.137.31:6821/425484265] exists,up 37450001-e084-4ad9-9aa1-0e0f8be99b57
osd.40 up   in  weight 1 up_from 33991 up_thru 48976 down_at 33969 last_clean_interval [28664,33965) [v2:10.40.136.19:6828/1352966624,v1:10.40.136.19:6829/1352966624] [v2:10.40.137.19:6828/1352966624,v1:10.40.137.19:6829/1352966624] exists,up 23c854cc-a235-4fd9-8b60-b09820a8c148
osd.41 up   in  weight 1 up_from 44685 up_thru 48952 down_at 44680 last_clean_interval [34106,44678) [v2:10.40.136.30:6850/125657673,v1:10.40.136.30:6852/125657673] [v2:10.40.137.30:6850/125657673,v1:10.40.137.30:6853/125657673] exists,up 7e0b1ad7-c6c0-421f-b1a9-d926e572c9d1
osd.43 up   in  weight 1 up_from 48928 up_thru 48997 down_at 48926 last_clean_interval [44832,48925) [v2:10.40.136.19:6860/2439838026,v1:10.40.136.19:6861/2439838026] [v2:10.40.137.19:6860/2439838026,v1:10.40.137.19:6861/2439838026] exists,up a9c558e1-b5ed-432f-b927-ac13ee201af9
osd.44 up   in  weight 1 up_from 34109 up_thru 44864 down_at 34076 last_clean_interval [33890,34074) [v2:10.40.136.30:6846/3516264091,v1:10.40.136.30:6847/3516264091] [v2:10.40.137.30:6846/3516264091,v1:10.40.137.30:6847/3516264091] exists,up a78ff880-01e7-4fb5-8835-838272340946
osd.45 up   in  weight 1 up_from 44396 up_thru 48980 down_at 44393 last_clean_interval [34068,44392) [v2:10.40.136.31:6850/3605419931,v1:10.40.136.31:6852/3605419931] [v2:10.40.137.31:6850/3605419931,v1:10.40.137.31:6852/3605419931] exists,up 3bcadd7c-ab5e-416c-a084-b9653be2704b
osd.46 up   in  weight 1 up_from 34109 up_thru 48926 down_at 34078 last_clean_interval [28315,34074) [v2:10.40.136.30:6844/1547041860,v1:10.40.136.30:6845/1547041860] [v2:10.40.137.30:6844/1547041860,v1:10.40.137.30:6845/1547041860] exists,up 48eadccf-e669-4a38-afea-724ec30ad2dc
osd.47 up   in  weight 1 up_from 34067 up_thru 48980 down_at 34016 last_clean_interval [27926,34009) [v2:10.40.136.31:6832/1440913495,v1:10.40.136.31:6833/1440913495] [v2:10.40.137.31:6832/1440913495,v1:10.40.137.31:6833/1440913495] exists,up 60f4b16e-24a0-4c6d-aecd-c5af45925d1e
osd.48 up   in  weight 1 up_from 33989 up_thru 48996 down_at 33970 last_clean_interval [27911,33965) [v2:10.40.136.19:6831/678817177,v1:10.40.136.19:6833/678817177] [v2:10.40.137.19:6831/678817177,v1:10.40.137.19:6833/678817177] exists,up ce49568e-3f26-451e-a370-ed6ab8c717a9
pg_upmap_items 6.3 [11,14]
pg_upmap_items 6.5 [9,15]
pg_upmap_items 6.7 [5,8]
pg_upmap_items 6.a [19,22]
pg_upmap_items 6.1a [19,6]
pg_upmap_items 6.20 [9,15]
pg_upmap_items 6.25 [9,15]
pg_upmap_items 6.2c [19,6,5,8]
pg_upmap_items 6.31 [21,16]
pg_upmap_items 6.35 [19,6]
pg_upmap_items 6.39 [9,15]
pg_upmap_items 6.58 [9,15]
pg_upmap_items 6.5e [5,8]
pg_upmap_items 6.69 [19,6]
pg_upmap_items 6.6d [9,15]
pg_upmap_items 6.7c [19,6]
pg_upmap_items 7.2 [26,35]
pg_upmap_items 7.b [36,39]
pg_upmap_items 7.c [44,41]
pg_upmap_items 7.10 [36,39]
pg_upmap_items 7.11 [46,32]
pg_upmap_items 7.12 [26,32]
pg_upmap_items 7.14 [26,41]
pg_upmap_items 7.19 [36,24]
pg_upmap_items 7.1e [24,45]
pg_upmap_items 7.1f [38,32]
pg_upmap_items 7.24 [48,31]
pg_upmap_items 7.2c [36,27]
pg_upmap_items 7.2d [46,35]
pg_upmap_items 7.34 [36,27]
pg_upmap_items 7.37 [24,47]
pg_upmap_items 7.3b [38,41]
pg_upmap_items 7.3f [46,41]
pg_temp 7.fe [44,48,39]
blacklist 10.40.136.31:0/1543267051 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6865/1735026091 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6864/1735026091 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:0/1120460670 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6865/2542377202 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/3055030675 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/376350552 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:6864/2542377202 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/2796177981 expires 2021-04-09T10:07:44.091402+0000

Actions #15

Updated by Sage Weil about 3 years ago

  • Priority changed from High to Urgent
Actions #16

Updated by Sage Weil about 3 years ago

Can you try setting log_to_file=true and debug_ms=0/20, and reproducing the crash? If you can, send us the resulting log file that includes a crash.

Actions #17

Updated by Adrian Dabuleanu about 3 years ago

Had another 2 OSD crashes today. After applying the debug settings that you provided, here are the OSD logs (crash-osd-34.log and crash-osd-37.log): https://drive.google.com/drive/folders/1NiGiujjKw-wIXOnne2dERH7Q8zoXQBQC . I could not attach them to this ticket because they are around 10 MB each.

Please let me know if you need other logs.

Thanks,
Adrian

Actions #18

Updated by Sage Weil about 3 years ago

  • Status changed from Need More Info to Fix Under Review

As before, it looks like there are 2 problems here:

1. This crash itself. I think I see the locking bug.
2. The peer is in a reconnect loop. In my case I saw this because of the nonce issue; I'm not sure why you are seeing it. :/ Fixing (1) will prevent the crash but won't address the underlying issue...
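
To make problem (2) concrete, here is a toy model (assumed names, not ProtocolV2's real logic) of how colliding advertised addrs keep tripping the handle_existing_connection path from the backtrace: every new ident matches, and stops, the previous session, so both sides reconnect forever.

#include <cstdio>
#include <map>
#include <string>

// addr ("ip:port/nonce") -> session id (toy model of the conn registry)
std::map<std::string, int> conns_by_addr;

void accept_ident(const std::string& peer_addr, int session) {
  auto it = conns_by_addr.find(peer_addr);
  if (it != conns_by_addr.end()) {
    // "existing connection" path: the old session is stopped before the
    // new one is adopted; with colliding addrs this fires on every ident.
    std::printf("stopping session %d for %s\n", it->second, peer_addr.c_str());
    conns_by_addr.erase(it);
  }
  conns_by_addr[peer_addr] = session;
}

int main() {
  // Two daemon instances that ended up advertising the same addr/nonce
  // (fixed container pid 7) keep displacing each other's session: a busy
  // reconnect loop rather than a stable connection.
  accept_ident("10.0.0.1:6800/7", 1);  // first instance connects
  accept_ident("10.0.0.1:6800/7", 2);  // second instance: stops session 1
  accept_ident("10.0.0.1:6800/7", 3);  // first reconnects: stops session 2 ...
}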

Actions #19

Updated by Sage Weil about 3 years ago

  • Backport set to pacific,octopus
  • Pull request ID set to 40912
Actions #20

Updated by Adrian Dabuleanu about 3 years ago

After observing the crashes, we found that a possible cause is related to RAM pressure: around the time of the crashes, some OSDs are killed due to OOM:

$ dmesg -T
[Wed Apr 21 09:09:26 2021] ceph-osd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=1000
[Wed Apr 21 09:09:26 2021] CPU: 34 PID: 1983779 Comm: ceph-osd Kdump: loaded Tainted: G          I      --------- -  - 4.18.0-240.15.1.el8_3.x86_64 #1
[Wed Apr 21 09:09:26 2021] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS 1.5.6 10/17/2018
[Wed Apr 21 09:09:26 2021] Call Trace:
[Wed Apr 21 09:09:26 2021]  dump_stack+0x5c/0x80
[Wed Apr 21 09:09:26 2021]  dump_header+0x51/0x308
[Wed Apr 21 09:09:26 2021]  ? try_to_free_pages+0xe8/0x1c0
[Wed Apr 21 09:09:26 2021]  oom_kill_process.cold.28+0xb/0x10
[Wed Apr 21 09:09:26 2021]  out_of_memory+0x1c1/0x4b0
[Wed Apr 21 09:09:26 2021]  __alloc_pages_slowpath+0xc24/0xd40
[Wed Apr 21 09:09:26 2021]  __alloc_pages_nodemask+0x245/0x280
[Wed Apr 21 09:09:26 2021]  filemap_fault+0x3b8/0x840
[Wed Apr 21 09:09:26 2021]  ? hrtimer_try_to_cancel+0x25/0x100
[Wed Apr 21 09:09:26 2021]  ? _cond_resched+0x15/0x30
[Wed Apr 21 09:09:26 2021]  __xfs_filemap_fault+0x6d/0x200 [xfs]
[Wed Apr 21 09:09:26 2021]  __do_fault+0x38/0xc0
[Wed Apr 21 09:09:26 2021]  do_fault+0x191/0x3c0
[Wed Apr 21 09:09:26 2021]  __handle_mm_fault+0x3e6/0x7c0
[Wed Apr 21 09:09:26 2021]  handle_mm_fault+0xc2/0x1d0
[Wed Apr 21 09:09:26 2021]  __do_page_fault+0x21b/0x4d0
[Wed Apr 21 09:09:26 2021]  do_page_fault+0x32/0x110
[Wed Apr 21 09:09:26 2021]  ? page_fault+0x8/0x30
[Wed Apr 21 09:09:26 2021]  page_fault+0x1e/0x30
[Wed Apr 21 09:09:26 2021] RIP: 0033:0x7f4b30a5564a
[Wed Apr 21 09:09:26 2021] Code: Bad RIP value.
[Wed Apr 21 09:09:26 2021] RSP: 002b:00007f4b2aa63570 EFLAGS: 00010246
[Wed Apr 21 09:09:26 2021] RAX: ffffffffffffff92 RBX: 0000556b5462a1e8 RCX: 00007f4b30a5564a
[Wed Apr 21 09:09:26 2021] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 0000556b5462a214
[Wed Apr 21 09:09:26 2021] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[Wed Apr 21 09:09:26 2021] R10: 00007f4b2aa63680 R11: 0000000000000246 R12: 0000556b5462a1c0
[Wed Apr 21 09:09:26 2021] R13: 0000556b5462a214 R14: 00007f4b2aa63680 R15: 0000000000000000
[Wed Apr 21 09:09:26 2021] Mem-Info:
[Wed Apr 21 09:09:26 2021] active_anon:48316112 inactive_anon:8718 isolated_anon:0
                            active_file:64 inactive_file:990 isolated_file:41
                            unevictable:0 dirty:0 writeback:0 unstable:0
                            slab_reclaimable:94846 slab_unreclaimable:310927
                            mapped:6771 shmem:9771 pagetables:106620 bounce:0
                            free:105307 free_pcp:57 free_cma:0
[Wed Apr 21 09:09:26 2021] Node 0 active_anon:95578536kB inactive_anon:1356kB active_file:0kB inactive_file:652kB unevictable:0kB isolated(anon):0kB isolated(file):28kB mapped:0kB dirty:0kB writeback:0kB shmem:2636kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 51200kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Apr 21 09:09:26 2021] Node 1 active_anon:97685912kB inactive_anon:33516kB active_file:256kB inactive_file:3308kB unevictable:0kB isolated(anon):0kB isolated(file):136kB mapped:27084kB dirty:0kB writeback:0kB shmem:36448kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Apr 21 09:09:26 2021] Node 0 DMA free:15552kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15552kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 1362 87617 87617 87617
[Wed Apr 21 09:09:26 2021] Node 0 DMA32 free:345236kB min:684kB low:2076kB high:3468kB active_anon:1056416kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1755968kB managed:1427384kB mlocked:0kB kernel_stack:208kB pagetables:180kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 86254 86254 86254
[Wed Apr 21 09:09:26 2021] Node 0 Normal free:16260kB min:43504kB low:131828kB high:220152kB active_anon:94522120kB inactive_anon:1356kB active_file:0kB inactive_file:652kB unevictable:0kB writepending:0kB present:97517568kB managed:88324448kB mlocked:0kB kernel_stack:20376kB pagetables:188656kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 0 0 0
[Wed Apr 21 09:09:26 2021] Node 1 Normal free:44180kB min:45908kB low:139108kB high:232308kB active_anon:97685912kB inactive_anon:33516kB active_file:256kB inactive_file:3308kB unevictable:0kB writepending:0kB present:100663296kB managed:93208884kB mlocked:0kB kernel_stack:24936kB pagetables:237644kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 0 0 0
[Wed Apr 21 09:09:26 2021] Node 0 DMA: 0*4kB 2*8kB (U) 3*16kB (U) 2*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15552kB
[Wed Apr 21 09:09:26 2021] Node 0 DMA32: 414*4kB (MEH) 711*8kB (UMEH) 792*16kB (UMH) 507*32kB (UMEH) 346*64kB (UMEH) 166*128kB (UMH) 46*256kB (UMEH) 14*512kB (UME) 9*1024kB (UM) 4*2048kB (UE) 56*4096kB (UM) = 345360kB
[Wed Apr 21 09:09:26 2021] Node 0 Normal: 1479*4kB (UMEH) 188*8kB (UME) 105*16kB (UME) 200*32kB (UEH) 4*64kB (H) 2*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16268kB
[Wed Apr 21 09:09:26 2021] Node 1 Normal: 2452*4kB (UMH) 1434*8kB (UMEH) 1157*16kB (UMEH) 220*32kB (UEH) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46896kB
[Wed Apr 21 09:09:26 2021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Apr 21 09:09:26 2021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Apr 21 09:09:26 2021] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Apr 21 09:09:26 2021] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Apr 21 09:09:26 2021] 11213 total pagecache pages
[Wed Apr 21 09:09:26 2021] 0 pages in swap cache
[Wed Apr 21 09:09:26 2021] Swap cache stats: add 0, delete 0, find 0/0
[Wed Apr 21 09:09:26 2021] Free swap  = 0kB
[Wed Apr 21 09:09:26 2021] Total swap = 0kB
[Wed Apr 21 09:09:26 2021] 49988207 pages RAM
[Wed Apr 21 09:09:26 2021] 0 pages HighMem/MovableOnly
[Wed Apr 21 09:09:26 2021] 4244140 pages reserved
[Wed Apr 21 09:09:26 2021] 0 pages hwpoisoned
[Wed Apr 21 09:09:26 2021] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Apr 21 09:09:26 2021] [ 1584]     0  1584    37101    12093   344064        0             0 systemd-journal
[Wed Apr 21 09:09:26 2021] [ 1621]     0  1621    29538      608   225280        0         -1000 systemd-udevd
[Wed Apr 21 09:09:26 2021] [ 2136]     0  2136    41174      196   188416        0         -1000 auditd
[Wed Apr 21 09:09:26 2021] [ 2138]     0  2138    12130       90   139264        0             0 sedispatch
[Wed Apr 21 09:09:26 2021] [ 2166]   998  2166   508458     1763   393216        0             0 polkitd
[Wed Apr 21 09:09:26 2021] [ 2169]     0  2169     4437       37    65536        0             0 mcelog
[Wed Apr 21 09:09:26 2021] [ 2170]    81  2170    19159      201   167936        0          -900 dbus-daemon
[Wed Apr 21 09:09:26 2021] [ 2173]     0  2173    53699      507   430080        0             0 sssd
[Wed Apr 21 09:09:26 2021] [ 2177]   997  2177     4928       39    69632        0             0 lsmd
[Wed Apr 21 09:09:26 2021] [ 2179]     0  2179    31315      234   143360        0             0 irqbalance
[Wed Apr 21 09:09:26 2021] [ 2183]     0  2183    12759      422   139264        0             0 smartd
[Wed Apr 21 09:09:26 2021] [ 2184]   989  2184    95327      210   233472        0             0 rngd
[Wed Apr 21 09:09:26 2021] [ 2214]   990  2214    32228      134   159744        0             0 chronyd
[Wed Apr 21 09:09:26 2021] [ 2252]     0  2252    55312      640   430080        0             0 sssd_be
[Wed Apr 21 09:09:26 2021] [ 2283]     0  2283    56216      418   466944        0             0 sssd_nss
[Wed Apr 21 09:09:26 2021] [ 2303]     0  2303    20976      257   196608        0             0 systemd-logind
[Wed Apr 21 09:09:26 2021] [ 2939]     0  2939    23072      224   192512        0         -1000 sshd
[Wed Apr 21 09:09:26 2021] [ 2940]     0  2940   106588     3764   434176        0             0 tuned
[Wed Apr 21 09:09:26 2021] [ 2941]     0  2941  1504994    23199  1138688        0          -999 kubelet
[Wed Apr 21 09:09:26 2021] [ 2942]     0  2942    66858     5534   282624        0             0 rsyslogd
[Wed Apr 21 09:09:26 2021] [ 2952]     0  2952  1372921    10680   888832        0          -999 containerd
[Wed Apr 21 09:09:26 2021] [ 2955]     0  2955     9232      221   106496        0             0 crond
[Wed Apr 21 09:09:26 2021] [ 2956]     0  2956    10994       51   118784        0             0 atd
[Wed Apr 21 09:09:26 2021] [ 2998]     0  2998     3408       28    61440        0             0 agetty
[Wed Apr 21 09:09:26 2021] [ 3064]     0  3064  1432928    26985  1171456        0          -999 dockerd
[Wed Apr 21 09:09:26 2021] [ 6602]     0  6602    28280      327    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6603]     0  6603    27992      320    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6604]     0  6604    27992      260    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6605]     0  6605    27992      292    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6678]     0  6678      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6680]     0  6680      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6688]     0  6688      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6699]     0  6699      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6767]     0  6767    27992      279    77824        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6802]     0  6802    27992      282    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6821]     0  6821      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6830]     0  6830    28008      350    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6843]     0  6843      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 6867]     0  6867    28344      340    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6886]     0  6886    27992      336    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6892]     0  6892    27992      290    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6912]     0  6912    27992      318    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6947]     0  6947    27992      259    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6995]     0  6995    28360      353    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7007]     0  7007    28344      336    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7044]     0  7044    28344      313    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7057]     0  7057    28360      308    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7086]     0  7086      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7131]     0  7131    27992      344    77824        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7132]     0  7132    27944      269    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7133]     0  7133    27928      262    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7156]     0  7156    28344      265    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7203]     0  7203    28296      324    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7214]     0  7214      242        1    24576        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7219]     0  7219      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7225]     0  7225      242        1    32768        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7266]     0  7266      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7267]     0  7267      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7293]     0  7293      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7294]     0  7294      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7295]     0  7295      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7312]     0  7312      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7321]     0  7321      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7322]     0  7322      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7343]     0  7343      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7345]     0  7345      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7346]     0  7346      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 7455]     0  7455    28360      338    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7458]     0  7458    28008      269    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7600]     0  7600    28008      333    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7635]     0  7635   178751      758   110592        0          1000 csi-node-driver
[Wed Apr 21 09:09:26 2021] [ 7649]     0  7649   178751      734   114688        0          1000 csi-node-driver
[Wed Apr 21 09:09:26 2021] [ 7656]     0  7656   187811     3370   212992        0          -999 kube-proxy
[Wed Apr 21 09:09:26 2021] [ 7768]     0  7768    28344      269    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7801]     0  7801    28344      348    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7866]     0  7866  1090469     2576   757760        0          1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 7874]     0  7874  1071844     2649   745472        0          1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8349]     0  8349    27928      301    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8449]     0  8449    28344      313    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8561]     0  8561   850073     2535   643072        0          1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8599]     0  8599   831831     2370   630784        0          1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8942]     0  8942    27992      343    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8982]     0  8982    27992      343    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8997] 65534  8997      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 9011]     0  9011      242        1    28672        0          -998 pause
[Wed Apr 21 09:09:26 2021] [ 9055]     0  9055    28360      326    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 9094] 65534  9094   181328     2714   180224        0          1000 node_exporter
[Wed Apr 21 09:09:26 2021] [10526]     0 10526    27992      307    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [10593]     0 10593   979402     2553   561152        0          -997 flanneld
[Wed Apr 21 09:09:26 2021] [10723]     0 10723    28344      780    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [10812]   167 10812  4440418  4029129 34652160        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [10945]     0 10945    28344      781    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11034]     0 11034    28344      727    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11052]   167 11052  3212853  2558465 24850432        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11139]   167 11139  4439593  4079185 34639872        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11227]     0 11227    28344      749    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11278]     0 11278    28344      709    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11305]   167 11305  2546593  2171110 19484672        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11364]   167 11364   434890   241374  2949120        0          1000 ceph-mon
[Wed Apr 21 09:09:26 2021] [11537]     0 11537    27944      315    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11573]     0 11573    11140     1527   135168        0          1000 ceph-crash
[Wed Apr 21 09:09:26 2021] [11613]     0 11613    28344      680    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11633]   167 11633  4295048  3911305 33476608        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [1966072]     0 1966072    28344      730    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [1966092]   167 1966092  2536442  1917360 19382272        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [1983663]     0 1983663    28344      595    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [1983692]   167 1983692  2349325  1870377 17883136        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2003377]     0 2003377    28344      761    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2003396]   167 2003396  2422957  1897019 18501632        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2004557]     0 2004557    28344      670    94208        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2004583]   167 2004583  2823392  2399948 21696512        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2009557]     0 2009557    28344      625    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2009578]   167 2009578  2466486  2022868 18841600        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2010358]     0 2010358    28344      625    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2010378]   167 2010378  3031970  2597166 23359488        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3265497]     0 3265497    28344      505    86016        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3265519]   167 3265519  4062035  3793136 31653888        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3266802]     0 3266802    28344      687    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3266822]   167 3266822  4497716  4230869 35123200        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3270975]     0 3270975    28344      807    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3270994]   167 3270994  3967122  3729950 30855168        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3274123]     0 3274123    28344      657    90112        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3274144]   167 3274144  4494897  4206707 35098624        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3755838]     0 3755838    28344      443    81920        0          -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3755859]   167 3755859  2725350  2521164 20897792        0          1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [112604]     0 112604   123433     1400   159744        0          -998 runc
[Wed Apr 21 09:09:26 2021] [112605]     0 112605     5978     1275    73728        0          -998 runc
[Wed Apr 21 09:09:26 2021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=d87a901cec4c109cd26e6832847d1a22b4e61f4b7563c93c2db0294a5f0ba81e,mems_allowed=0-1,global_oom,task_memcg=/kubepods/besteffort/pod9883fed2-c3d8-47d5-9f69-2e6b7176bc13/dde4e8592a3ee00e7e1d523095d8d1939593d01f6ca5fc5b921588f5a7f5808c,task=ceph-osd,pid=3266822,uid=167
[Wed Apr 21 09:09:26 2021] Out of memory: Killed process 3266822 (ceph-osd) total-vm:17990864kB, anon-rss:16923476kB, file-rss:0kB, shmem-rss:0kB, UID:167
[Wed Apr 21 09:09:28 2021] oom_reaper: reaped process 3266822 (ceph-osd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Wed Apr 21 09:09:32 2021] iptables[113272]: segfault at 88 ip 00007fb815b80e47 sp 00007ffd77560418 error 4 in libnftnl.so.11.3.0[7fb815b7c000+16000]
[Wed Apr 21 09:09:32 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:20:02 2021] iptables[135288]: segfault at 88 ip 00007f5b7bddee47 sp 00007ffc20ab5188 error 4 in libnftnl.so.11.3.0[7f5b7bdda000+16000]
[Wed Apr 21 09:20:02 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:46:51 2021] iptables[191860]: segfault at 88 ip 00007f8860230e47 sp 00007fffc4f15388 error 4 in libnftnl.so.11.3.0[7f886022c000+16000]
[Wed Apr 21 09:46:51 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:48:12 2021] iptables[194703]: segfault at 88 ip 00007fb5fd607e47 sp 00007ffff6459678 error 4 in libnftnl.so.11.3.0[7fb5fd603000+16000]
[Wed Apr 21 09:48:12 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:11:36 2021] iptables[243752]: segfault at 88 ip 00007f90505a3e47 sp 00007ffd1f4a9f38 error 4 in libnftnl.so.11.3.0[7f905059f000+16000]
[Wed Apr 21 10:11:36 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:24:18 2021] iptables[270483]: segfault at 88 ip 00007f378dd25e47 sp 00007fffb363c858 error 4 in libnftnl.so.11.3.0[7f378dd21000+16000]
[Wed Apr 21 10:24:18 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:28:24 2021] IPv6: ADDRCONF(NETDEV_UP): veth3f1f1570: link is not ready
[Wed Apr 21 10:28:24 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth3f1f1570: link becomes ready
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered blocking state
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 10:28:24 2021] device veth3f1f1570 entered promiscuous mode
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered blocking state
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered forwarding state
[Wed Apr 21 10:28:26 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 10:28:26 2021] device veth3f1f1570 left promiscuous mode
[Wed Apr 21 10:28:26 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 11:25:59 2021] iptables[400663]: segfault at 88 ip 00007fce8ba41e47 sp 00007ffca2b13c08 error 4 in libnftnl.so.11.3.0[7fce8ba3d000+16000]
[Wed Apr 21 11:25:59 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74

Could this be the underlying issue that you are referring to?

Thanks,
Adrian
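
For context: the dmesg output above ends with the kernel's global OOM killer reclaiming a ceph-osd that had grown to roughly 16 GiB of resident memory, which is a plausible trigger for the connection resets around the time of the segfault. One way to check and bound per-OSD memory use is the osd_memory_target option (a minimal sketch; osd.0 and the 4 GiB value are placeholders, so substitute your own daemon IDs and a size that fits the node):

# Inspect the current per-OSD memory target, in bytes (osd.0 is a placeholder).
ceph config get osd.0 osd_memory_target

# Lower the target for all OSDs, e.g. to 4 GiB.
ceph config set osd osd_memory_target 4294967296

Note that osd_memory_target is a best-effort target used for cache trimming, not a hard cap, so a container memory limit set close to it can still be exceeded under recovery load.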

Actions #21

Updated by Kefu Chai about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #22

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50482: octopus: segv in AsyncConnection::_stop() added
Actions #23

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50483: pacific: segv in AsyncConnection::_stop() added
Actions #24

Updated by Adrian Dabuleanu almost 3 years ago

Hi,

Any news on when this bugfix will be available in a Ceph release?

Thanks,
Adrian

Actions #25

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52176: crash: std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnecti added
Actions #26

Updated by alexandre derumier over 2 years ago

Hi,
I had 2 crashes today on pacific 16.2.6 (I thought this was fixed in that version? Or is it another bug?)

{
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(0x14140) [0x7f21a34f6140]",
"(AsyncMessenger::unregister_conn(boost::intrusive_ptr<AsyncConnection> const&)+0x70) [0x55f9fb9ebdf0]",
"(AsyncConnection::_stop()+0x5a) [0x55f9fb9e466a]",
"(ProtocolV2::stop()+0x8d) [0x55f9fba0fe2d]",
"(ProtocolV2::_fault()+0x1ab) [0x55f9fba1010b]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x55f9fba10a71]",
"(AsyncConnection::process()+0x511) [0x55f9fb9e8321]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x141) [0x55f9fb8195f1]",
"/usr/bin/ceph-osd(+0x13da062) [0x55f9fb81f062]",
"/lib/x86_64-linux-gnu/libstdc
+.so.6(+0xceed0) [0x7f21a3379ed0]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f21a34eaea7]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-09-30T17:47:12.895238Z_57fa9a38-72fc-48d9-bef0-db850a52e848",
"entity_name": "osd.4",
"os_id": "11",
"os_name": "Debian GNU/Linux 11 (bullseye)",
"os_version": "11 (bullseye)",
"os_version_id": "11",
"process_name": "ceph-osd",
"stack_sig": "15a9fc1118d0f904bb1aa31fd4ea165498353da0ef33f672252c702653a09b72",
"timestamp": "2021-09-30T17:47:12.895238Z",
"utsname_hostname": "mindceph1-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-8-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.46-4 (2021-08-03)"
}
{
"archived": "2021-10-01 13:42:01.114671",
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(0x14140) [0x7f1655d42140]",
"(AsyncMessenger::unregister_conn(boost::intrusive_ptr<AsyncConnection> const&)+0x70) [0x55e27c310df0]",
"(AsyncConnection::_stop()+0x5a) [0x55e27c30966a]",
"(ProtocolV2::stop()+0x8d) [0x55e27c334e2d]",
"(ProtocolV2::_fault()+0x1ab) [0x55e27c33510b]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x55e27c335a71]",
"(AsyncConnection::process()+0x511) [0x55e27c30d321]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x141) [0x55e27c13e5f1]",
"/usr/bin/ceph-osd(+0x13da062) [0x55e27c144062]",
"/lib/x86_64-linux-gnu/libstdc
+.so.6(+0xceed0) [0x7f1655bc5ed0]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f1655d36ea7]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-09-30T09:17:13.122241Z_769ca0cc-a96c-4c5a-a624-87030b22c98f",
"entity_name": "osd.4",
"os_id": "11",
"os_name": "Debian GNU/Linux 11 (bullseye)",
"os_version": "11 (bullseye)",
"os_version_id": "11",
"process_name": "ceph-osd",
"stack_sig": "15a9fc1118d0f904bb1aa31fd4ea165498353da0ef33f672252c702653a09b72",
"timestamp": "2021-09-30T09:17:13.122241Z",
"utsname_hostname": "mindceph1-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-8-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.46-4 (2021-08-03)"
}

ceph crash info 2021-09-30T09:17:13.122241Z_769ca0cc-a96c-4c5a-a624-87030b22c98f
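
Both reports above carry the same stack_sig (15a9fc1118d0f904...), so they are the same crash, faulting in AsyncMessenger::unregister_conn() while AsyncConnection::_stop() tears the connection down. When comparing reports like this, the crash module commands are the quickest route (a minimal sketch; archive only once a report has been triaged):

ceph crash ls       # list stored crash reports and their IDs
ceph crash info 2021-09-30T09:17:13.122241Z_769ca0cc-a96c-4c5a-a624-87030b22c98f
ceph crash archive-all   # acknowledge all reports so they stop raising RECENT_CRASH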
Actions #27

Updated by Radoslaw Zarzynski over 2 years ago

  • Has duplicate Bug #51527: Ceph osd crashed due to segfault added
Actions #28

Updated by Loïc Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
