Bug #49237
segv in AsyncConnection::_stop()
Status: Closed
Description
2021-02-10T04:43:28.384 INFO:journalctl@ceph.osd.0.smithi013.stdout:Feb 10 04:43:28 smithi013 conmon[45768]:
*** Caught signal (Segmentation fault) **
in thread 7fca0e015700 thread_name:msgr-worker-0
ceph version 17.0.0-681-gc1ea6241 (c1ea624123d412aff8b9d1430e36cb45fcab76b8) quincy (dev)
1: /lib64/libpthread.so.0(+0x12b20) [0x7fca12004b20]
2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x555eec63c48c]
3: (AsyncConnection::_stop()+0xab) [0x555eec63663b]
4: (ProtocolV2::stop()+0x8f) [0x555eec66171f]
5: (ProtocolV2::handle_existing_connection(boost::intrusive_ptr<AsyncConnection> const&)+0x742) [0x555eec676e62]
6: (ProtocolV2::handle_client_ident(ceph::buffer::v15_2_0::list&)+0xeef) [0x555eec6786ff]
7: (ProtocolV2::handle_frame_payload()+0x20b) [0x555eec678d0b]
8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x555eec678f90]
9: (ProtocolV2::_handle_read_frame_epilogue_main()+0x95) [0x555eec679185]
10: (ProtocolV2::_handle_read_frame_segment()+0x92) [0x555eec679232]
11: (ProtocolV2::handle_read_frame_segment(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x201) [0x555eec67a381]
12: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x3c) [0x555eec6625bc]
13: (AsyncConnection::process()+0x789) [0x555eec6396d9]
14: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x555eec489e37]
15: /usr/bin/ceph-osd(+0xe8a95c) [0x555eec48d95c]
16: /lib64/libstdc++.so.6(+0xc2ba3) [0x7fca11654ba3]
17: /lib64/libpthread.so.0(+0x814a) [0x7fca11ffa14a]
18: clone()
/a/sage-2021-02-09_22:53:38-rados:cephadm:thrash-wip-sage2-testing-2021-02-09-1332-distro-basic-smithi/5872150
/a/sage-2021-02-09_22:53:38-rados:cephadm:thrash-wip-sage2-testing-2021-02-09-1332-distro-basic-smithi/5872147
I also see a reference to this bug in #44354.
Updated by Neha Ojha about 3 years ago
/a/yuriw-2021-02-09_22:48:58-rados-wip-yuri8-testing-2021-02-08-0950-distro-basic-smithi/5872137
rados/cephadm/with-work/{distro/ubuntu_18.04 fixed-2 mode/root mon_election/classic msgr/async start tasks/rados_api_tests}
Updated by Sage Weil about 3 years ago
/a/sage-2021-02-10_23:47:44-rados:cephadm:thrash-wip-sage2-testing-2021-02-10-1604-distro-basic-smithi/5873968
/a/sage-2021-02-10_23:47:44-rados:cephadm:thrash-wip-sage2-testing-2021-02-10-1604-distro-basic-smithi/5873971
seems to correspond to the async-v2only facet, e.g.
rados:cephadm:thrash/{0-distro/centos_8.0 1-start 2-thrash 3-tasks/snaps-few-objects fixed-2 msgr/async-v2only root}
Updated by Neha Ojha about 3 years ago
similar?
rados:/thrash-old-clients/{0-size-min-size-overrides/3-size-2-min-size 1-install/nautilus-v1only backoff/peering ceph clusters/{openstack three-plus-one} d-balancer/on distro$/{ubuntu_18.04} mon_election/connectivity msgr-failures/few rados thrashers/default thrashosds-health workloads/snaps-few-objects}
2021-02-11T03:00:17.182 INFO:journalctl@ceph.osd.2.smithi093.stdout:Feb 11 03:00:16 smithi093 bash[24854]:
*** Caught signal (Segmentation fault) **
in thread 7f86a8152700 thread_name:msgr-worker-1
ceph version 17.0.0-703-gb4d9cc45 (b4d9cc45d6ff1ea5382954dece424128b478d6f7) quincy (dev)
1: /lib64/libpthread.so.0(+0x12b20) [0x7f86ac942b20]
2: (std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x2c) [0x5650e6de97cc]
3: (AsyncConnection::_stop()+0xab) [0x5650e6de397b]
4: (ProtocolV1::stop()+0x150) [0x5650e6e015f0]
5: (ProtocolV1::replace(boost::intrusive_ptr<AsyncConnection> const&, ceph_msg_connect_reply&, ceph::buffer::v15_2_0::list&)+0x157) [0x5650e6e024a7]
6: (ProtocolV1::handle_connect_message_2()+0x2936) [0x5650e6e05766]
7: (ProtocolV1::handle_connect_message_auth(char*, int)+0x148) [0x5650e6e06f88]
8: /usr/bin/ceph-osd(+0x10389bd) [0x5650e6de99bd]
9: (AsyncConnection::process()+0x789) [0x5650e6de6a19]
10: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xcb7) [0x5650e6c35d97]
11: /usr/bin/ceph-osd(+0xe888bc) [0x5650e6c398bc]
12: /lib64/libstdc++.so.6(+0xc2ba3) [0x7f86abf92ba3]
13: /lib64/libpthread.so.0(+0x814a) [0x7f86ac93814a]
14: clone()
/a/nojha-2021-02-10_18:54:18-rados:-master-distro-basic-smithi/5873606
Updated by Sage Weil about 3 years ago
- Status changed from New to Need More Info
https://github.com/ceph/ceph/pull/39482 reverts the cephadm container init change that triggered this regression.
Clearly something funny is going on, so this should be investigated more carefully before re-merging the init change...
Updated by Sage Weil about 3 years ago
- Related to Bug #49259: test_rados_api tests timeout with cephadm (plus extremely large OSD logs) added
Updated by alexandre derumier about 3 years ago
Hi, I have seen similar random OSD crashes for some months on Octopus (I'm sure I have triggered it on 15.2.4 - 15.2.8).
root@ceph5-9:~# ceph crash info 2021-02-18T07:18:15.223807Z_5bbe94fe-466b-4de8-9037-3a0872916174
{
"backtrace": [
"(()+0x12730) [0x7fd381273730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5642d711d394]",
"(AsyncConnection::_stop()+0xa7) [0x5642d71179d7]",
"(ProtocolV2::stop()+0x8b) [0x5642d713f41b]",
"(ProtocolV2::_fault()+0x6b) [0x5642d713f59b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x5642d71555e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x5642d7140114]",
"(AsyncConnection::process()+0x79c) [0x5642d711a82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x5642d6f7e91d]",
"(()+0x11f41cb) [0x5642d6f841cb]",
"(()+0xbbb2f) [0x7fd381138b2f]",
"(()+0x7fa3) [0x7fd381268fa3]",
"(clone()+0x3f) [0x7fd380e164cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-02-18T07:18:15.223807Z_5bbe94fe-466b-4de8-9037-3a0872916174",
"entity_name": "osd.14",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-02-18T07:18:15.223807Z",
"utsname_hostname": "ceph5-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}
root@ceph5-9:~# ceph crash info 2021-02-19T08:43:19.626268Z_ad9492f6-ba47-4cfc-b4c0-0e311376140e
{
"backtrace": [
"(()+0x12730) [0x7fc180fe6730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5618e04fe394]",
"(AsyncConnection::_stop()+0xa7) [0x5618e04f89d7]",
"(ProtocolV2::stop()+0x8b) [0x5618e052041b]",
"(ProtocolV2::_fault()+0x6b) [0x5618e052059b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x5618e05365e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x5618e0521114]",
"(AsyncConnection::process()+0x79c) [0x5618e04fb82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x5618e035f91d]",
"(()+0x11f41cb) [0x5618e03651cb]",
"(()+0xbbb2f) [0x7fc180eabb2f]",
"(()+0x7fa3) [0x7fc180fdbfa3]",
"(clone()+0x3f) [0x7fc180b894cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-02-19T08:43:19.626268Z_ad9492f6-ba47-4cfc-b4c0-0e311376140e",
"entity_name": "osd.60",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-02-19T08:43:19.626268Z",
"utsname_hostname": "ceph5-9",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}
root@ceph5-9:~# ceph crash info 2021-01-18T02:38:03.143317Z_dbc2f10d-26ae-4162-96da-78407c16d507
{
"backtrace": [
"(()+0x12730) [0x7f58610b1730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x55cee8fc1394]",
"(AsyncConnection::_stop()+0xa7) [0x55cee8fbb9d7]",
"(ProtocolV2::stop()+0x8b) [0x55cee8fe341b]",
"(ProtocolV2::_fault()+0x6b) [0x55cee8fe359b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x55cee8ff95e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55cee8fe4114]",
"(AsyncConnection::process()+0x79c) [0x55cee8fbe82c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x55cee8e2291d]",
"(()+0x11f41cb) [0x55cee8e281cb]",
"(()+0xbbb2f) [0x7f5860f76b2f]",
"(()+0x7fa3) [0x7f58610a6fa3]",
"(clone()+0x3f) [0x7f5860c544cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-01-18T02:38:03.143317Z_dbc2f10d-26ae-4162-96da-78407c16d507",
"entity_name": "osd.6",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-01-18T02:38:03.143317Z",
"utsname_hostname": "ceph5-2.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-6-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20)"
}
root@ceph5-9:~# ceph crash info 2021-01-10T10:45:39.605761Z_0870ac8f-5d76-4146-8f55-f412f0188944
{
"archived": "2021-01-11 09:00:44.916944",
"backtrace": [
"(()+0x12730) [0x7f7ec7f8d730]",
"(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x55c55fd18394]",
"(AsyncConnection::_stop()+0xa7) [0x55c55fd129d7]",
"(ProtocolV2::stop()+0x8b) [0x55c55fd3a41b]",
"(ProtocolV2::_fault()+0x6b) [0x55c55fd3a59b]",
"(ProtocolV2::handle_read_frame_preamble_main(std::unique_ptr<ceph::buffer::v15_2_0::ptr_node, ceph::buffer::v15_2_0::ptr_node::disposer>&&, int)+0x328) [0x55c55fd505e8]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x55c55fd3b114]",
"(AsyncConnection::process()+0x79c) [0x55c55fd1582c]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0xa2d) [0x55c55fb7991d]",
"(()+0x11f41cb) [0x55c55fb7f1cb]",
"(()+0xbbb2f) [0x7f7ec7e52b2f]",
"(()+0x7fa3) [0x7f7ec7f82fa3]",
"(clone()+0x3f) [0x7f7ec7b304cf]"
],
"ceph_version": "15.2.7",
"crash_id": "2021-01-10T10:45:39.605761Z_0870ac8f-5d76-4146-8f55-f412f0188944",
"entity_name": "osd.57",
"os_id": "10",
"os_name": "Debian GNU/Linux 10 (buster)",
"os_version": "10 (buster)",
"os_version_id": "10",
"process_name": "ceph-osd",
"stack_sig": "897fe7f6bf2184fafd5b8a29905a147cb66850db318f6e874292a278aeb615bb",
"timestamp": "2021-01-10T10:45:39.605761Z",
"utsname_hostname": "ceph5-9",
"utsname_machine": "x86_64",
"utsname_release": "4.19.0-11-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 4.19.146-1 (2020-09-17)"
}
This is on bare metal, no container, Debian 10.
Updated by Sage Weil about 3 years ago
This is reliably triggered by rados/cephadm/thrash on centos/rhel nodes (ubuntu seems fine, strangely) when --init is passed to podman. It is unclear why adding a container init process makes this bug surface in qa...
Updated by Sage Weil about 3 years ago
- Status changed from Need More Info to Fix Under Review
- Pull request ID set to 39739
Updated by Sage Weil about 3 years ago
- Status changed from Fix Under Review to Need More Info
- Pull request ID deleted (39739)
Updated by Sage Weil about 3 years ago
- Priority changed from Urgent to High
Sage Weil wrote:
This is reliably triggered by rados/cephadm/thrash on centos/rhel nodes (ubuntu seems fine, strangely) when --init is passed to podman. It is unclear why adding a container init process makes this bug surface in qa...
The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (With pid 1 the messenger falls back to a random value for the nonce, which is why running without a container init worked properly.)
I was mostly triggering busy reconnect loops when trying to reproduce, not the segv. So this msgr issue is still a real bug, but probably not one we're likely to hit easily.
Updated by Adrian Dabuleanu about 3 years ago
I have encountered the same issue on my production Ceph cluster with multiple OSDs crashing. I am running ceph 15.2.8 orchestrated by rook 1.5.4 on top of k8s 1.20.1. I have attached the debug log. Is there a workaround to get past this issue?
Updated by Adrian Dabuleanu about 3 years ago
We rebooted the physical servers two days ago and the OSDs seemed to be fine. But today they started crashing again with the same error, though at a smaller scale:
2021-04-07T12:41:57.826500Z_c1b71737-f3aa-4ff2-b2f3-6cfd7eefc006 osd.43 *
2021-04-07T12:42:58.478345Z_4b9e501b-fd28-4e34-9de9-3876d5cbfb47 osd.35 *
2021-04-07T12:43:09.249149Z_4a4eaecd-77d0-4c8e-a4cd-bcae601cff3b osd.30 *
2021-04-07T12:43:22.754376Z_d087e282-3bfc-454e-91cf-e1f876913c47 osd.26 *
2021-04-07T12:44:04.102748Z_32429126-d8a3-42b3-b715-923331ab6baa osd.38 *
2021-04-07T16:49:47.763038Z_4181a436-7875-45ad-9f79-fe087420fa92 osd.43 *
2021-04-07T16:50:43.682655Z_319b3b52-f1c3-4403-9110-bedf38af33c6 osd.45 *
2021-04-07T16:55:38.242908Z_49339a85-6f27-4086-b407-239fbd4b6989 osd.35 *
2021-04-07T17:01:01.102548Z_4f084c78-a92d-4435-a25f-393fc67fd555 osd.41 *
2021-04-07T17:01:10.771620Z_778426ca-9d81-4a3b-b836-6af3c3b639b7 osd.26 *
2021-04-07T17:02:00.399641Z_9b9ca6e5-7bb4-4307-a7a2-53435f1828ef osd.32 *
2021-04-07T17:06:09.158782Z_0d4f1891-a7df-4bbe-9508-f7a7506c660d osd.43 *
2021-04-07T17:06:56.888059Z_6d40a03b-c874-48d2-98c4-18ebfdc02137 osd.37 *
2021-04-07T19:32:16.772362Z_5d342b0b-7831-4ad2-ba04-06e44ba995af osd.43 *
2021-04-07T19:34:20.696933Z_12e57f20-91c8-451f-94ab-25a3082b1a12 osd.38 *
2021-04-07T19:35:27.978435Z_a6b61afa-4cdf-4a1f-82af-8d454c914424 osd.35 *
I want to understand what is causing this. Sage Weil, can you please give more details on this comment? I want to understand how this maps to my 3-node k8s cluster.
The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (With pid 1 the messenger falls back to a random value for the nonce, which is why running without a container init worked properly.)
Thanks,
Adrian
Updated by Sage Weil about 3 years ago
Adrian Dabuleanu wrote:
We rebooted the physical servers two days ago and the OSDs seemed to be fine. But today they started crashing again with the same error, though at a smaller scale
[...] I want to understand what is causing this. Sage Weil, can you please give more details on this comment? I want to understand how this maps to my 3-node k8s cluster.
The reason was that multiple OSDs ended up with identical addrs because the container PIDs were always 7. (With pid 1 the messenger falls back to a random value for the nonce, which is why running without a container init worked properly.)
Thanks,
Adrian
Can you share the output from 'ceph osd dump'? I'm curious if the ports are randomized or not (and whether this has the same cause as the issue I saw).
Updated by Adrian Dabuleanu about 3 years ago
Here is the output:
epoch 49053
fsid dbb096d6-d67d-4319-a41b-e113a181c414
created 2021-01-09T13:41:43.929133+0000
modified 2021-04-08T13:44:13.584097+0000
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 79
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client luminous
require_osd_release octopus
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 49035 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6980 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 25638 lfor 0/4968/4966 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 6980 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 34566 lfor 0/5026/5024 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 48906 lfor 0/48906/48904 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'hdd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 15386 lfor 0/15386/15384 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 49
osd.0 up in weight 1 up_from 33984 up_thru 48703 down_at 33971 last_clean_interval [30559,33965) [v2:10.40.136.19:6805/489468713,v1:10.40.136.19:6808/489468713] [v2:10.40.137.19:6806/489468713,v1:10.40.137.19:6808/489468713] exists,up d4df9d97-70a8-458a-9e8d-ebfb682d4265
osd.1 up in weight 1 up_from 34099 up_thru 48891 down_at 34082 last_clean_interval [30819,34078) [v2:10.40.136.30:6832/367310845,v1:10.40.136.30:6833/367310845] [v2:10.40.137.30:6832/367310845,v1:10.40.137.30:6833/367310845] exists,up 63ddfd79-7da9-4a29-aada-6e6de76c75f2
osd.2 up in weight 1 up_from 34064 up_thru 49006 down_at 34015 last_clean_interval [31075,34009) [v2:10.40.136.31:6824/2689213217,v1:10.40.136.31:6825/2689213217] [v2:10.40.137.31:6824/2689213217,v1:10.40.137.31:6825/2689213217] exists,up fbf8598b-f166-4d61-9a81-58b9e480e123
osd.3 up in weight 0.950012 up_from 33985 up_thru 48889 down_at 33966 last_clean_interval [33054,33965) [v2:10.40.136.19:6800/1657213001,v1:10.40.136.19:6801/1657213001] [v2:10.40.137.19:6800/1657213001,v1:10.40.137.19:6801/1657213001] exists,up d4960f2b-3eba-4163-8cf5-52d95cb08a9d
osd.4 up in weight 1 up_from 34098 up_thru 49023 down_at 34075 last_clean_interval [34056,34074) [v2:10.40.136.30:6801/4154683830,v1:10.40.136.30:6804/4154683830] [v2:10.40.137.30:6802/4154683830,v1:10.40.137.30:6803/4154683830] exists,up cf7854d2-f86f-47d0-9b98-47fc4b50053c
osd.5 up in weight 1 up_from 34064 up_thru 49011 down_at 34016 last_clean_interval [30843,34009) [v2:10.40.136.31:6826/348024490,v1:10.40.136.31:6827/348024490] [v2:10.40.137.31:6826/348024490,v1:10.40.137.31:6827/348024490] exists,up 0ba18c7b-9ff7-4a46-9c72-5ec62151ad97
osd.6 up in weight 1 up_from 33984 up_thru 48252 down_at 33969 last_clean_interval [29881,33965) [v2:10.40.136.19:6806/3441212294,v1:10.40.136.19:6810/3441212294] [v2:10.40.137.19:6807/3441212294,v1:10.40.137.19:6809/3441212294] exists,up e7e98f66-c126-4b62-b010-931d92b7ce6d
osd.7 up in weight 1 up_from 34098 up_thru 48906 down_at 34079 last_clean_interval [30692,34074) [v2:10.40.136.30:6802/308382208,v1:10.40.136.30:6806/308382208] [v2:10.40.137.30:6804/308382208,v1:10.40.137.30:6805/308382208] exists,up 5881d174-1a31-427c-ab24-46971ae54e7f
osd.8 up in weight 1 up_from 34064 up_thru 49012 down_at 34017 last_clean_interval [30707,34010) [v2:10.40.136.31:6848/2916991801,v1:10.40.136.31:6849/2916991801] [v2:10.40.137.31:6848/2916991801,v1:10.40.137.31:6849/2916991801] exists,up 5bb89940-a98f-40c3-b168-b9bc3e98f1b3
osd.9 up in weight 0.950012 up_from 33985 up_thru 49015 down_at 33968 last_clean_interval [29881,33965) [v2:10.40.136.19:6841/1189783200,v1:10.40.136.19:6844/1189783200] [v2:10.40.137.19:6841/1189783200,v1:10.40.137.19:6844/1189783200] exists,up 3fe3fd66-93b0-440f-993b-445453415fbd
osd.10 up in weight 1 up_from 34098 up_thru 49018 down_at 34080 last_clean_interval [30831,34076) [v2:10.40.136.30:6816/1545087195,v1:10.40.136.30:6817/1545087195] [v2:10.40.137.30:6816/1545087195,v1:10.40.137.30:6817/1545087195] exists,up ea0c0156-424a-4d54-8a35-3a0219ce1cef
osd.11 up in weight 1 up_from 34064 up_thru 49007 down_at 34014 last_clean_interval [30764,34008) [v2:10.40.136.31:6840/208244957,v1:10.40.136.31:6841/208244957] [v2:10.40.137.31:6840/208244957,v1:10.40.137.31:6841/208244957] exists,up bf730401-ee2b-4f15-88fb-149fa804ff12
osd.12 up in weight 1 up_from 33984 up_thru 48604 down_at 33970 last_clean_interval [30703,33966) [v2:10.40.136.19:6817/3440415969,v1:10.40.136.19:6819/3440415969] [v2:10.40.137.19:6817/3440415969,v1:10.40.137.19:6819/3440415969] exists,up 04008079-0230-4b09-be44-28c6b9fbfcd2
osd.13 up in weight 1 up_from 34099 up_thru 48911 down_at 34080 last_clean_interval [30711,34076) [v2:10.40.136.30:6836/1069316143,v1:10.40.136.30:6837/1069316143] [v2:10.40.137.30:6836/1069316143,v1:10.40.137.30:6837/1069316143] exists,up 03177899-807e-4e4f-9391-dfeaaa93b1ac
osd.14 up in weight 1 up_from 34064 up_thru 49008 down_at 34015 last_clean_interval [30139,34008) [v2:10.40.136.31:6800/1901072215,v1:10.40.136.31:6803/1901072215] [v2:10.40.137.31:6802/1901072215,v1:10.40.137.31:6803/1901072215] exists,up dda69eef-c998-4e32-bb03-9828066611a7
osd.15 up in weight 1 up_from 33986 up_thru 49016 down_at 33969 last_clean_interval [30823,33965) [v2:10.40.136.19:6852/1228464414,v1:10.40.136.19:6854/1228464414] [v2:10.40.137.19:6852/1228464414,v1:10.40.137.19:6854/1228464414] exists,up cf163e4d-73cf-44d4-989b-abac839f7f4f
osd.16 up in weight 1 up_from 34098 up_thru 48594 down_at 34080 last_clean_interval [27903,34075) [v2:10.40.136.30:6820/346991284,v1:10.40.136.30:6822/346991284] [v2:10.40.137.30:6820/346991284,v1:10.40.137.30:6822/346991284] exists,up 2cb45b04-bca0-40c1-a850-92b8b843c364
osd.17 up in weight 0.950012 up_from 34063 up_thru 48855 down_at 34015 last_clean_interval [30218,34008) [v2:10.40.136.31:6834/797694268,v1:10.40.136.31:6837/797694268] [v2:10.40.137.31:6835/797694268,v1:10.40.137.31:6837/797694268] exists,up 3b4ef426-e56b-47ea-a25f-5a281f6b0e4d
osd.18 up in weight 1 up_from 34098 up_thru 48145 down_at 34077 last_clean_interval [30215,34074) [v2:10.40.136.30:6805/736025612,v1:10.40.136.30:6808/736025612] [v2:10.40.137.30:6806/736025612,v1:10.40.137.30:6808/736025612] exists,up 2b53548a-d282-40ca-a1af-ffa1c8c5549b
osd.19 up in weight 0.599991 up_from 33984 up_thru 48898 down_at 33970 last_clean_interval [27907,33965) [v2:10.40.136.19:6835/955730844,v1:10.40.136.19:6837/955730844] [v2:10.40.137.19:6835/955730844,v1:10.40.137.19:6837/955730844] exists,up 131b4f47-0a7e-4fc3-a1ec-3fd6b0749ad7
osd.20 up in weight 1 up_from 34064 up_thru 48913 down_at 34012 last_clean_interval [30827,34008) [v2:10.40.136.31:6816/1081193690,v1:10.40.136.31:6817/1081193690] [v2:10.40.137.31:6816/1081193690,v1:10.40.137.31:6817/1081193690] exists,up ee19dac1-71d6-4e89-a75d-67ff61a9a873
osd.21 up in weight 1 up_from 34099 up_thru 48487 down_at 34077 last_clean_interval [30904,34074) [v2:10.40.136.30:6800/2151805991,v1:10.40.136.30:6803/2151805991] [v2:10.40.137.30:6800/2151805991,v1:10.40.137.30:6801/2151805991] exists,up 557353a4-d41c-427a-9602-b5df1c05f27b
osd.22 up in weight 1 up_from 33984 up_thru 49024 down_at 33971 last_clean_interval [30153,33965) [v2:10.40.136.19:6802/1448300047,v1:10.40.136.19:6803/1448300047] [v2:10.40.137.19:6802/1448300047,v1:10.40.137.19:6803/1448300047] exists,up 98fbdb05-a31e-4f65-a841-46a9dcc4acab
osd.23 up in weight 1 up_from 34064 up_thru 49020 down_at 34014 last_clean_interval [29790,34008) [v2:10.40.136.31:6801/2069623313,v1:10.40.136.31:6802/2069623313] [v2:10.40.137.31:6800/2069623313,v1:10.40.137.31:6801/2069623313] exists,up 133f7a8d-20d3-43d8-93fc-a41ea9c93a8b
osd.24 up in weight 1 up_from 34069 up_thru 48985 down_at 34013 last_clean_interval [33937,34010) [v2:10.40.136.31:6851/111935890,v1:10.40.136.31:6853/111935890] [v2:10.40.137.31:6851/111935890,v1:10.40.137.31:6853/111935890] exists,up 13ba14e8-ffa8-4328-a257-108dc8feccfd
osd.25 up in weight 1 up_from 33994 up_thru 48980 down_at 33969 last_clean_interval [33194,33965) [v2:10.40.136.19:6816/1578177584,v1:10.40.136.19:6818/1578177584] [v2:10.40.137.19:6816/1578177584,v1:10.40.137.19:6818/1578177584] exists,up 4eba66a4-15e5-49a2-8454-20340926cae1
osd.26 up in weight 1 up_from 44684 up_thru 48928 down_at 44682 last_clean_interval [36381,44681) [v2:10.40.136.30:6856/2667036643,v1:10.40.136.30:6857/2667036643] [v2:10.40.137.30:6856/2667036643,v1:10.40.137.30:6857/2667036643] exists,up 9ede428c-6c86-4955-bc6f-6f906364b7f5
osd.27 up in weight 1 up_from 34067 up_thru 48990 down_at 34011 last_clean_interval [30647,34008) [v2:10.40.136.31:6804/3247864981,v1:10.40.136.31:6806/3247864981] [v2:10.40.137.31:6804/3247864981,v1:10.40.137.31:6806/3247864981] exists,up 7967a10d-e3e9-4b90-9cbc-ed8e1fb34127
osd.28 up in weight 1 up_from 34002 up_thru 48976 down_at 33966 last_clean_interval [33568,33965) [v2:10.40.136.19:6848/1211909359,v1:10.40.136.19:6849/1211909359] [v2:10.40.137.19:6848/1211909359,v1:10.40.137.19:6849/1211909359] exists,up 5a18ef00-426f-4149-a8ac-6b83bc0057df
osd.29 up in weight 1 up_from 34106 up_thru 48940 down_at 34078 last_clean_interval [27917,34074) [v2:10.40.136.30:6821/191184276,v1:10.40.136.30:6823/191184276] [v2:10.40.137.30:6821/191184276,v1:10.40.137.30:6823/191184276] exists,up aa5d15ed-901b-4795-9891-c6806b6f8920
osd.30 up in weight 1 up_from 36374 up_thru 48962 down_at 36368 last_clean_interval [34067,36366) [v2:10.40.136.31:6842/2131009470,v1:10.40.136.31:6843/2131009470] [v2:10.40.137.31:6842/2131009470,v1:10.40.137.31:6843/2131009470] exists,up 610dcf6d-7651-497f-b279-a6d254a18c6b
osd.31 up in weight 1 up_from 33997 up_thru 48962 down_at 33966 last_clean_interval [30730,33965) [v2:10.40.136.19:6824/3965926459,v1:10.40.136.19:6825/3965926459] [v2:10.40.137.19:6824/3965926459,v1:10.40.137.19:6825/3965926459] exists,up 96711ba4-9b14-42ba-bfef-5b5e41c787ec
osd.32 up in weight 1 up_from 44704 up_thru 48932 down_at 44697 last_clean_interval [34114,44696) [v2:10.40.136.30:6860/98181250,v1:10.40.136.30:6861/98181250] [v2:10.40.137.30:6860/98181250,v1:10.40.137.30:6861/98181250] exists,up 8602ea9d-859d-4bfa-9a5a-d48626f3e2ce
osd.33 up in weight 1 up_from 34068 up_thru 48992 down_at 34010 last_clean_interval [33882,34008) [v2:10.40.136.31:6860/75616253,v1:10.40.136.31:6861/75616253] [v2:10.40.137.31:6860/75616253,v1:10.40.137.31:6861/75616253] exists,up b14294bc-f6fd-431c-82b4-504787d155c3
osd.34 up in weight 1 up_from 33998 up_thru 48994 down_at 33966 last_clean_interval [33268,33965) [v2:10.40.136.19:6853/4213686847,v1:10.40.136.19:6855/4213686847] [v2:10.40.137.19:6853/4213686847,v1:10.40.137.19:6855/4213686847] exists,up 20b1fc79-1699-4d0c-aa08-b20980830d5b
osd.35 up in weight 1 up_from 48980 up_thru 49000 down_at 48976 last_clean_interval [44535,48974) [v2:10.40.136.30:6828/796973594,v1:10.40.136.30:6829/796973594] [v2:10.40.137.30:6828/796973594,v1:10.40.137.30:6829/796973594] exists,up b315a268-f54e-44c7-bdf9-11a001c1c3c7
osd.36 up in weight 1 up_from 34068 up_thru 48976 down_at 34009 last_clean_interval [33745,34008) [v2:10.40.136.31:6812/357593772,v1:10.40.136.31:6813/357593772] [v2:10.40.137.31:6812/357593772,v1:10.40.137.31:6813/357593772] exists,up f63274bf-1ded-47bc-bf82-6ceddf20b97a
osd.37 up in weight 1 up_from 44846 up_thru 48999 down_at 44844 last_clean_interval [34000,44843) [v2:10.40.136.19:6834/2202899341,v1:10.40.136.19:6836/2202899341] [v2:10.40.137.19:6834/2202899341,v1:10.40.137.19:6836/2202899341] exists,up 8455b9d3-6e96-491a-89e6-75f2a9c3f6ba
osd.38 up in weight 1 up_from 48962 up_thru 48978 down_at 48958 last_clean_interval [36400,48957) [v2:10.40.136.30:6838/2897105189,v1:10.40.136.30:6840/2897105189] [v2:10.40.137.30:6838/2897105189,v1:10.40.137.30:6840/2897105189] exists,up 6171592f-727e-44aa-84ce-755010f7a6e3
osd.39 up in weight 1 up_from 34069 up_thru 48976 down_at 34012 last_clean_interval [28274,34008) [v2:10.40.136.31:6820/425484265,v1:10.40.136.31:6821/425484265] [v2:10.40.137.31:6820/425484265,v1:10.40.137.31:6821/425484265] exists,up 37450001-e084-4ad9-9aa1-0e0f8be99b57
osd.40 up in weight 1 up_from 33991 up_thru 48976 down_at 33969 last_clean_interval [28664,33965) [v2:10.40.136.19:6828/1352966624,v1:10.40.136.19:6829/1352966624] [v2:10.40.137.19:6828/1352966624,v1:10.40.137.19:6829/1352966624] exists,up 23c854cc-a235-4fd9-8b60-b09820a8c148
osd.41 up in weight 1 up_from 44685 up_thru 48952 down_at 44680 last_clean_interval [34106,44678) [v2:10.40.136.30:6850/125657673,v1:10.40.136.30:6852/125657673] [v2:10.40.137.30:6850/125657673,v1:10.40.137.30:6853/125657673] exists,up 7e0b1ad7-c6c0-421f-b1a9-d926e572c9d1
osd.43 up in weight 1 up_from 48928 up_thru 48997 down_at 48926 last_clean_interval [44832,48925) [v2:10.40.136.19:6860/2439838026,v1:10.40.136.19:6861/2439838026] [v2:10.40.137.19:6860/2439838026,v1:10.40.137.19:6861/2439838026] exists,up a9c558e1-b5ed-432f-b927-ac13ee201af9
osd.44 up in weight 1 up_from 34109 up_thru 44864 down_at 34076 last_clean_interval [33890,34074) [v2:10.40.136.30:6846/3516264091,v1:10.40.136.30:6847/3516264091] [v2:10.40.137.30:6846/3516264091,v1:10.40.137.30:6847/3516264091] exists,up a78ff880-01e7-4fb5-8835-838272340946
osd.45 up in weight 1 up_from 44396 up_thru 48980 down_at 44393 last_clean_interval [34068,44392) [v2:10.40.136.31:6850/3605419931,v1:10.40.136.31:6852/3605419931] [v2:10.40.137.31:6850/3605419931,v1:10.40.137.31:6852/3605419931] exists,up 3bcadd7c-ab5e-416c-a084-b9653be2704b
osd.46 up in weight 1 up_from 34109 up_thru 48926 down_at 34078 last_clean_interval [28315,34074) [v2:10.40.136.30:6844/1547041860,v1:10.40.136.30:6845/1547041860] [v2:10.40.137.30:6844/1547041860,v1:10.40.137.30:6845/1547041860] exists,up 48eadccf-e669-4a38-afea-724ec30ad2dc
osd.47 up in weight 1 up_from 34067 up_thru 48980 down_at 34016 last_clean_interval [27926,34009) [v2:10.40.136.31:6832/1440913495,v1:10.40.136.31:6833/1440913495] [v2:10.40.137.31:6832/1440913495,v1:10.40.137.31:6833/1440913495] exists,up 60f4b16e-24a0-4c6d-aecd-c5af45925d1e
osd.48 up in weight 1 up_from 33989 up_thru 48996 down_at 33970 last_clean_interval [27911,33965) [v2:10.40.136.19:6831/678817177,v1:10.40.136.19:6833/678817177] [v2:10.40.137.19:6831/678817177,v1:10.40.137.19:6833/678817177] exists,up ce49568e-3f26-451e-a370-ed6ab8c717a9
pg_upmap_items 6.3 [11,14]
pg_upmap_items 6.5 [9,15]
pg_upmap_items 6.7 [5,8]
pg_upmap_items 6.a [19,22]
pg_upmap_items 6.1a [19,6]
pg_upmap_items 6.20 [9,15]
pg_upmap_items 6.25 [9,15]
pg_upmap_items 6.2c [19,6,5,8]
pg_upmap_items 6.31 [21,16]
pg_upmap_items 6.35 [19,6]
pg_upmap_items 6.39 [9,15]
pg_upmap_items 6.58 [9,15]
pg_upmap_items 6.5e [5,8]
pg_upmap_items 6.69 [19,6]
pg_upmap_items 6.6d [9,15]
pg_upmap_items 6.7c [19,6]
pg_upmap_items 7.2 [26,35]
pg_upmap_items 7.b [36,39]
pg_upmap_items 7.c [44,41]
pg_upmap_items 7.10 [36,39]
pg_upmap_items 7.11 [46,32]
pg_upmap_items 7.12 [26,32]
pg_upmap_items 7.14 [26,41]
pg_upmap_items 7.19 [36,24]
pg_upmap_items 7.1e [24,45]
pg_upmap_items 7.1f [38,32]
pg_upmap_items 7.24 [48,31]
pg_upmap_items 7.2c [36,27]
pg_upmap_items 7.2d [46,35]
pg_upmap_items 7.34 [36,27]
pg_upmap_items 7.37 [24,47]
pg_upmap_items 7.3b [38,41]
pg_upmap_items 7.3f [46,41]
pg_temp 7.fe [44,48,39]
blacklist 10.40.136.31:0/1543267051 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6865/1735026091 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6864/1735026091 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:0/1120460670 expires 2021-04-09T10:07:44.091402+0000
blacklist 10.40.136.31:6865/2542377202 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/3055030675 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/376350552 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:6864/2542377202 expires 2021-04-08T23:23:50.751408+0000
blacklist 10.40.136.31:0/2796177981 expires 2021-04-09T10:07:44.091402+0000
Updated by Sage Weil about 3 years ago
Can you try setting log_to_file=true and debug_ms=0/20, reproduce the crash, and then send us the resulting log file that includes the crash?
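For reference, the requested settings can also be applied persistently via a ceph.conf fragment on the affected host (a sketch; the option names mirror the ones requested above, and the section can be narrowed to a specific daemon such as [osd.34]):

```
[osd]
log_to_file = true
debug_ms = 0/20
```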
Updated by Adrian Dabuleanu about 3 years ago
Had another 2 OSD crashes today. Here are the OSD logs (crash-osd-34.log and crash-osd-37.log), captured after applying the debug settings you suggested: https://drive.google.com/drive/folders/1NiGiujjKw-wIXOnne2dERH7Q8zoXQBQC . I could not attach them to this ticket because they are around 10 MB each.
Please let me know if you need other logs.
Thanks,
Adrian
Updated by Sage Weil about 3 years ago
- Status changed from Need More Info to Fix Under Review
As before, it looks like 2 problems here:
1. This crash itself. I think I see the locking bug.
2. The peer is in a reconnect loop. In my case I saw this because of the nonce issue; I'm not sure why you are seeing it. :/ Fixing (1) will prevent the crash but won't address the underlying issue...
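For context, the crashing frame is a `std::set::find()` on the connection set, and a locking bug of this kind usually means one code path touches the set without holding the messenger's lock. A minimal, hypothetical sketch of the guard pattern (illustrative names only, not Ceph's actual classes):

```cpp
#include <mutex>
#include <set>

// Hypothetical sketch: a connection registry shared between messenger
// worker threads. Every access to the set -- including read-only lookups
// like the find() in the crashing frame -- must hold the same mutex.
class ConnRegistry {
  std::mutex lock_;       // stand-in for the messenger's internal lock
  std::set<int> conns_;   // stand-in for the set of tracked connections
public:
  void add(int id) {
    std::lock_guard<std::mutex> g(lock_);
    conns_.insert(id);
  }
  void remove(int id) {   // what a _stop()-style teardown should do
    std::lock_guard<std::mutex> g(lock_);
    conns_.erase(id);
  }
  bool contains(int id) { // even a lookup needs the lock
    std::lock_guard<std::mutex> g(lock_);
    return conns_.find(id) != conns_.end();
  }
};
```

The point of the sketch is only that lookups and erasures must serialize on the same lock; which lock Ceph actually uses is determined by the fix in the referenced pull request.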
Updated by Sage Weil about 3 years ago
- Backport set to pacific,octopus
- Pull request ID set to 40912
Updated by Adrian Dabuleanu about 3 years ago
After observing the crash thread, we found that a possible cause of the crashes is RAM pressure: around the time of the crashes, some OSDs are killed by the kernel OOM killer.
$ dmesg -T
[Wed Apr 21 09:09:26 2021] ceph-osd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=1000
[Wed Apr 21 09:09:26 2021] CPU: 34 PID: 1983779 Comm: ceph-osd Kdump: loaded Tainted: G I --------- - - 4.18.0-240.15.1.el8_3.x86_64 #1
[Wed Apr 21 09:09:26 2021] Hardware name: Dell Inc. PowerEdge R740xd/08D89F, BIOS 1.5.6 10/17/2018
[Wed Apr 21 09:09:26 2021] Call Trace:
[Wed Apr 21 09:09:26 2021] dump_stack+0x5c/0x80
[Wed Apr 21 09:09:26 2021] dump_header+0x51/0x308
[Wed Apr 21 09:09:26 2021] ? try_to_free_pages+0xe8/0x1c0
[Wed Apr 21 09:09:26 2021] oom_kill_process.cold.28+0xb/0x10
[Wed Apr 21 09:09:26 2021] out_of_memory+0x1c1/0x4b0
[Wed Apr 21 09:09:26 2021] __alloc_pages_slowpath+0xc24/0xd40
[Wed Apr 21 09:09:26 2021] __alloc_pages_nodemask+0x245/0x280
[Wed Apr 21 09:09:26 2021] filemap_fault+0x3b8/0x840
[Wed Apr 21 09:09:26 2021] ? hrtimer_try_to_cancel+0x25/0x100
[Wed Apr 21 09:09:26 2021] ? _cond_resched+0x15/0x30
[Wed Apr 21 09:09:26 2021] __xfs_filemap_fault+0x6d/0x200 [xfs]
[Wed Apr 21 09:09:26 2021] __do_fault+0x38/0xc0
[Wed Apr 21 09:09:26 2021] do_fault+0x191/0x3c0
[Wed Apr 21 09:09:26 2021] __handle_mm_fault+0x3e6/0x7c0
[Wed Apr 21 09:09:26 2021] handle_mm_fault+0xc2/0x1d0
[Wed Apr 21 09:09:26 2021] __do_page_fault+0x21b/0x4d0
[Wed Apr 21 09:09:26 2021] do_page_fault+0x32/0x110
[Wed Apr 21 09:09:26 2021] ? page_fault+0x8/0x30
[Wed Apr 21 09:09:26 2021] page_fault+0x1e/0x30
[Wed Apr 21 09:09:26 2021] RIP: 0033:0x7f4b30a5564a
[Wed Apr 21 09:09:26 2021] Code: Bad RIP value.
[Wed Apr 21 09:09:26 2021] RSP: 002b:00007f4b2aa63570 EFLAGS: 00010246
[Wed Apr 21 09:09:26 2021] RAX: ffffffffffffff92 RBX: 0000556b5462a1e8 RCX: 00007f4b30a5564a
[Wed Apr 21 09:09:26 2021] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 0000556b5462a214
[Wed Apr 21 09:09:26 2021] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[Wed Apr 21 09:09:26 2021] R10: 00007f4b2aa63680 R11: 0000000000000246 R12: 0000556b5462a1c0
[Wed Apr 21 09:09:26 2021] R13: 0000556b5462a214 R14: 00007f4b2aa63680 R15: 0000000000000000
[Wed Apr 21 09:09:26 2021] Mem-Info:
[Wed Apr 21 09:09:26 2021] active_anon:48316112 inactive_anon:8718 isolated_anon:0
active_file:64 inactive_file:990 isolated_file:41
unevictable:0 dirty:0 writeback:0 unstable:0
slab_reclaimable:94846 slab_unreclaimable:310927
mapped:6771 shmem:9771 pagetables:106620 bounce:0
free:105307 free_pcp:57 free_cma:0
[Wed Apr 21 09:09:26 2021] Node 0 active_anon:95578536kB inactive_anon:1356kB active_file:0kB inactive_file:652kB unevictable:0kB isolated(anon):0kB isolated(file):28kB mapped:0kB dirty:0kB writeback:0kB shmem:2636kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 51200kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Apr 21 09:09:26 2021] Node 1 active_anon:97685912kB inactive_anon:33516kB active_file:256kB inactive_file:3308kB unevictable:0kB isolated(anon):0kB isolated(file):136kB mapped:27084kB dirty:0kB writeback:0kB shmem:36448kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2048kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[Wed Apr 21 09:09:26 2021] Node 0 DMA free:15552kB min:4kB low:16kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15552kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 1362 87617 87617 87617
[Wed Apr 21 09:09:26 2021] Node 0 DMA32 free:345236kB min:684kB low:2076kB high:3468kB active_anon:1056416kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1755968kB managed:1427384kB mlocked:0kB kernel_stack:208kB pagetables:180kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 86254 86254 86254
[Wed Apr 21 09:09:26 2021] Node 0 Normal free:16260kB min:43504kB low:131828kB high:220152kB active_anon:94522120kB inactive_anon:1356kB active_file:0kB inactive_file:652kB unevictable:0kB writepending:0kB present:97517568kB managed:88324448kB mlocked:0kB kernel_stack:20376kB pagetables:188656kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 0 0 0
[Wed Apr 21 09:09:26 2021] Node 1 Normal free:44180kB min:45908kB low:139108kB high:232308kB active_anon:97685912kB inactive_anon:33516kB active_file:256kB inactive_file:3308kB unevictable:0kB writepending:0kB present:100663296kB managed:93208884kB mlocked:0kB kernel_stack:24936kB pagetables:237644kB bounce:0kB free_pcp:108kB local_pcp:0kB free_cma:0kB
[Wed Apr 21 09:09:26 2021] lowmem_reserve[]: 0 0 0 0 0
[Wed Apr 21 09:09:26 2021] Node 0 DMA: 0*4kB 2*8kB (U) 3*16kB (U) 2*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15552kB
[Wed Apr 21 09:09:26 2021] Node 0 DMA32: 414*4kB (MEH) 711*8kB (UMEH) 792*16kB (UMH) 507*32kB (UMEH) 346*64kB (UMEH) 166*128kB (UMH) 46*256kB (UMEH) 14*512kB (UME) 9*1024kB (UM) 4*2048kB (UE) 56*4096kB (UM) = 345360kB
[Wed Apr 21 09:09:26 2021] Node 0 Normal: 1479*4kB (UMEH) 188*8kB (UME) 105*16kB (UME) 200*32kB (UEH) 4*64kB (H) 2*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16268kB
[Wed Apr 21 09:09:26 2021] Node 1 Normal: 2452*4kB (UMH) 1434*8kB (UMEH) 1157*16kB (UMEH) 220*32kB (UEH) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46896kB
[Wed Apr 21 09:09:26 2021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Apr 21 09:09:26 2021] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Apr 21 09:09:26 2021] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[Wed Apr 21 09:09:26 2021] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[Wed Apr 21 09:09:26 2021] 11213 total pagecache pages
[Wed Apr 21 09:09:26 2021] 0 pages in swap cache
[Wed Apr 21 09:09:26 2021] Swap cache stats: add 0, delete 0, find 0/0
[Wed Apr 21 09:09:26 2021] Free swap = 0kB
[Wed Apr 21 09:09:26 2021] Total swap = 0kB
[Wed Apr 21 09:09:26 2021] 49988207 pages RAM
[Wed Apr 21 09:09:26 2021] 0 pages HighMem/MovableOnly
[Wed Apr 21 09:09:26 2021] 4244140 pages reserved
[Wed Apr 21 09:09:26 2021] 0 pages hwpoisoned
[Wed Apr 21 09:09:26 2021] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Wed Apr 21 09:09:26 2021] [ 1584] 0 1584 37101 12093 344064 0 0 systemd-journal
[Wed Apr 21 09:09:26 2021] [ 1621] 0 1621 29538 608 225280 0 -1000 systemd-udevd
[Wed Apr 21 09:09:26 2021] [ 2136] 0 2136 41174 196 188416 0 -1000 auditd
[Wed Apr 21 09:09:26 2021] [ 2138] 0 2138 12130 90 139264 0 0 sedispatch
[Wed Apr 21 09:09:26 2021] [ 2166] 998 2166 508458 1763 393216 0 0 polkitd
[Wed Apr 21 09:09:26 2021] [ 2169] 0 2169 4437 37 65536 0 0 mcelog
[Wed Apr 21 09:09:26 2021] [ 2170] 81 2170 19159 201 167936 0 -900 dbus-daemon
[Wed Apr 21 09:09:26 2021] [ 2173] 0 2173 53699 507 430080 0 0 sssd
[Wed Apr 21 09:09:26 2021] [ 2177] 997 2177 4928 39 69632 0 0 lsmd
[Wed Apr 21 09:09:26 2021] [ 2179] 0 2179 31315 234 143360 0 0 irqbalance
[Wed Apr 21 09:09:26 2021] [ 2183] 0 2183 12759 422 139264 0 0 smartd
[Wed Apr 21 09:09:26 2021] [ 2184] 989 2184 95327 210 233472 0 0 rngd
[Wed Apr 21 09:09:26 2021] [ 2214] 990 2214 32228 134 159744 0 0 chronyd
[Wed Apr 21 09:09:26 2021] [ 2252] 0 2252 55312 640 430080 0 0 sssd_be
[Wed Apr 21 09:09:26 2021] [ 2283] 0 2283 56216 418 466944 0 0 sssd_nss
[Wed Apr 21 09:09:26 2021] [ 2303] 0 2303 20976 257 196608 0 0 systemd-logind
[Wed Apr 21 09:09:26 2021] [ 2939] 0 2939 23072 224 192512 0 -1000 sshd
[Wed Apr 21 09:09:26 2021] [ 2940] 0 2940 106588 3764 434176 0 0 tuned
[Wed Apr 21 09:09:26 2021] [ 2941] 0 2941 1504994 23199 1138688 0 -999 kubelet
[Wed Apr 21 09:09:26 2021] [ 2942] 0 2942 66858 5534 282624 0 0 rsyslogd
[Wed Apr 21 09:09:26 2021] [ 2952] 0 2952 1372921 10680 888832 0 -999 containerd
[Wed Apr 21 09:09:26 2021] [ 2955] 0 2955 9232 221 106496 0 0 crond
[Wed Apr 21 09:09:26 2021] [ 2956] 0 2956 10994 51 118784 0 0 atd
[Wed Apr 21 09:09:26 2021] [ 2998] 0 2998 3408 28 61440 0 0 agetty
[Wed Apr 21 09:09:26 2021] [ 3064] 0 3064 1432928 26985 1171456 0 -999 dockerd
[Wed Apr 21 09:09:26 2021] [ 6602] 0 6602 28280 327 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6603] 0 6603 27992 320 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6604] 0 6604 27992 260 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6605] 0 6605 27992 292 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6678] 0 6678 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6680] 0 6680 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6688] 0 6688 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6699] 0 6699 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6767] 0 6767 27992 279 77824 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6802] 0 6802 27992 282 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6821] 0 6821 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6830] 0 6830 28008 350 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6843] 0 6843 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 6867] 0 6867 28344 340 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6886] 0 6886 27992 336 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6892] 0 6892 27992 290 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6912] 0 6912 27992 318 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6947] 0 6947 27992 259 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 6995] 0 6995 28360 353 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7007] 0 7007 28344 336 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7044] 0 7044 28344 313 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7057] 0 7057 28360 308 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7086] 0 7086 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7131] 0 7131 27992 344 77824 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7132] 0 7132 27944 269 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7133] 0 7133 27928 262 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7156] 0 7156 28344 265 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7203] 0 7203 28296 324 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7214] 0 7214 242 1 24576 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7219] 0 7219 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7225] 0 7225 242 1 32768 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7266] 0 7266 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7267] 0 7267 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7293] 0 7293 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7294] 0 7294 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7295] 0 7295 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7312] 0 7312 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7321] 0 7321 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7322] 0 7322 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7343] 0 7343 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7345] 0 7345 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7346] 0 7346 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 7455] 0 7455 28360 338 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7458] 0 7458 28008 269 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7600] 0 7600 28008 333 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7635] 0 7635 178751 758 110592 0 1000 csi-node-driver
[Wed Apr 21 09:09:26 2021] [ 7649] 0 7649 178751 734 114688 0 1000 csi-node-driver
[Wed Apr 21 09:09:26 2021] [ 7656] 0 7656 187811 3370 212992 0 -999 kube-proxy
[Wed Apr 21 09:09:26 2021] [ 7768] 0 7768 28344 269 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7801] 0 7801 28344 348 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 7866] 0 7866 1090469 2576 757760 0 1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 7874] 0 7874 1071844 2649 745472 0 1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8349] 0 8349 27928 301 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8449] 0 8449 28344 313 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8561] 0 8561 850073 2535 643072 0 1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8599] 0 8599 831831 2370 630784 0 1000 cephcsi
[Wed Apr 21 09:09:26 2021] [ 8942] 0 8942 27992 343 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8982] 0 8982 27992 343 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 8997] 65534 8997 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 9011] 0 9011 242 1 28672 0 -998 pause
[Wed Apr 21 09:09:26 2021] [ 9055] 0 9055 28360 326 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [ 9094] 65534 9094 181328 2714 180224 0 1000 node_exporter
[Wed Apr 21 09:09:26 2021] [10526] 0 10526 27992 307 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [10593] 0 10593 979402 2553 561152 0 -997 flanneld
[Wed Apr 21 09:09:26 2021] [10723] 0 10723 28344 780 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [10812] 167 10812 4440418 4029129 34652160 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [10945] 0 10945 28344 781 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11034] 0 11034 28344 727 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11052] 167 11052 3212853 2558465 24850432 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11139] 167 11139 4439593 4079185 34639872 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11227] 0 11227 28344 749 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11278] 0 11278 28344 709 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11305] 167 11305 2546593 2171110 19484672 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [11364] 167 11364 434890 241374 2949120 0 1000 ceph-mon
[Wed Apr 21 09:09:26 2021] [11537] 0 11537 27944 315 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11573] 0 11573 11140 1527 135168 0 1000 ceph-crash
[Wed Apr 21 09:09:26 2021] [11613] 0 11613 28344 680 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [11633] 167 11633 4295048 3911305 33476608 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [1966072] 0 1966072 28344 730 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [1966092] 167 1966092 2536442 1917360 19382272 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [1983663] 0 1983663 28344 595 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [1983692] 167 1983692 2349325 1870377 17883136 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2003377] 0 2003377 28344 761 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2003396] 167 2003396 2422957 1897019 18501632 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2004557] 0 2004557 28344 670 94208 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2004583] 167 2004583 2823392 2399948 21696512 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2009557] 0 2009557 28344 625 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2009578] 167 2009578 2466486 2022868 18841600 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [2010358] 0 2010358 28344 625 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [2010378] 167 2010378 3031970 2597166 23359488 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3265497] 0 3265497 28344 505 86016 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3265519] 167 3265519 4062035 3793136 31653888 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3266802] 0 3266802 28344 687 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3266822] 167 3266822 4497716 4230869 35123200 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3270975] 0 3270975 28344 807 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3270994] 167 3270994 3967122 3729950 30855168 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3274123] 0 3274123 28344 657 90112 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3274144] 167 3274144 4494897 4206707 35098624 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [3755838] 0 3755838 28344 443 81920 0 -998 containerd-shim
[Wed Apr 21 09:09:26 2021] [3755859] 167 3755859 2725350 2521164 20897792 0 1000 ceph-osd
[Wed Apr 21 09:09:26 2021] [112604] 0 112604 123433 1400 159744 0 -998 runc
[Wed Apr 21 09:09:26 2021] [112605] 0 112605 5978 1275 73728 0 -998 runc
[Wed Apr 21 09:09:26 2021] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=d87a901cec4c109cd26e6832847d1a22b4e61f4b7563c93c2db0294a5f0ba81e,mems_allowed=0-1,global_oom,task_memcg=/kubepods/besteffort/pod9883fed2-c3d8-47d5-9f69-2e6b7176bc13/dde4e8592a3ee00e7e1d523095d8d1939593d01f6ca5fc5b921588f5a7f5808c,task=ceph-osd,pid=3266822,uid=167
[Wed Apr 21 09:09:26 2021] Out of memory: Killed process 3266822 (ceph-osd) total-vm:17990864kB, anon-rss:16923476kB, file-rss:0kB, shmem-rss:0kB, UID:167
[Wed Apr 21 09:09:28 2021] oom_reaper: reaped process 3266822 (ceph-osd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Wed Apr 21 09:09:32 2021] iptables[113272]: segfault at 88 ip 00007fb815b80e47 sp 00007ffd77560418 error 4 in libnftnl.so.11.3.0[7fb815b7c000+16000]
[Wed Apr 21 09:09:32 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:20:02 2021] iptables[135288]: segfault at 88 ip 00007f5b7bddee47 sp 00007ffc20ab5188 error 4 in libnftnl.so.11.3.0[7f5b7bdda000+16000]
[Wed Apr 21 09:20:02 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:46:51 2021] iptables[191860]: segfault at 88 ip 00007f8860230e47 sp 00007fffc4f15388 error 4 in libnftnl.so.11.3.0[7f886022c000+16000]
[Wed Apr 21 09:46:51 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 09:48:12 2021] iptables[194703]: segfault at 88 ip 00007fb5fd607e47 sp 00007ffff6459678 error 4 in libnftnl.so.11.3.0[7fb5fd603000+16000]
[Wed Apr 21 09:48:12 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:11:36 2021] iptables[243752]: segfault at 88 ip 00007f90505a3e47 sp 00007ffd1f4a9f38 error 4 in libnftnl.so.11.3.0[7f905059f000+16000]
[Wed Apr 21 10:11:36 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:24:18 2021] iptables[270483]: segfault at 88 ip 00007f378dd25e47 sp 00007fffb363c858 error 4 in libnftnl.so.11.3.0[7f378dd21000+16000]
[Wed Apr 21 10:24:18 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
[Wed Apr 21 10:28:24 2021] IPv6: ADDRCONF(NETDEV_UP): veth3f1f1570: link is not ready
[Wed Apr 21 10:28:24 2021] IPv6: ADDRCONF(NETDEV_CHANGE): veth3f1f1570: link becomes ready
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered blocking state
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 10:28:24 2021] device veth3f1f1570 entered promiscuous mode
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered blocking state
[Wed Apr 21 10:28:24 2021] cni0: port 1(veth3f1f1570) entered forwarding state
[Wed Apr 21 10:28:26 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 10:28:26 2021] device veth3f1f1570 left promiscuous mode
[Wed Apr 21 10:28:26 2021] cni0: port 1(veth3f1f1570) entered disabled state
[Wed Apr 21 11:25:59 2021] iptables[400663]: segfault at 88 ip 00007fce8ba41e47 sp 00007ffca2b13c08 error 4 in libnftnl.so.11.3.0[7fce8ba3d000+16000]
[Wed Apr 21 11:25:59 2021] Code: bf 88 00 00 00 48 8b 2f 48 39 df 74 13 4c 89 ee 41 ff d4 85 c0 78 0b 48 89 ef 48 8b 6d 00 eb e8 31 c0 5a 5b 5d 41 5c 41 5d c3 <48> 8b 87 88 00 00 00 48 81 c7 88 00 00 00 48 39 f8 74 0b 85 f6 74
Could this be the underlying issue that you are referring to?
Thanks,
Adrian
Updated by Kefu Chai about 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 3 years ago
- Copied to Backport #50482: octopus: segv in AsyncConnection::_stop() added
Updated by Backport Bot about 3 years ago
- Copied to Backport #50483: pacific: segv in AsyncConnection::_stop() added
Updated by Adrian Dabuleanu almost 3 years ago
Hi,
Any news on when this bug fix will be available in a Ceph release?
Thanks,
Adrian
Updated by Neha Ojha over 2 years ago
- Has duplicate Bug #52176: crash: std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnecti added
Updated by alexandre derumier over 2 years ago
Hi,
I had 2 crashes today on Pacific 16.2.6 (I thought it was fixed in this version? Or is it another bug?)
{
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(0x14140) [0x7f21a34f6140]",
"(AsyncMessenger::unregister_conn(boost::intrusive_ptr<AsyncConnection> const&)+0x70) [0x55f9fb9ebdf0]",
"(AsyncConnection::_stop()+0x5a) [0x55f9fb9e466a]",
"(ProtocolV2::stop()+0x8d) [0x55f9fba0fe2d]",
"(ProtocolV2::_fault()+0x1ab) [0x55f9fba1010b]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x55f9fba10a71]",
"(AsyncConnection::process()+0x511) [0x55f9fb9e8321]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x141) [0x55f9fb8195f1]",
"/usr/bin/ceph-osd(+0x13da062) [0x55f9fb81f062]",
"/lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0) [0x7f21a3379ed0]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f21a34eaea7]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-09-30T17:47:12.895238Z_57fa9a38-72fc-48d9-bef0-db850a52e848",
"entity_name": "osd.4",
"os_id": "11",
"os_name": "Debian GNU/Linux 11 (bullseye)",
"os_version": "11 (bullseye)",
"os_version_id": "11",
"process_name": "ceph-osd",
"stack_sig": "15a9fc1118d0f904bb1aa31fd4ea165498353da0ef33f672252c702653a09b72",
"timestamp": "2021-09-30T17:47:12.895238Z",
"utsname_hostname": "mindceph1-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-8-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.46-4 (2021-08-03)"
}
{
"archived": "2021-10-01 13:42:01.114671",
"backtrace": [
"/lib/x86_64-linux-gnu/libpthread.so.0(0x14140) [0x7f1655d42140]",
"(AsyncMessenger::unregister_conn(boost::intrusive_ptr<AsyncConnection> const&)+0x70) [0x55e27c310df0]",
"(AsyncConnection::_stop()+0x5a) [0x55e27c30966a]",
"(ProtocolV2::stop()+0x8d) [0x55e27c334e2d]",
"(ProtocolV2::_fault()+0x1ab) [0x55e27c33510b]",
"(ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x31) [0x55e27c335a71]",
"(AsyncConnection::process()+0x511) [0x55e27c30d321]",
"(EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x141) [0x55e27c13e5f1]",
"/usr/bin/ceph-osd(+0x13da062) [0x55e27c144062]",
"/lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0) [0x7f1655bc5ed0]",
"/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f1655d36ea7]",
"clone()"
],
"ceph_version": "16.2.6",
"crash_id": "2021-09-30T09:17:13.122241Z_769ca0cc-a96c-4c5a-a624-87030b22c98f",
"entity_name": "osd.4",
"os_id": "11",
"os_name": "Debian GNU/Linux 11 (bullseye)",
"os_version": "11 (bullseye)",
"os_version_id": "11",
"process_name": "ceph-osd",
"stack_sig": "15a9fc1118d0f904bb1aa31fd4ea165498353da0ef33f672252c702653a09b72",
"timestamp": "2021-09-30T09:17:13.122241Z",
"utsname_hostname": "mindceph1-1.odiso.net",
"utsname_machine": "x86_64",
"utsname_release": "5.10.0-8-amd64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Debian 5.10.46-4 (2021-08-03)"
}
ceph crash info 2021-09-30T09:17:13.122241Z_769ca0cc-a96c-4c5a-a624-87030b22c98f
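For readers triaging similar reports: `ceph crash info <id>`, as used above, is part of Ceph's crash module. A typical workflow (a sketch; these are the upstream `ceph crash` subcommands, run against a live cluster) is:

```
# list recent crashes recorded by the crash module
ceph crash ls
# dump the full metadata and backtrace for one crash id
ceph crash info <crash_id>
# acknowledge a crash so it stops raising RECENT_CRASH health warnings
ceph crash archive <crash_id>
```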
Updated by Radoslaw Zarzynski over 2 years ago
- Has duplicate Bug #51527: Ceph osd crashed due to segfault added
Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".