Bug #50089

mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of monitors in the cluster

Added by Neha Ojha over 1 year ago. Updated 2 days ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

1bfa48148eee52e245e1d06fc24c58f9ce7afcb91c369a99f37e45a25aa52f83
3f975382699c40c8ac1ac12dba2e974a050365cf6f4cdb7efa680f93e6c14d49
3fe4f79db30422625bc0d5d967e6570000d166f2a621a96fc832dc27fedc31bf
4bd6d829bdd117a5f4c7f03eb85e9b6e889d009090af74421f1a21ac9bab4be6
541337866aef4900d3c4ab536694b5efa3c48a1422b9fdc2c35fbcd441614b4a
58ae1b1868b4566ed94ce7798ff840e0e611b4646bcddb014639cedfc6a7901f
5ad55dd4483662974892618f86c3484c74c939979ab6f781bdc165c297983a0f
614b25a1a3fff2ae344523df3d7f2d377ad653ea2f3cd14bc73a11f65551dd5c
6defcf68dc501e6ad721f1bb9154bb98d1b519cbfe8b1718a1497aeeae5a4517
7bb10076aaa32ffda8244ebce0ef12ba522af1d5162605c6225ce32b2b53d815
809365b772c688f5bcf09a9bddada817ea66f5a8ad30a12ce43068e74e56a0c9
96b49c839d59492286f04a76ececd021835a660aabcfedc92ead1b3b31aa9978
d8860eca1bbb09f5149f02114b62a0dcdfa0a65a399c9a628a3fa3f190518025
d92e036dce71f761a510d23ba1d3b7a857fc9c9ea01f60a363a91616dd74f28f
e932678c4790d707352613c005dc6074c072924c65db73dba35ad06ed159e3d7
e9e13cf41d815dd96f1d1014f9c144fa8e74c842164e8ed8d4fd4c268491ce16
f3fc8fc7e2bdbb7d14f1f6b000ef63b360ff153d1e8e73c9410b953559e71249
fe1851c46283d7dee4fed131b4bdac681635f617e9027d115f3ea0c1953550bf
13c496a12d26e28acfc1dc4160ed38576c2cc3e861548aaa98339fb85a031a40
4c54fe2531a5e37e16f9c4db836d98177671863fdf9b32160e49226edc2526b3
505059af5eee43a87e362af522e1b2d59d4b50af74fcac678a3de70c7caad121
f039aa5eece36634a4a9b4d9d6aff95374aa2950e775b18d7c609a2b2aa98e4a
227a4ff681489d6f6f93a5c516587508339956e8339b004849647dccedec5d71
505bc4de5eb8aec6e7f6b83c3d30f7b964c030ba9ca296c3b0f2543476258d8d
5ed46198af542faafdabb96d8b4189d853d082495671bce1412b4d54e0b347a2


Description

    -2> 2021-03-31T14:28:43.137+0000 7f348c4f3700  5 mon.pluto002@0(electing).elector(23)  so far i have { mon.0: features 4540138297136906239 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging]), mon.2: features 4540138297136906239 mon_feature_t([kraken,luminous,mimic,osdmap-prune,nautilus,octopus,pacific,elector-pinging]) }
    -1> 2021-03-31T14:28:43.420+0000 7f348ecf8700 -1 /builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7f348ecf8700 time 2021-03-31T14:28:43.421216+0000
/builddir/build/BUILD/ceph-16.1.0-1323-g7e7e1f4e/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())
 ceph version 16.1.0-1323.el8cp (46ac37397f0332c20aceceb8022a1ac1ddf8fa73) pacific (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f349a0693b8]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2765d2) [0x7f349a0695d2]
 3: (Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55a4b92a5868]
 4: (Elector::ping_check(int)+0x30f) [0x55a4b92a618f]
 5: (Context::complete(int)+0xd) [0x55a4b9226fdd]
 6: (SafeTimer::timer_thread()+0x1b7) [0x7f349a157be7]
 7: (SafeTimerThread::entry()+0x11) [0x7f349a1591c1]
 8: /lib64/libpthread.so.0(+0x815a) [0x7f3497b5d15a]
 9: clone()

Steps to reproduce: reduce the number of monitors from 5 to 3.
Workaround: restart the crashed monitor (the crash is transient).
Source: https://bugzilla.redhat.com/show_bug.cgi?id=1945266
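The failure mode can be illustrated with a minimal, self-contained sketch (hypothetical simplified types, not Ceph's actual `MonMap`): the elector still addresses a peer by its old rank after the monmap's rank list has shrunk, so the bounds check in the `get_addrs`-style lookup fires.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for MonMap: only the rank list matters here.
struct MiniMonMap {
    std::vector<std::string> ranks;  // addresses indexed by rank

    // Mirrors the failing lookup: MonMap::get_addrs() asserts m < ranks.size().
    const std::string& get_addrs(unsigned m) const {
        if (m >= ranks.size())
            throw std::out_of_range("rank no longer in monmap");  // stands in for ceph_assert
        return ranks[m];
    }
};

// Returns true if a ping addressed to `peer_rank` would trip the bounds
// check, i.e. the elector still tracks a peer the monmap no longer has.
bool ping_would_crash(const MiniMonMap& map, unsigned peer_rank) {
    try {
        (void)map.get_addrs(peer_rank);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```

Shrinking a 5-rank map to 3 while a ping to rank 4 is still scheduled reproduces the out-of-range lookup; with all 5 ranks present the same ping is fine.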


Related issues

Related to RADOS - Bug #50088: rados: qa: suites do not test mon removal New
Related to RADOS - Bug #55695: Shutting down a monitor forces Paxos to restart and sometimes disregard subsequent commands Fix Under Review
Related to RADOS - Bug #58155: mon:ceph_assert(m < ranks.size()) `different code path than tracker 50089` In Progress
Duplicated by RADOS - Bug #52183: crash: const entity_addrvec_t& MonMap::get_addrs(unsigned int) const: assert(m < ranks.size()) Duplicate
Duplicated by RADOS - Bug #52170: crash: const entity_addrvec_t& MonMap::get_addrs(unsigned int) const: assert(m < ranks.size()) Duplicate
Duplicated by RADOS - Bug #54529: mon/mon-bind.sh: Failure due to cores found Duplicate
Copied to RADOS - Backport #57704: quincy: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of monitors in the cluster Resolved
Copied to RADOS - Backport #57705: pacific: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of monitors in the cluster Resolved

History

#1 Updated by Neha Ojha over 1 year ago

  • Related to Bug #50088: rados: qa: suites do not test mon removal added

#2 Updated by Neha Ojha over 1 year ago

showed up in a pacific->master upgrade test

2021-04-19T19:33:42.749 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3406-gb9575dc7/rpm/el8/BUILD/ceph-17.0.0-3406-gb9575dc7/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7fc266907700 time 2021-04-19T19:33:42.396505+0000
2021-04-19T19:33:42.749 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-3406-gb9575dc7/rpm/el8/BUILD/ceph-17.0.0-3406-gb9575dc7/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())
2021-04-19T19:33:42.749 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  ceph version 17.0.0-3406-gb9575dc7 (b9575dc757ca28607c59f4051113e4a25ed8728b) quincy (dev)
2021-04-19T19:33:42.749 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7fc270c8a518]
2021-04-19T19:33:42.750 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x27c720) [0x7fc270c8a720]
2021-04-19T19:33:42.750 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  3: (Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55c28753dea8]
2021-04-19T19:33:42.750 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  4: (Elector::ping_check(int)+0x30f) [0x55c28753e7cf]
2021-04-19T19:33:42.751 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  5: (Context::complete(int)+0xd) [0x55c2874bdbed]
2021-04-19T19:33:42.751 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  6: (SafeTimer::timer_thread()+0x1c0) [0x7fc270d901a0]
2021-04-19T19:33:42.751 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  7: (SafeTimerThread::entry()+0x11) [0x7fc270d92d41]
2021-04-19T19:33:42.751 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  8: (Thread::_entry_func(void*)+0xd) [0x7fc270d81e1d]
2021-04-19T19:33:42.752 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  9: /lib64/libpthread.so.0(+0x814a) [0x7fc26e77414a]
2021-04-19T19:33:42.752 INFO:journalctl@ceph.mon.a.smithi082.stdout:Apr 19 19:33:42 smithi082 conmon[74090]:  10: clone()

rados/upgrade/pacific-x/parallel/{0-start 1-tasks distro1$/{rhel_8.3_kubic_stable} mon_election/classic upgrade-sequence workload/{ec-rados-default rados_api rados_loadgenbig rbd_import_export test_rbd_api test_rbd_python}}

/a/nojha-2021-04-19_18:28:32-rados-master-distro-basic-smithi/6059600

#3 Updated by Tejas C over 1 year ago

I see a similar crash on quincy; I suspect it occurs when I try to increase the number of mons from 1 to 3.

/]# ceph crash info 2021-06-28T06:22:31.271856Z_784b13d6-24fd-43ad-9ff2-79a1317cd3d0
{
"assert_condition": "m < ranks.size()",
"assert_file": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-5278-g79eb0c85/rpm/el8/BUILD/ceph-17.0.0-5278-g79eb0c85/src/mon/MonMap.h",
"assert_func": "const entity_addrvec_t& MonMap::get_addrs(unsigned int) const",
"assert_line": 404,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-5278-g79eb0c85/rpm/el8/BUILD/ceph-17.0.0-5278-g79eb0c85/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7f7c985aa700 time 2021-06-28T06:22:31.267650+0000\n/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-5278-g79eb0c85/rpm/el8/BUILD/ceph-17.0.0-5278-g79eb0c85/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())\n",
"assert_thread_name": "safe_timer",
"backtrace": [
"/lib64/libpthread.so.0(+0x12b20) [0x7f7ca01f0b20]",
"gsignal()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a0) [0x7f7ca26fcc4a]",
"/usr/lib64/ceph/libceph-common.so.2(+0x27ce04) [0x7f7ca26fce04]",
"(Elector::send_peer_ping(int, utime_t const*)+0x448) [0x56180a614608]",
"(Elector::ping_check(int)+0x30f) [0x56180a614f2f]",
"(Context::complete(int)+0xd) [0x56180a59497d]",
"(SafeTimer::timer_thread()+0x1c0) [0x7f7ca2801c90]",
"(SafeTimerThread::entry()+0x11) [0x7f7ca2804831]",
"(Thread::_entry_func(void*)+0xd) [0x7f7ca27f390d]",
"/lib64/libpthread.so.0(+0x814a) [0x7f7ca01e614a]",
"clone()"
],
"ceph_version": "17.0.0-5278-g79eb0c85",
"crash_id": "2021-06-28T06:22:31.271856Z_784b13d6-24fd-43ad-9ff2-79a1317cd3d0",
"entity_name": "mon.clara001",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-mon",
"stack_sig": "ec1f126f91754dcb8260f74bec942786e56cfe2e728d1c50c45bc9a62fd40586",
"timestamp": "2021-06-28T06:22:31.271856Z",
"utsname_hostname": "clara001",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-305.el8.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Thu Apr 29 08:54:30 EDT 2021"
}

#4 Updated by Neha Ojha over 1 year ago

  • Priority changed from High to Urgent

#5 Updated by Neha Ojha over 1 year ago

  • Duplicated by Bug #52183: crash: const entity_addrvec_t& MonMap::get_addrs(unsigned int) const: assert(m < ranks.size()) added

#6 Updated by Neha Ojha about 1 year ago

  • Duplicated by Bug #52170: crash: const entity_addrvec_t& MonMap::get_addrs(unsigned int) const: assert(m < ranks.size()) added

#7 Updated by Neha Ojha 11 months ago

  • Backport changed from pacific to pacific,quincy

#8 Updated by Neha Ojha 10 months ago

  • Assignee changed from Greg Farnum to Kamoltat (Junior) Sirivadhna

#9 Updated by Kamoltat (Junior) Sirivadhna 10 months ago

  • Status changed from New to Fix Under Review

#10 Updated by Kamoltat (Junior) Sirivadhna 10 months ago

  • Pull request ID set to 44993

#11 Updated by Kamoltat (Junior) Sirivadhna 9 months ago

Update:

We ran and analyzed a total of 3 runs:

1. Local vstart: stopping the monitors before removing them from the cluster.
2. Local vstart: removing the monitors from the cluster before stopping them.
3. Teuthology: `ceph orch apply mon 3` to reduce the number of monitors.

The teuthology run, which placed 5 monitors on 5 different hosts and reduced them to 3 with `ceph orch apply mon 3`, recreated the bug. The local vstart run with 5 monitors on one node, which reduced the count by manually stopping each monitor and then removing it, did not recreate the bug. However, the local vstart run that removed the monitors first and then stopped them did recreate it, with similar behavior in the elector code; this can be seen by comparing the mon.a (leader) logs of the teuthology run and the remove-before-stop vstart run.

Comparing the two local vstart attempts, we can conclude that removing a monitor from the cluster before stopping it (shutting it down) is what recreates the bug. According to https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#removing-a-monitor-manual, the correct manual removal procedure is:

1. Stop the monitor (shut it down).
2. Remove the monitor from the cluster.

This raises a question about the mechanism behind `ceph orch apply mon 3` (the command used to reduce the monitors from 5 to 3): is it possible that reducing monitors through `ceph orch` removes them from the monmap before shutting them down? We will confirm this and fix it if need be.

In any case, Ceph should not crash even if the user removes a monitor before shutting it down. Therefore, we will add a sanity check for the case where a rank is removed from the monmap before the corresponding monitor stops responding to pings.
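A minimal sketch of the kind of guard such a sanity check implies (names and types here are hypothetical simplifications; the actual fix lives in the PR linked below): validate the peer's rank against the current monmap before sending the ping, and drop the ping quietly instead of asserting.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified monmap: addresses indexed by rank.
using Ranks = std::vector<std::string>;

// Sketch of the guard: if the peer's rank is no longer present in the
// monmap (e.g. it was removed while a ping timer was still pending),
// skip the ping instead of hitting an out-of-bounds assert.
bool send_peer_ping(const Ranks& ranks, int peer_rank) {
    if (peer_rank < 0 || static_cast<std::size_t>(peer_rank) >= ranks.size())
        return false;  // peer no longer in the map; drop the ping quietly
    // ... would build and send the ping to ranks[peer_rank] here ...
    return true;
}
```

With this guard, a ping scheduled for rank 4 of a map that has shrunk to 3 ranks is silently dropped rather than crashing the monitor.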

Here is the PR I am currently working on with this sanity check: https://github.com/ceph/ceph/pull/44993

For full details regarding the logs I have collected and my full analysis: https://docs.google.com/document/d/1za8dl4lu2wygKfrQAP0NEFAAy6DJ45KcjmGNQTz5Nmc/edit?usp=sharing

#12 Updated by Telemetry Bot 9 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.0.0, v16.2.0, v16.2.1, v16.2.4, v16.2.5, v16.2.6, v16.2.7 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=96b49c839d59492286f04a76ececd021835a660aabcfedc92ead1b3b31aa9978

Assert condition: m < ranks.size()
Assert function: const entity_addrvec_t& MonMap::get_addrs(unsigned int) const

Sanitized backtrace:

    /lib64/libpthread.so.0(
    /usr/lib64/ceph/libceph-common.so.2(
    Elector::send_peer_ping(int, utime_t const*)
    Elector::begin_peer_ping(int)
    Elector::handle_ping(boost::intrusive_ptr<MonOpRequest>)
    Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)
    Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)
    Monitor::_ms_dispatch(Message*)
    Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)
    DispatchQueue::entry()
    DispatchQueue::DispatchThread::entry()
    /lib64/libpthread.so.0(
    clone()

Crash dump sample:
{
    "assert_condition": "m < ranks.size()",
    "assert_file": "mon/MonMap.h",
    "assert_func": "const entity_addrvec_t& MonMap::get_addrs(unsigned int) const",
    "assert_line": 404,
    "assert_msg": "mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7f00f8918700 time 2022-03-11T05:03:35.740526+0000\nmon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f0103f95ce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f0106259ba3]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f0106259d6c]",
        "(Elector::send_peer_ping(int, utime_t const*)+0x448) [0x5580d016bf18]",
        "(Elector::begin_peer_ping(int)+0x1eb) [0x5580d016c14b]",
        "(Elector::handle_ping(boost::intrusive_ptr<MonOpRequest>)+0x27b) [0x5580d016d11b]",
        "(Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xb3) [0x5580d016dd03]",
        "(Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xb0b) [0x5580d00e8ddb]",
        "(Monitor::_ms_dispatch(Message*)+0x670) [0x5580d00ea080]",
        "(Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x5580d0118f4c]",
        "(DispatchQueue::entry()+0x126a) [0x7f010649daba]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f010654f5d1]",
        "/lib64/libpthread.so.0(+0x81cf) [0x7f0103f8b1cf]",
        "clone()" 
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2022-03-11T05:03:35.744091Z_e0bbc4d8-76cc-4763-a41b-dcd8a34cd463",
    "entity_name": "mon.e36bb386e509bfd98b0ff41679d32d33d6a9f6e1",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mon",
    "stack_sig": "e9e13cf41d815dd96f1d1014f9c144fa8e74c842164e8ed8d4fd4c268491ce16",
    "timestamp": "2022-03-11T05:03:35.744091Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.13.0-35-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#40~20.04.1-Ubuntu SMP Mon Mar 7 09:18:32 UTC 2022" 
}

#13 Updated by Telemetry Bot 9 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.3 added

#14 Updated by Telemetry Bot 9 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)

#15 Updated by Telemetry Bot 9 months ago

  • Crash signature (v1) updated (diff)

#16 Updated by Laura Flores 9 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)

/a/teuthology-2022-01-09_07:01:02-rados-master-distro-default-smithi/6604561
/a/yuriw-2022-03-10_02:41:10-rados-wip-yuri3-testing-2022-03-09-1350-distro-default-smithi/6729296
/a/yuriw-2022-03-19_14:39:53-rados-quincy-distro-default-smithi/6747175

2022-01-10T19:57:37.509 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-bind.sh:105: TEST_mon_quorum:  ceph quorum_status --format=json
2022-01-10T19:57:37.632 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 19:57:37 socat[40133] E write(6, 0x55c37d3266c0, 153): Broken pipe
2022-01-10T19:57:48.484 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 19:57:48 socat[40189] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused

...

2022-01-10T20:47:03.286 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:03 socat[42705] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:03.287 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:03 socat[42706] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused
2022-01-10T20:47:05.351 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:05 socat[42707] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:06.683 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:06 socat[42708] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused
2022-01-10T20:47:18.300 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:18 socat[42717] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:18.302 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:18 socat[42718] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:18.303 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:18 socat[42719] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused
2022-01-10T20:47:20.367 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:20 socat[42720] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:21.699 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:21 socat[42721] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused
2022-01-10T20:47:33.313 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:33 socat[42722] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:33.318 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:33 socat[42723] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:33.319 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:33 socat[42724] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused
2022-01-10T20:47:35.383 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:35 socat[42725] E connect(5, AF=2 127.0.0.1:7137, 16): Connection refused
2022-01-10T20:47:36.715 INFO:tasks.workunit.client.0.smithi038.stderr:2022/01/10 20:47:36 socat[42726] E connect(5, AF=2 127.0.0.1:7136, 16): Connection refused

...

2022-01-10T20:47:37.606 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-bind.sh:105: TEST_mon_quorum:  jqinput=
2022-01-10T20:47:37.607 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-bind.sh:106: TEST_mon_quorum:  jq_success '' '.monmap.mons | length == 3'

...

2022-01-10T20:47:37.733 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:199: teardown:  rm -rf /tmp/ceph-asok.39613
2022-01-10T20:47:37.734 INFO:tasks.workunit.client.0.smithi038.stdout:ERROR: Failure due to cores found
2022-01-10T20:47:37.735 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-01-10T20:47:37.736 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:200: teardown:  '[' yes = yes ']'
2022-01-10T20:47:37.736 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:201: teardown:  echo 'ERROR: Failure due to cores found'
2022-01-10T20:47:37.737 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:202: teardown:  '[' -n '' ']'
2022-01-10T20:47:37.737 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:205: teardown:  return 1
2022-01-10T20:47:37.737 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:2263: main:  return 1
2022-01-10T20:47:37.738 INFO:tasks.workunit:Stopping ['mon'] on client.0...
2022-03-10T08:45:17.921+0000 7ff1dcbe9700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11051-gb5ca1477/rpm/el8/BUILD/ceph-17.0.0-11051-gb5ca1477/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7ff1dcbe9700 time 2022-03-10T08:45:17.921667+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-11051-gb5ca1477/rpm/el8/BUILD/ceph-17.0.0-11051-gb5ca1477/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())

 ceph version 17.0.0-11051-gb5ca1477 (b5ca14771f888fa46234c86404e0aee4913031fe) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7ff1e8ea1204]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x284425) [0x7ff1e8ea1425]
 3: (Elector::send_peer_ping(int, utime_t const*)+0x440) [0x561824c71cc0]
 4: (Elector::begin_peer_ping(int)+0x1eb) [0x561824c71eeb]
 5: (Elector::handle_ping(boost::intrusive_ptr<MonOpRequest>)+0x281) [0x561824c72eb1]
 6: (Elector::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xc5) [0x561824c73a95]
 7: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xc1d) [0x561824bd393d]
 8: (Monitor::_ms_dispatch(Message*)+0x457) [0x561824bd4ad7]
 9: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x561824c0505c]
 10: (DispatchQueue::entry()+0x14fa) [0x7ff1e912873a]
 11: (DispatchQueue::DispatchThread::entry()+0x11) [0x7ff1e91dfa61]
 12: /lib64/libpthread.so.0(+0x817f) [0x7ff1e6e2917f]
 13: clone()

Expected output from a successful run:

/a/yuriw-2022-03-18_00:42:20-rados-wip-yuri6-testing-2022-03-17-1547-distro-default-smithi/6743533

2022-03-18T02:51:06.225 INFO:tasks.workunit.client.0.smithi049.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-bind.sh:105: TEST_mon_quorum:  ceph quorum_status --format=json
2022-03-18T02:51:09.756 INFO:tasks.workunit.client.0.smithi049.stderr:2022/03/18 02:51:09 socat[41600] E write(6, 0x561c37c116c0, 68): Broken pipe
2022-03-18T02:51:09.983 INFO:tasks.workunit.client.0.smithi049.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-bind.sh:105: TEST_mon_quorum:  jqinput='
2022-03-18T02:51:09.983 INFO:tasks.workunit.client.0.smithi049.stderr:{"election_epoch":4,"quorum":[0,1,2],"quorum_names":["a","b","c"],"quorum_leader_name":"a","quorum_age":1,"features":{"quorum_con":"4540138303579357183","quorum_mon":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus","pacific","elector-pinging","quincy"]},"monmap":{"epoch":1,"fsid":"c8f27b4a-0323-4ffc-8154-da3f6a79cb0d","modified":"2022-03-18T02:51:03.949146Z","created":"2022-03-18T02:51:03.949146Z","min_mon_release":17,"min_mon_release_name":"quincy","election_strategy":1,"disallowed_leaders: ":"","stretch_mode":false,"tiebreaker_mon":"","features":{"persistent":["kraken","luminous","mimic","osdmap-prune","nautilus","octopus","pacific","elector-pinging","quincy"],"optional":[]},"mons":[{"rank":0,"name":"a","public_addrs":{"addrvec":[{"type":"v2","addr":"127.0.0.1:7132","nonce":0}]},"addr":"127.0.0.1:7132/0","public_addr":"127.0.0.1:7132/0","priority":0,"weight":0,"crush_location":"{}"},{"rank":1,"name":"b","public_addrs":{"addrvec":[{"type":"v2","addr":"127.0.0.1:7133","nonce":0}]},"addr":"127.0.0.1:7133/0","public_addr":"127.0.0.1:7133/0","priority":0,"weight":0,"crush_location":"{}"},{"rank":2,"name":"c","public_addrs":{"addrvec":[{"type":"v2","addr":"127.0.0.1:7134","nonce":0}]},"addr":"127.0.0.1:7134/0","public_addr":"127.0.0.1:7134/0","priority":0,"weight":0,"crush_location":"{}"}]}}'

#17 Updated by Laura Flores 9 months ago

  • Duplicated by Bug #54529: mon/mon-bind.sh: Failure due to cores found added

#18 Updated by Radoslaw Zarzynski 8 months ago

  • Status changed from Fix Under Review to In Progress

#19 Updated by Neha Ojha 8 months ago

  • Pull request ID changed from 44993 to 45299

#20 Updated by Kamoltat (Junior) Sirivadhna 8 months ago

  • Pull request ID changed from 45299 to 44993

#21 Updated by Telemetry Bot 5 months ago

  • Crash signature (v1) updated (diff)
  • Affected Versions v16.2.9, v17.2.0 added

#22 Updated by Kamoltat (Junior) Sirivadhna 5 months ago

  • Related to Bug #55695: Shutting down a monitor forces Paxos to restart and sometimes disregard subsequent commands added

#23 Updated by Gaurav Sitlani 2 months ago

  • Crash signature (v1) updated (diff)

I am seeing the same crash in ceph version 16.2.10 and just noticed that the PR linked in this thread has been merged. @Kamoltat Sirivadhna, are there any plans to backport the fix to pacific?

#24 Updated by Gaurav Sitlani 2 months ago

[ceph: root@X /]# ceph crash ls
ID                                                                ENTITY     NEW  
2022-09-28T09:55:37.128747Z_f81dc440-4b96-436d-a272-02c9d094232d  mon.X   *   
[ceph: root@X /]# ceph crash info 2022-09-28T09:55:37.128747Z_f81dc440-4b96-436d-a272-02c9d094232d
{
    "assert_condition": "m < ranks.size()",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/mon/MonMap.h",
    "assert_func": "const entity_addrvec_t& MonMap::get_addrs(unsigned int) const",
    "assert_line": 404,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/mon/MonMap.h: In function 'const entity_addrvec_t& MonMap::get_addrs(unsigned int) const' thread 7f6e23c9b700 time 2022-09-28T09:55:37.125709+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10/rpm/el8/BUILD/ceph-16.2.10/src/mon/MonMap.h: 404: FAILED ceph_assert(m < ranks.size())\n",
    "assert_thread_name": "safe_timer",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f6e2cb20ce0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6e2ede6e39]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x277002) [0x7f6e2ede7002]",
        "(Elector::send_peer_ping(int, utime_t const*)+0x448) [0x55e32ae395d8]",
        "(Elector::ping_check(int)+0x30f) [0x55e32ae39eff]",
        "(Context::complete(int)+0xd) [0x55e32adb938d]",
        "(CommonSafeTimer<std::mutex>::timer_thread()+0x10f) [0x7f6e2eedcc5f]",
        "(CommonSafeTimerThread<std::mutex>::entry()+0x11) [0x7f6e2eeddff1]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f6e2cb161ca]",
        "clone()" 
    ],
    "ceph_version": "16.2.10",
    "crash_id": "2022-09-28T09:55:37.128747Z_f81dc440-4b96-436d-a272-02c9d094232d",
    "entity_name": "mon.X",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mon",
    "stack_sig": "45b9556ff37ea7005e07c92442f15b857a77cd73ba7cb0eca73b9478834f57c8",
    "timestamp": "2022-09-28T09:55:37.128747Z",
    "utsname_hostname": "X",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-408.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Mon Jul 18 17:42:52 UTC 2022" 
}

#25 Updated by Neha Ojha 2 months ago

  • Status changed from In Progress to Pending Backport

#26 Updated by Backport Bot 2 months ago

  • Copied to Backport #57704: quincy: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of monitors in the cluster added

#27 Updated by Backport Bot 2 months ago

  • Copied to Backport #57705: pacific: mon/MonMap.h: FAILED ceph_assert(m < ranks.size()) when reducing number of monitors in the cluster added

#28 Updated by Backport Bot 2 months ago

  • Tags set to backport_processed

#29 Updated by Kamoltat (Junior) Sirivadhna 24 days ago

  • Status changed from Pending Backport to Resolved

#30 Updated by Kamoltat (Junior) Sirivadhna 2 days ago

  • Status changed from Resolved to New

#31 Updated by Kamoltat (Junior) Sirivadhna 2 days ago

  • Status changed from New to Resolved

#32 Updated by Kamoltat (Junior) Sirivadhna 1 day ago

  • Related to Bug #58155: mon:ceph_assert(m < ranks.size()) `different code path than tracker 50089` added
