Bug #54744
opencrash: void MonMap::add(const mon_info_t&): assert(addr_mons.count(a) == 0)
0ab303078f78301b00a08a4683ab26737444aad0204ae8af0f8a8fb705db5424
Description
Assert condition: addr_mons.count(a) == 0
Assert function: void MonMap::add(const mon_info_t&)
Sanitized backtrace:
MonMap::init_with_addrs(std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)
MonMap::init_with_ips(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)
MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)
Crash dump sample:
{ "assert_condition": "addr_mons.count(a) == 0", "assert_file": "mon/MonMap.h", "assert_func": "void MonMap::add(const mon_info_t&)", "assert_line": 221, "assert_msg": "mon/MonMap.h: In function 'void MonMap::add(const mon_info_t&)' thread 7fdbfa1e1580 time 2021-12-14T20:00:23.402334+0000\nmon/MonMap.h: 221: FAILED ceph_assert(addr_mons.count(a) == 0)", "assert_thread_name": "ceph-mon", "backtrace": [ "/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fdbfabd93c0]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x7fdbfb0f4db0]", "/usr/lib/ceph/libceph-common.so.2(+0x265f5d) [0x7fdbfb0f4f5d]", "(MonMap::init_with_addrs(std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x3a8) [0x7fdbfb4e80a8]", "(MonMap::init_with_ips(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x93) [0x7fdbfb4e84a3]", "(MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)+0x837) [0x7fdbfb4ea597]", "main()", "__libc_start_main()", "_start()" ], "ceph_version": "16.2.7", "crash_id": "2021-12-14T20:00:23.405800Z_673ad289-fd40-4f50-b3d4-e43fa654ecb3", "entity_name": "mon.8d7b2c1b42f80e192ec02d5fda7e1d93895fe9e0", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "20.04.3 LTS (Focal Fossa)", "os_version_id": "20.04", "process_name": "ceph-mon", "stack_sig": "0ab303078f78301b00a08a4683ab26737444aad0204ae8af0f8a8fb705db5424", "timestamp": "2021-12-14T20:00:23.405800Z", "utsname_machine": "x86_64", "utsname_release": "5.4.0-91-generic", "utsname_sysname": "Linux", "utsname_version": "#102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021" }
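To make the failing condition concrete, here is a minimal Python model of the invariant the assert appears to enforce: each address may belong to at most one monitor. This is only a sketch for illustration; `ToyMonMap` and its fields are hypothetical names that loosely mirror `mon_info` and `addr_mons`, and it is not the Ceph implementation.

```python
class ToyMonMap:
    """Hypothetical, simplified stand-in for a monmap (not Ceph code)."""

    def __init__(self):
        self.mon_info = {}   # monitor name -> list of addresses
        self.addr_mons = {}  # address -> monitor name

    def add(self, name, addrs):
        # Modeled on the assert from the crash: an address that is already
        # registered for some monitor must not be added a second time.
        assert name not in self.mon_info, f"duplicate monitor name {name}"
        for a in addrs:
            assert a not in self.addr_mons, f"duplicate address {a}"
        self.mon_info[name] = list(addrs)
        for a in addrs:
            self.addr_mons[a] = name


if __name__ == "__main__":
    m = ToyMonMap()
    m.add("a", ["172.18.200.132:6789"])
    try:
        m.add("b", ["172.18.200.132:6789"])  # same address, different name
    except AssertionError as e:
        print("assert hit:", e)
```

In this model, feeding the same address twice aborts the whole build of the map, which matches the behaviour reported here: one duplicate entry is enough to stop the process from starting.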
Updated by Telemetry Bot about 2 years ago
Updated by Gabriel Mainberger over 1 year ago
Rook v1.6.5 / Ceph v12.2.9 was running on the host network rather than inside the Kubernetes SDN, which caused a mon canary deployment to be created with the same host IP as the regular mon pod. This led to duplicate mon endpoint entries, which prevented several Ceph components from starting. See https://github.com/projectsyn/component-rook-ceph/issues/89 for more information.
While this was clearly a bug in Rook, duplicate entries should not prevent Ceph components from starting. There is enough information to start; the mon entries just have to be deduplicated. Doing so would make Ceph more robust and prevent downtime.
Example: "172.18.200.162:6789","172.18.200.146:6789","172.18.200.132:6789","172.18.200.132:6789"
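The deduplication suggested above could look roughly like the following. This is a sketch in Python rather than a patch against Ceph; `dedupe_endpoints` is a hypothetical helper name, and it assumes that two entries are duplicates only when the strings match exactly after trimming whitespace.

```python
def dedupe_endpoints(endpoints):
    """Return the endpoints with exact duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for ep in endpoints:
        key = ep.strip()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The endpoint list from the example above.
mon_host = [
    "172.18.200.162:6789",
    "172.18.200.146:6789",
    "172.18.200.132:6789",
    "172.18.200.132:6789",
]

if __name__ == "__main__":
    print(",".join(dedupe_endpoints(mon_host)))
```

Keeping the first occurrence and preserving order means the remaining three entries stay in the order the operator wrote them.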
Updated by Stefan Kooman 30 days ago
This should indeed be fixed. I wanted to disable msgv1 on this cluster. I had already set "ceph config set mon ms_bind_msgr1 false" after cephadm bootstrap, but this did not prevent the new monitors from listening on v1. I therefore set the address by hand:
ceph mon set-addrs $hostname [ip]:3300
I made a typo, hit the assert, and got a core dump:
./src/mon/MonMap.h: In function 'void MonMap::add(const mon_info_t&)' thread 7a9d0bce5640 time 2024-03-27T10:13:05.714510+0100
./src/mon/MonMap.h: 221: FAILED ceph_assert(addr_mons.count(a) == 0)
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7a9d0d7627d5]
2: /usr/lib/ceph/libceph-common.so.2(+0x162989) [0x7a9d0d762989]
3: /usr/lib/ceph/libceph-common.so.2(+0x195ceb) [0x7a9d0d795ceb]
4: (MonMap::init_with_ips(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x8b) [0x7a9d0da7b00b]
5: (MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)+0x36f) [0x7a9d0da808df]
6: (MonClient::build_initial_monmap()+0x134) [0x7a9d0da61c34]
7: (MonClient::get_monmap_and_config()+0x110) [0x7a9d0da666f0]
This is on 18.2.2
Now I have to figure out how I can fix that, if at all ... or just start over.
Updated by Stefan Kooman 30 days ago
The priority level is set to "minor", but once messenger v1 is deprecated, operators will disable msgv1 on their upgraded clusters and more incidents like this are bound to happen. It would be better to fix this in time to prevent future incidents, and to bump the severity level.
Updated by Sergey Borodavkin about 17 hours ago
It seems to be the same story here on Pacific 16.2.15.
My monmap before the change:
0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] mon.stat1
dumped monmap epoch 12
I mistyped the address in the mon set-addrs command; another monitor was already bound to that address.
ceph mon set-addrs stat1 v2:10.60.11.2:3300
After that, all operations related to the monmap and config (almost any operation) fail with a backtrace through MonClient::get_monmap_and_config.
I fixed it like this:
- Stop the monitor with the mistyped address and get its monmap
systemctl stop ceph-mon@stat1
ceph-monstore-tool /var/lib/ceph/mon/ceph-stat1 get monmap > monmap_current
monmaptool --print monmap_current # verify that this is the problem map
Problem map, with a duplicated address for the stat1 and stat2 monitors:
0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: v2:10.60.11.2:3300/0 mon.stat1
dumped monmap epoch 13
- Copy the extracted map and correct the addresses
cp monmap_current monmap
monmaptool --rm stat1 monmap
monmaptool --addv stat1 [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] monmap
- Inject that map and start the mon service
ceph-mon --inject-monmap monmap --id stat1
systemctl start ceph-mon@stat1
At this point the other monitors discard election messages from stat1:
mon.stat2@1(peon).elector(2090) discarding election message: v2:10.60.11.3:3300/0 not in my monmap e13
- The same is needed on another mon host: inject the edited map from the first host
# stat2
systemctl stop ceph-mon@stat2
ceph-mon --inject-monmap monmap --id stat2
systemctl start ceph-mon@stat2
- After the election, the third monitor will pick up the latest monmap epoch and all problems should be resolved =)
# ceph mon dump
...
0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] mon.stat1
dumped monmap epoch 15
You can see that the epoch has increased and the binds are fine.
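Before injecting an edited map, it may help to check the printed monmap for addresses claimed by more than one monitor. The following Python sketch does that for text in the "rank: addrs mon.name" layout shown in this comment; `find_duplicate_addrs` and the regular expression are hypothetical, and the real output of monmaptool --print contains additional header lines that this parser simply skips.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
#   2: v2:10.60.11.2:3300/0 mon.stat1
LINE_RE = re.compile(r"^\s*(\d+):\s+(\S+)\s+mon\.(\S+)\s*$")


def find_duplicate_addrs(monmap_text):
    """Return {address: [monitor names]} for addresses used by more than one monitor."""
    owners = defaultdict(list)
    for line in monmap_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # header, "dumped monmap epoch N", and other lines
        _rank, addrvec, name = m.groups()
        for addr in addrvec.strip("[]").split(","):
            if name not in owners[addr]:
                owners[addr].append(name)
    return {a: names for a, names in owners.items() if len(names) > 1}


# The problem map from epoch 13, as quoted in this comment.
PROBLEM_MAP = """\
0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: v2:10.60.11.2:3300/0 mon.stat1
dumped monmap epoch 13
"""

if __name__ == "__main__":
    for addr, names in find_duplicate_addrs(PROBLEM_MAP).items():
        print(f"{addr} is claimed by: {', '.join(names)}")
```

An empty result for the edited map would suggest that it no longer contains the duplicate that triggers the assert.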
Updated by Konstantin Shalygin about 16 hours ago
- Category set to Administration/Usability
- Affected Versions v16.2.15, v17.2.8, v18.2.3, v19.1.0, v20.0.0 added
- Affected Versions deleted (v16.2.7)
- Component(RADOS) Monitor added