Bug #54744

open

crash: void MonMap::add(const mon_info_t&): assert(addr_mons.count(a) == 0)

Added by Telemetry Bot about 2 years ago. Updated about 16 hours ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Administration/Usability
Target version:
-
% Done:

0%

Source:
Telemetry
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS):
Monitor
Pull request ID:
Crash signature (v1):

0ab303078f78301b00a08a4683ab26737444aad0204ae8af0f8a8fb705db5424


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=0301792b024ebcd170453531530f95fe622fdcc6bb593bef38045e3654d96bcd

Assert condition: addr_mons.count(a) == 0
Assert function: void MonMap::add(const mon_info_t&)

Sanitized backtrace:

    MonMap::init_with_addrs(std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)
    MonMap::init_with_ips(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)
    MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)

Crash dump sample:
{
    "assert_condition": "addr_mons.count(a) == 0",
    "assert_file": "mon/MonMap.h",
    "assert_func": "void MonMap::add(const mon_info_t&)",
    "assert_line": 221,
    "assert_msg": "mon/MonMap.h: In function 'void MonMap::add(const mon_info_t&)' thread 7fdbfa1e1580 time 2021-12-14T20:00:23.402334+0000\nmon/MonMap.h: 221: FAILED ceph_assert(addr_mons.count(a) == 0)",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fdbfabd93c0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ad) [0x7fdbfb0f4db0]",
        "/usr/lib/ceph/libceph-common.so.2(+0x265f5d) [0x7fdbfb0f4f5d]",
        "(MonMap::init_with_addrs(std::vector<entity_addrvec_t, std::allocator<entity_addrvec_t> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x3a8) [0x7fdbfb4e80a8]",
        "(MonMap::init_with_ips(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x93) [0x7fdbfb4e84a3]",
        "(MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)+0x837) [0x7fdbfb4ea597]",
        "main()",
        "__libc_start_main()",
        "_start()" 
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2021-12-14T20:00:23.405800Z_673ad289-fd40-4f50-b3d4-e43fa654ecb3",
    "entity_name": "mon.8d7b2c1b42f80e192ec02d5fda7e1d93895fe9e0",
    "os_id": "ubuntu",
    "os_name": "Ubuntu",
    "os_version": "20.04.3 LTS (Focal Fossa)",
    "os_version_id": "20.04",
    "process_name": "ceph-mon",
    "stack_sig": "0ab303078f78301b00a08a4683ab26737444aad0204ae8af0f8a8fb705db5424",
    "timestamp": "2021-12-14T20:00:23.405800Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-91-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#102-Ubuntu SMP Fri Nov 5 16:31:28 UTC 2021" 
}

Actions #1

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v16.2.7 added
Actions #2

Updated by Gabriel Mainberger over 1 year ago

Rook v1.6.5 / Ceph v12.2.9 was running on the host network rather than inside the Kubernetes SDN, so the mon canary deployment was created with the same host IP as the regular mon pod. This led to duplicate mon endpoint entries, which prevented several Ceph components from starting. See https://github.com/projectsyn/component-rook-ceph/issues/89 for more information.

While this was clearly a bug in Rook, duplicate entries should not prevent Ceph components from starting. There is enough information to start; the mon entries just have to be deduplicated. Doing so would make Ceph more robust and prevent downtime.

Example: "172.18.200.162:6789","172.18.200.146:6789","172.18.200.132:6789","172.18.200.132:6789"
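
The deduplication suggested above can be sketched as follows. This is only a minimal Python illustration, not Ceph or Rook code: the helper name `dedupe_endpoints` is hypothetical, and it assumes endpoints are plain `ip:port` strings that can be compared literally.

```python
def dedupe_endpoints(endpoints):
    """Drop repeated endpoints while keeping the first-seen order.

    Hypothetical helper for illustration; endpoints are compared as
    plain strings after trimming surrounding whitespace.
    """
    seen = set()
    unique = []
    for endpoint in endpoints:
        key = endpoint.strip()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The duplicated list from the example above.
mon_endpoints = [
    "172.18.200.162:6789",
    "172.18.200.146:6789",
    "172.18.200.132:6789",
    "172.18.200.132:6789",
]
print(dedupe_endpoints(mon_endpoints))
```

Keeping the first occurrence preserves the original ordering of the remaining monitors, so the resulting list differs from the input only by the dropped repeat.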

Actions #3

Updated by Stefan Kooman 30 days ago

This should indeed be fixed. I wanted to disable msgv1 on this cluster. I had already set the flag "ceph config set mon ms_bind_msgr1 false" after the cephadm bootstrap, but this did not prevent the new monitors from listening on v1. I therefore set the address by hand:

ceph mon set-addrs $hostname [ip]:3300

I made a typo, got a core dump, and hit the assert:

./src/mon/MonMap.h: In function 'void MonMap::add(const mon_info_t&)' thread 7a9d0bce5640 time 2024-03-27T10:13:05.714510+0100
./src/mon/MonMap.h: 221: FAILED ceph_assert(addr_mons.count(a) == 0)
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7a9d0d7627d5]
2: /usr/lib/ceph/libceph-common.so.2(+0x162989) [0x7a9d0d762989]
3: /usr/lib/ceph/libceph-common.so.2(+0x195ceb) [0x7a9d0d795ceb]
4: (MonMap::init_with_ips(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, std::basic_string_view<char, std::char_traits<char> >)+0x8b) [0x7a9d0da7b00b]
5: (MonMap::build_initial(ceph::common::CephContext*, bool, std::ostream&)+0x36f) [0x7a9d0da808df]
6: (MonClient::build_initial_monmap()+0x134) [0x7a9d0da61c34]
7: (MonClient::get_monmap_and_config()+0x110) [0x7a9d0da666f0]

This is on 18.2.2

Now I have to figure out how I can fix that, if at all ... or just start over.

Actions #4

Updated by Stefan Kooman 30 days ago

The severity level is set to "minor". Once messenger v1 is deprecated, operators will disable msgv1 on their upgraded clusters, and more incidents like this are bound to happen. It would be better to get this fixed in time to prevent future incidents, and to bump the severity level.

Actions #5

Updated by Sergey Borodavkin about 17 hours ago

Seems to be the same story here on Pacific 16.2.15.

My previous monmap, before the change:

0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] mon.stat1
dumped monmap epoch 12

I mistyped the address in the mon set-addrs command; another monitor was already bound to that address.

ceph mon set-addrs stat1 v2:10.60.11.2:3300

After that, all operations related to the monmap and config (almost any operation) crash with a trace through MonClient::get_monmap_and_config.
I fixed it like this:

  1. Stop the monitor with the mistyped address and get its monmap
    systemctl stop ceph-mon@stat1
    ceph-monstore-tool /var/lib/ceph/mon/ceph-stat1 get monmap > monmap_current
    monmaptool --print monmap_current # verify that this is the problem map
    

    Problem map, with a duplicated address for the stat1 and stat2 monitors:
    0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
    1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
    2: v2:10.60.11.2:3300/0 mon.stat1
    dumped monmap epoch 13
    
  2. Copy the extracted map and change the addresses
    cp monmap_current monmap
    monmaptool --rm stat1 monmap
    monmaptool --addv stat1 [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] monmap
    
  3. Inject that map and start the mon service
    ceph-mon --inject-monmap monmap --id stat1
    systemctl start ceph-mon@stat1
    

    At this point the other monitors discard election messages from stat1:
    mon.stat2@1(peon).elector(2090) discarding election message: v2:10.60.11.3:3300/0 not in my monmap e13
  4. Do the same on another mon host: inject the edited map from the first host
    # stat2
    systemctl stop ceph-mon@stat2
    ceph-mon --inject-monmap monmap --id stat2
    systemctl start ceph-mon@stat2
    
  5. After the election, the third monitor will pick up the latest monmap epoch and all problems should be resolved =)
    # ceph mon dump
    ...
    0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
    1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
    2: [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] mon.stat1
    dumped monmap epoch 15
    

    You can see that the epoch increased and the binds are fine.
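
A mistyped set-addrs like the one above could be caught before it is applied by comparing the new address against the current monmap listing. The following is a rough Python sketch under stated assumptions, not part of any Ceph tool: the function names `parse_monmap` and `find_conflicts` are hypothetical, the input is the text listing shown in this comment, addresses are compared as `type:ip:port` strings with the trailing `/nonce` stripped, and only IPv4 addresses are handled.

```python
import re

# Matches listing lines such as
#   "1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2"
#   "2: v2:10.60.11.2:3300/0 mon.stat1"
# Assumption: IPv4 only; bracketed IPv6 addresses are not handled.
LINE_RE = re.compile(r"^\s*\d+:\s+(\[[^\]]*\]|\S+)\s+mon\.(\S+)\s*$")


def normalize(addr):
    """Strip whitespace and the trailing '/nonce' from one address."""
    return addr.strip().split("/", 1)[0]


def split_addrs(addrvec):
    """Turn '[a,b]' or 'a' into a set of normalized addresses."""
    return {normalize(a) for a in addrvec.strip().strip("[]").split(",")}


def parse_monmap(text):
    """Map monitor name -> set of addresses from a monmap text listing."""
    mons = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            addrvec, name = match.groups()
            mons[name] = split_addrs(addrvec)
    return mons


def find_conflicts(mons, target, new_addrs):
    """Return (address, owner) pairs already used by monitors other than target."""
    wanted = split_addrs(new_addrs)
    return sorted(
        (addr, name)
        for name, addrs in mons.items()
        if name != target
        for addr in addrs & wanted
    )


# The monmap from before the change, as listed in this comment.
MONMAP_DUMP = """\
0: [v2:10.60.11.1:3300/0,v1:10.60.11.1:6789/0] mon.stat3
1: [v2:10.60.11.2:3300/0,v1:10.60.11.2:6789/0] mon.stat2
2: [v2:10.60.11.3:3300/0,v1:10.60.11.3:6789/0] mon.stat1
dumped monmap epoch 12
"""

mons = parse_monmap(MONMAP_DUMP)
# The mistyped address collides with stat2; an empty list would mean no collision.
print(find_conflicts(mons, "stat1", "v2:10.60.11.2:3300"))
```

A non-empty result means another monitor already owns one of the requested addresses, which is the condition that later trips the assert.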
Actions #6

Updated by Konstantin Shalygin about 16 hours ago

  • Category set to Administration/Usability
  • Affected Versions v16.2.15, v17.2.8, v18.2.3, v19.1.0, v20.0.0 added
  • Affected Versions deleted (v16.2.7)
  • Component(RADOS) Monitor added