Bug #40777
hit assert in AuthMonitor::update_from_paxos
0810347d2b7e07374c20a010e7ca411ca125d50c01528c53c2e7d8556e1ef3ee
0d54c46a0fb63aa4b54e651261bdf5fd8f1a40f76131cbee615df5e5b8d412c7
3fdd4ce4b0d6529edfd9648756a49bcc247aa294fddd59d16b79cbb5b5fc93a8
47453f0f1c52fc416989fa4834738e749589093ee2e023ce22ca034c17441e3f
5cc0dd2441a47d889aadcd5ab5fd98685bea94235e4c052c8eb2fe64819d6012
5e65c408af9278ef1649d782df52b4b944463a56a06c5f5a4628735c3ed71329
69e63e0e1ca4aadc0d3ce45b8820b2717654642518a78a41783b1010b47c2334
c961c55e7b3d0a454cbb0f8ba37201e2a25f07d575dfc1303cab2f7b211cbab5
d824e0db27c500cc0793b2f262a6d29db2cb075bb14b2162b72d382f232bec9a
e4de3f9b33f756f03d809432dc35f5e9bbef8eba284d91af5905a591eb680b92
ed678e97c3a893fd090c48f06f8931a04d01db69fcf059c0eed46070360cc8bc
Description
I created the Ceph cluster with Rook (https://github.com/rook/rook); the Ceph version is 12.2.7 (stable).
After I rebooted the host, 1 of the 3 monitors hit this assert:
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f4bf21c3160]
2: (AuthMonitor::update_from_paxos(bool*)+0x159d) [0x7f4bf1fd50fd]
3: (PaxosService::refresh(bool*)+0x1ae) [0x7f4bf20a0d7e]
4: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f4bf1f6a57b]
5: (Monitor::init_paxos()+0x115) [0x7f4bf1f6a9c5]
6: (Monitor::preinit()+0x9c6) [0x7f4bf1f6b3e6]
7: (main()+0x4012) [0x7f4bf1e9b042]
8: (__libc_start_main()+0xf5) [0x7f4bee24c445]
9: (()+0x3afd5e) [0x7f4bf1f3fd5e]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I only found some similar but very old issues, and their fixes are already in the code.
Updated by Joao Eduardo Luis almost 5 years ago
- Project changed from Ceph to RADOS
- Category set to Correctness/Safety
- Priority changed from Normal to High
- Source set to Community (user)
- Component(RADOS) Monitor added
Is this reproducible? If so, can you add mon logs (ideally both for peons and leader), at 'debug mon = 10', 'debug paxos = 10', and 'debug ms = 1'?
Updated by sdkfzv sdkfzv almost 5 years ago
rook-ceph-mon0: 0> 2019-07-02 12:16:42.239710 7f4bf1b77ec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f4bf1b77ec0 time 2019-07-02 12:16:42.237387
rook-ceph-mon0: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/mon/AuthMonitor.cc: 157: FAILED assert(ret == 0)
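For anyone comparing logs across reports, here is a small sketch that pulls the source file, line number, and failed condition out of a "FAILED assert" line like the one above. The regular expression and the parse_assert helper are my own assumptions for illustration, not part of any Ceph tooling; it accepts both the older assert(...) and the newer ceph_assert(...) spelling.

```python
import re

# Hypothetical pattern: "<file>: <line>: FAILED assert(<cond>)" or
# "<file>: <line>: FAILED ceph_assert(<cond>)".
ASSERT_RE = re.compile(
    r"(?P<file>\S+): (?P<line>\d+): FAILED (?:ceph_)?assert\((?P<cond>.*)\)"
)


def parse_assert(text):
    """Return (file, line, condition) for the first assert failure, else None."""
    match = ASSERT_RE.search(text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line")), match.group("cond")
```

Called on the second log line above, this should give the AuthMonitor.cc path, 157, and "ret == 0", which makes it easy to see whether two reports failed on the same condition.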
Updated by Greg Farnum almost 5 years ago
- Tracker changed from Bug to Support
- Status changed from New to Closed
- Priority changed from High to Normal
That assert means there was a read error when the monitor tried to get data off of disk. Check your disk!
Updated by sdkfzv sdkfzv almost 5 years ago
Greg Farnum wrote:
That assert means there was a read error when the monitor tried to get data off of disk. Check your disk!
Thanks.
But no errors were found on the disk. I copied the data/store.db/ directory of the failed monitor to a new monitor, and the new one hit the same assert, so the data in the DB may be wrong.
The value of "ret" is -ENOENT, which means the entry was not found, so is it possible that the data was not written to disk correctly? I am not sure, though, because I lost the logs from before the host restarted.
Updated by Greg Farnum almost 5 years ago
Ah, ENOENT might be a code bug. Unless you have debug logs of the monitor from when it was writing that data to disk I don't think we can do much with it though. (Also, I think we found and fixed a couple bugs around that since 12.2.7, but I can't find them in a quick search.)
Probably best to wipe the monitor and re-add a new one.
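To illustrate the -ENOENT case discussed above: the toy model below assumes the monitor replays stored versions one at a time and treats any failed read as fatal. ToyStore and replay are hypothetical names, and this is not the Ceph implementation, only a sketch of why a single missing entry in the store is enough to trip assert(ret == 0) on every startup, including on a copy of the same store.db.

```python
import errno


class ToyStore:
    """Hypothetical stand-in for the monitor's key/value store (not Ceph code)."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, prefix, version):
        """Return (0, value) on success, or (-ENOENT, None) if the key is absent."""
        key = (prefix, version)
        if key not in self._data:
            return -errno.ENOENT, None
        return 0, self._data[key]


def replay(store, prefix, first, last):
    """Read versions first..last in order, asserting that each read succeeds."""
    applied = []
    for version in range(first, last + 1):
        ret, value = store.get(prefix, version)
        # Same shape as assert(ret == 0): any failed read aborts the replay.
        assert ret == 0, f"read of {prefix}/{version} failed: ret={ret}"
        applied.append(value)
    return applied
```

With a store that holds versions 1, 2 and 4 but not 3, replay(store, "auth", 1, 4) raises AssertionError at version 3. The disk itself can be healthy; the entry simply is not there.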
Updated by Brad Hubbard over 3 years ago
https://github.com/facebook/rocksdb/issues/5558 shows the same issue.
Updated by Neha Ojha over 3 years ago
- Related to Bug #40712: ceph-mon crash with assert(err == 0) after rocksdb->get added
Updated by Neha Ojha over 3 years ago
- Status changed from Closed to New
- Priority changed from Normal to High
Updated by Neha Ojha over 2 years ago
- Has duplicate Bug #52178: crash: virtual void AuthMonitor::update_from_paxos(bool*): assert(ret == 0) added
Updated by Neha Ojha over 2 years ago
- Has duplicate Bug #52156: crash: virtual void OSDMonitor::update_from_paxos(bool*): assert(err == 0) added
Updated by Sage Weil over 2 years ago
- Tracker changed from Support to Bug
- Regression set to No
- Severity set to 3 - minor
Updated by Telemetry Bot about 2 years ago
- Crash signature (v1) updated (diff)
- Crash signature (v2) updated (diff)
- Affected Versions v14.2.22, v15.2.10, v15.2.13, v15.2.14, v15.2.15, v15.2.7, v15.2.8, v15.2.9, v16.2.6, v16.2.7 added
Assert condition: ret == 0
Assert function: virtual void AuthMonitor::update_from_paxos(bool*)
Sanitized backtrace:
AuthMonitor::update_from_paxos(bool*)
PaxosService::refresh(bool*)
Monitor::refresh_from_paxos(bool*)
Monitor::init_paxos()
Monitor::preinit()
Crash dump sample:
{
    "assert_condition": "ret == 0",
    "assert_file": "mon/AuthMonitor.cc",
    "assert_func": "virtual void AuthMonitor::update_from_paxos(bool*)",
    "assert_line": 316,
    "assert_msg": "mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f519792a700 time 2022-03-13T05:45:59.952887+0000\nmon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f518c7e4c20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f518eaa8ba3]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f518eaa8d6c]",
        "(AuthMonitor::update_from_paxos(bool*)+0x2657) [0x563c7513fec7]",
        "(PaxosService::refresh(bool*)+0x10e) [0x563c751fd29e]",
        "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c750ae2dc]",
        "(Monitor::init_paxos()+0x10c) [0x563c750ae5ec]",
        "(Monitor::preinit()+0xd30) [0x563c750dbaa0]",
        "main()",
        "__libc_start_main()",
        "_start()"
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2022-03-13T05:45:59.957404Z_73f83a21-e2de-499d-bfee-1a322e60dc11",
    "entity_name": "mon.49425e5435c06178019e534413cc41d992e230b2",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mon",
    "stack_sig": "3fdd4ce4b0d6529edfd9648756a49bcc247aa294fddd59d16b79cbb5b5fc93a8",
    "timestamp": "2022-03-13T05:45:59.957404Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.13.0-35-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#40-Ubuntu SMP Mon Mar 7 08:03:10 UTC 2022"
}
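A crash dump in this shape can be compared with other reports by stripping the per-build offsets and addresses from each frame. The sketch below is an assumption about how one might do that by hand; strip_frame and summarize are hypothetical helpers, and this is not the algorithm the telemetry bot uses to produce its sanitized backtrace or crash signatures.

```python
import json
import re

# Trailing " [0x<address>]" and "+0x<offset>" parts differ between builds.
_ADDRESS = re.compile(r"\s*\[0x[0-9a-f]+\]$")
_OFFSET = re.compile(r"\+0x[0-9a-f]+$")


def strip_frame(frame):
    """Drop the address and, for "(symbol+0xNN)" frames, the offset as well."""
    frame = _ADDRESS.sub("", frame)
    if frame.startswith("(") and frame.endswith(")"):
        frame = _OFFSET.sub("", frame[1:-1])
    return frame


def summarize(dump_text):
    """Return the version, assert location, and stripped frames of a crash dump."""
    dump = json.loads(dump_text)
    return {
        "ceph_version": dump["ceph_version"],
        "assert": (dump["assert_file"], dump["assert_line"], dump["assert_condition"]),
        "frames": [strip_frame(f) for f in dump["backtrace"]],
    }
```

Frames that are bare symbols such as "main()" pass through unchanged, and library frames without a symbol keep their library path and offset.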
Updated by Radoslaw Zarzynski over 1 year ago
- Related to Bug #58305: src/mon/AuthMonitor.cc: FAILED ceph_assert(version > keys_ver) added