Bug #40777

open

hit assert in AuthMonitor::update_from_paxos

Added by sdkfzv sdkfzv almost 5 years ago. Updated about 2 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):

0810347d2b7e07374c20a010e7ca411ca125d50c01528c53c2e7d8556e1ef3ee
0d54c46a0fb63aa4b54e651261bdf5fd8f1a40f76131cbee615df5e5b8d412c7
3fdd4ce4b0d6529edfd9648756a49bcc247aa294fddd59d16b79cbb5b5fc93a8
47453f0f1c52fc416989fa4834738e749589093ee2e023ce22ca034c17441e3f
5cc0dd2441a47d889aadcd5ab5fd98685bea94235e4c052c8eb2fe64819d6012
5e65c408af9278ef1649d782df52b4b944463a56a06c5f5a4628735c3ed71329
69e63e0e1ca4aadc0d3ce45b8820b2717654642518a78a41783b1010b47c2334
c961c55e7b3d0a454cbb0f8ba37201e2a25f07d575dfc1303cab2f7b211cbab5
d824e0db27c500cc0793b2f262a6d29db2cb075bb14b2162b72d382f232bec9a
e4de3f9b33f756f03d809432dc35f5e9bbef8eba284d91af5905a591eb680b92
ed678e97c3a893fd090c48f06f8931a04d01db69fcf059c0eed46070360cc8bc


Description

I created the Ceph cluster with Rook (https://github.com/rook/rook); the Ceph version is 12.2.7 stable.
After I rebooted the host, 1 of the 3 monitors hit the assert:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f4bf21c3160]
2: (AuthMonitor::update_from_paxos(bool*)+0x159d) [0x7f4bf1fd50fd]
3: (PaxosService::refresh(bool*)+0x1ae) [0x7f4bf20a0d7e]
4: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f4bf1f6a57b]
5: (Monitor::init_paxos()+0x115) [0x7f4bf1f6a9c5]
6: (Monitor::preinit()+0x9c6) [0x7f4bf1f6b3e6]
7: (main()+0x4012) [0x7f4bf1e9b042]
8: (__libc_start_main()+0xf5) [0x7f4bee24c445]
9: (()+0x3afd5e) [0x7f4bf1f3fd5e]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I only found some similar but very old issues, and their fixes are already in the code.


Related issues 4 (2 open, 2 closed)

Related to RADOS - Bug #40712: ceph-mon crash with assert(err == 0) after rocksdb->get (New)
Related to RADOS - Bug #58305: src/mon/AuthMonitor.cc: FAILED ceph_assert(version > keys_ver) (Need More Info)
Has duplicate RADOS - Bug #52178: crash: virtual void AuthMonitor::update_from_paxos(bool*): assert(ret == 0) (Duplicate)
Has duplicate RADOS - Bug #52156: crash: virtual void OSDMonitor::update_from_paxos(bool*): assert(err == 0) (Duplicate)

Actions #1

Updated by Joao Eduardo Luis almost 5 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Priority changed from Normal to High
  • Source set to Community (user)
  • Component(RADOS) Monitor added

Is this reproducible? If so, can you add mon logs (ideally both for peons and leader), at 'debug mon = 10', 'debug paxos = 10', and 'debug ms = 1'?

Actions #2

Updated by sdkfzv sdkfzv almost 5 years ago

rook-ceph-mon0:      0> 2019-07-02 12:16:42.239710 7f4bf1b77ec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f4bf1b77ec0 time 2019-07-02 12:16:42.237387
rook-ceph-mon0: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.7/rpm/el7/BUILD/ceph-12.2.7/src/mon/AuthMonitor.cc: 157: FAILED assert(ret == 0)
Actions #3

Updated by Greg Farnum almost 5 years ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed
  • Priority changed from High to Normal

That assert means there was a read error when the monitor tried to get data off of disk. Check your disk!

Actions #4

Updated by sdkfzv sdkfzv almost 5 years ago

Greg Farnum wrote:

That assert means there was a read error when the monitor tried to get data off of disk. Check your disk!

Thanks.
But no error seems to have been found on the disk. I copied the data/store.db/ of the failed monitor to a new monitor, and the new one hit the same assert. So the data in the DB may be wrong.
The value of "ret" is "-ENOENT", which means the entry was not found, so is it possible that the data was not written to disk correctly? I am not sure about this, though, because I lost the logs from before the host restarted.
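
To make the failure mode described above concrete, here is a toy model of how a single missing version in the store can surface as an `assert(ret == 0)` with `ret == -ENOENT`. This is only a sketch and not Ceph's code: `ToyStore`, `get_version` and `replay` are hypothetical names, and the replay loop is an assumption about the general shape of such logic.

```python
import errno


class ToyStore:
    """Hypothetical in-memory stand-in for a monitor's key-value store."""

    def __init__(self, versions):
        self.versions = dict(versions)  # version number -> payload

    def get_version(self, ver):
        """Return (0, payload) on success, or (-ENOENT, None) if the key is absent."""
        if ver not in self.versions:
            return -errno.ENOENT, None
        return 0, self.versions[ver]


def replay(store, keys_ver, last_committed):
    """Apply every version in (keys_ver, last_committed], asserting each read succeeds."""
    applied = []
    while last_committed > keys_ver:
        ret, payload = store.get_version(keys_ver + 1)
        # Assumed analogue of the failing check: any read error aborts startup.
        assert ret == 0, f"read of version {keys_ver + 1} failed: ret={ret}"
        applied.append(payload)
        keys_ver += 1
    return applied


if __name__ == "__main__":
    intact = ToyStore({1: "a", 2: "b", 3: "c"})
    print(replay(intact, 0, 3))

    damaged = ToyStore({1: "a", 3: "c"})  # version 2 never reached the store
    try:
        replay(damaged, 0, 3)
    except AssertionError as exc:
        print("assert hit:", exc)
```

In this model, copying the damaged store to a fresh monitor reproduces the assert, which matches the observation above: the failure follows the data, not the disk.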

Actions #5

Updated by Greg Farnum almost 5 years ago

Ah, ENOENT might be a code bug. Unless you have debug logs of the monitor from when it was writing that data to disk I don't think we can do much with it though. (Also, I think we found and fixed a couple bugs around that since 12.2.7, but I can't find them in a quick search.)

Probably best to wipe the monitor and re-add a new one.

Actions #7

Updated by Neha Ojha over 3 years ago

  • Related to Bug #40712: ceph-mon crash with assert(err == 0) after rocksdb->get added
Actions #8

Updated by Neha Ojha over 3 years ago

  • Status changed from Closed to New
  • Priority changed from Normal to High
Actions #9

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52178: crash: virtual void AuthMonitor::update_from_paxos(bool*): assert(ret == 0) added
Actions #10

Updated by Neha Ojha over 2 years ago

  • Has duplicate Bug #52156: crash: virtual void OSDMonitor::update_from_paxos(bool*): assert(err == 0) added
Actions #11

Updated by Sage Weil over 2 years ago

  • Tracker changed from Support to Bug
  • Regression set to No
  • Severity set to 3 - minor
Actions #12

Updated by Neha Ojha about 2 years ago

  • Priority changed from High to Normal
Actions #13

Updated by Telemetry Bot about 2 years ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v14.2.22, v15.2.10, v15.2.13, v15.2.14, v15.2.15, v15.2.7, v15.2.8, v15.2.9, v16.2.6, v16.2.7 added

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=2fba2ef6afdb444f87f1a9c175763a4e1649b21073143399609f1291da0f1bf8

Assert condition: ret == 0
Assert function: virtual void AuthMonitor::update_from_paxos(bool*)

Sanitized backtrace:

    AuthMonitor::update_from_paxos(bool*)
    PaxosService::refresh(bool*)
    Monitor::refresh_from_paxos(bool*)
    Monitor::init_paxos()
    Monitor::preinit()

Crash dump sample:
{
    "assert_condition": "ret == 0",
    "assert_file": "mon/AuthMonitor.cc",
    "assert_func": "virtual void AuthMonitor::update_from_paxos(bool*)",
    "assert_line": 316,
    "assert_msg": "mon/AuthMonitor.cc: In function 'virtual void AuthMonitor::update_from_paxos(bool*)' thread 7f519792a700 time 2022-03-13T05:45:59.952887+0000\nmon/AuthMonitor.cc: 316: FAILED ceph_assert(ret == 0)",
    "assert_thread_name": "ceph-mon",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12c20) [0x7f518c7e4c20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f518eaa8ba3]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f518eaa8d6c]",
        "(AuthMonitor::update_from_paxos(bool*)+0x2657) [0x563c7513fec7]",
        "(PaxosService::refresh(bool*)+0x10e) [0x563c751fd29e]",
        "(Monitor::refresh_from_paxos(bool*)+0x18c) [0x563c750ae2dc]",
        "(Monitor::init_paxos()+0x10c) [0x563c750ae5ec]",
        "(Monitor::preinit()+0xd30) [0x563c750dbaa0]",
        "main()",
        "__libc_start_main()",
        "_start()" 
    ],
    "ceph_version": "16.2.7",
    "crash_id": "2022-03-13T05:45:59.957404Z_73f83a21-e2de-499d-bfee-1a322e60dc11",
    "entity_name": "mon.49425e5435c06178019e534413cc41d992e230b2",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mon",
    "stack_sig": "3fdd4ce4b0d6529edfd9648756a49bcc247aa294fddd59d16b79cbb5b5fc93a8",
    "timestamp": "2022-03-13T05:45:59.957404Z",
    "utsname_machine": "x86_64",
    "utsname_release": "5.13.0-35-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#40-Ubuntu SMP Mon Mar 7 08:03:10 UTC 2022" 
}
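
For comparing several crash dumps like the sample above, a small script can pull out the assert location and the symbolized frames. This is a rough sketch, not the telemetry service's actual sanitizing algorithm: the function name `summarize_crash` and the frame pattern `(symbol+0xOFFSET) [0xADDRESS]` are assumptions based only on the sample shown here.

```python
import json
import re

# Assumed frame shape, taken from the sample dump: "(symbol+0xOFFSET) [0xADDRESS]".
_FRAME = re.compile(r"^\((?P<sym>.+)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$")


def summarize_crash(dump_text):
    """Extract the assert location and symbolized frames from a crash dump (JSON text)."""
    dump = json.loads(dump_text)
    frames = []
    for frame in dump.get("backtrace", []):
        match = _FRAME.match(frame.strip())
        if match:  # frames without a symbol+offset (e.g. "gsignal()") are skipped
            frames.append(match.group("sym"))
    return {
        "version": dump.get("ceph_version"),
        "assert": dump.get("assert_condition"),
        "where": f'{dump.get("assert_file")}:{dump.get("assert_line")}',
        "frames": frames,
    }
```

Note that this keeps the `ceph::__ceph_assert_fail` frame, which the sanitized backtrace above omits, so the output is a superset of that list rather than a reproduction of it.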

Actions #14

Updated by Radoslaw Zarzynski over 1 year ago

  • Related to Bug #58305: src/mon/AuthMonitor.cc: FAILED ceph_assert(version > keys_ver) added