Bug #49809

open

1 out of 3 mon crashed in MonitorDBStore::get_synchronizer

Added by Christian Rohmann about 3 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor
Component(RADOS): Monitor

Description

One of our three mons (mon.ctrl-01) crashed. We observed no other issues on the machine or in the cluster.

I attached the ceph-mon.log to this ticket.

ceph crash info:

{
    "os_version_id": "18.04", 
    "utsname_release": "4.15.0-136-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "mon.ctrl-01", 
    "timestamp": "2021-03-14 18:00:28.686018Z", 
    "process_name": "ceph-mon", 
    "utsname_machine": "x86_64", 
    "utsname_sysname": "Linux", 
    "os_version": "18.04.5 LTS (Bionic Beaver)", 
    "os_id": "ubuntu", 
    "utsname_version": "#140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021", 
    "backtrace": [
        "(()+0x12980) [0x7f1c0cfbe980]", 
        "(MonitorDBStore::get_synchronizer(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+0x3f) [0x555e110c756f]", 
        "(Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0xe9) [0x555e11089379]", 
        "(Monitor::scrub()+0x298) [0x555e11092658]", 
        "(Monitor::scrub_start()+0x137) [0x555e11092847]", 
        "(C_MonContext::finish(int)+0x39) [0x555e1106a899]", 
        "(Context::complete(int)+0x9) [0x555e110adc69]", 
        "(SafeTimer::timer_thread()+0x190) [0x7f1c0e1dbd20]", 
        "(SafeTimerThread::entry()+0xd) [0x7f1c0e1dd5ed]", 
        "(()+0x76db) [0x7f1c0cfb36db]", 
        "(clone()+0x3f) [0x7f1c0c19971f]" 
    ], 
    "utsname_hostname": "ctrl-01", 
    "crash_id": "2021-03-14_18:00:28.686018Z_274d6160-d49e-4395-9d8c-64d4dcad53f5", 
    "ceph_version": "14.2.16" 
}
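
The backtrace is easier to follow when each frame is reduced to its function name. The Python sketch below is a hypothetical helper, not part of Ceph. It assumes every frame has the form "(<symbol>+0x<offset>) [0x<address>]", as in the report above, and works on an abbreviated copy of the report in which long template argument lists are shortened to "(...)". Read from the outermost frame inward, the frames suggest the crash happened in the mon's timer thread: SafeTimer completes a C_MonContext, which runs Monitor::scrub_start -> Monitor::scrub -> Monitor::_scrub, which then calls MonitorDBStore::get_synchronizer.

```python
import json
import re

# Abbreviated copy of the crash report above: long template argument lists
# are shortened to "(...)"; offsets and addresses are unchanged.
CRASH_INFO = r'''
{
  "entity_name": "mon.ctrl-01",
  "ceph_version": "14.2.16",
  "backtrace": [
    "(()+0x12980) [0x7f1c0cfbe980]",
    "(MonitorDBStore::get_synchronizer(...)+0x3f) [0x555e110c756f]",
    "(Monitor::_scrub(...)+0xe9) [0x555e11089379]",
    "(Monitor::scrub()+0x298) [0x555e11092658]",
    "(Monitor::scrub_start()+0x137) [0x555e11092847]",
    "(C_MonContext::finish(int)+0x39) [0x555e1106a899]",
    "(Context::complete(int)+0x9) [0x555e110adc69]",
    "(SafeTimer::timer_thread()+0x190) [0x7f1c0e1dbd20]",
    "(SafeTimerThread::entry()+0xd) [0x7f1c0e1dd5ed]",
    "(()+0x76db) [0x7f1c0cfb36db]",
    "(clone()+0x3f) [0x7f1c0c19971f]"
  ]
}
'''

# Assumed frame layout: "(<symbol>+0x<offset>) [0x<address>]".
FRAME_RE = re.compile(r"^\((?P<symbol>.*)\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]$")


def call_chain(report: dict) -> list[str]:
    """Return the function name of each frame, innermost frame first.

    Frames without a resolved symbol, such as "(()+0x12980)", become "?".
    """
    names = []
    for frame in report["backtrace"]:
        match = FRAME_RE.match(frame)
        symbol = match.group("symbol") if match else ""
        # Drop the parameter list: keep everything before the first "(".
        name = symbol.split("(", 1)[0]
        names.append(name or "?")
    return names


if __name__ == "__main__":
    chain = call_chain(json.loads(CRASH_INFO))
    # Print from the outermost frame (thread start) to the crash site.
    print(" -> ".join(reversed(chain)))
```

Stripping the parameter lists with a simple split is enough here because every resolved frame in this report starts with a plain qualified name; a general-purpose tool would need a real C++ demangler.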

ceph.log reported:

2021-03-14 18:00:00.000161 mon.ctrl-01 (mon.0) 224915 : cluster [INF] overall HEALTH_OK
2021-03-14 18:00:27.513970 mon.ctrl-03 (mon.2) 146401 : cluster [INF] mon.ctrl-03 calling monitor election
2021-03-14 18:00:27.516169 mon.ctrl-02 (mon.1) 122690 : cluster [INF] mon.ctrl-02 calling monitor election
2021-03-14 18:00:39.412761 mon.ctrl-02 (mon.1) 122691 : cluster [INF] mon.ctrl-02 is new leader, mons ctrl-02,ctrl-03 in quorum (ranks 1,2)
2021-03-14 18:00:39.423504 mon.ctrl-02 (mon.1) 122696 : cluster [WRN] Health check failed: 1/3 mons down, quorum ctrl-02,ctrl-03 (MON_DOWN)
2021-03-14 18:00:39.431667 mon.ctrl-02 (mon.1) 122697 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ctrl-02,ctrl-03
2021-03-14 18:00:40.733430 mon.ctrl-01 (mon.0) 1 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:40.742270 mon.ctrl-01 (mon.0) 2 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:45.745709 mon.ctrl-01 (mon.0) 3 : cluster [INF] mon.ctrl-01 is new leader, mons ctrl-01,ctrl-03 in quorum (ranks 0,2)
2021-03-14 18:00:51.930256 mon.ctrl-02 (mon.1) 122710 : cluster [INF] mon.ctrl-02 calling monitor election
2021-03-14 18:00:51.934337 mon.ctrl-01 (mon.0) 4 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ctrl-02,ctrl-03
2021-03-14 18:00:51.934422 mon.ctrl-01 (mon.0) 5 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:51.960175 mon.ctrl-01 (mon.0) 6 : cluster [INF] mon.ctrl-01 is new leader, mons ctrl-01,ctrl-02,ctrl-03 in quorum (ranks 0,1,2)
2021-03-14 18:00:51.970642 mon.ctrl-01 (mon.0) 11 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ctrl-02,ctrl-03)
2021-03-14 18:00:51.970670 mon.ctrl-01 (mon.0) 12 : cluster [INF] Cluster is now healthy
2021-03-14 18:00:51.979150 mon.ctrl-01 (mon.0) 13 : cluster [INF] overall HEALTH_OK
2021-03-14 18:03:47.551678 mon.ctrl-01 (mon.0) 49 : cluster [WRN] Health check failed: 1 daemons have recently crashed (RECENT_CRASH)
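
The ceph.log timestamps bound how long the disruption lasted. The Python sketch below is a hypothetical helper; the event labels are our own, and all timestamps are copied from the ceph.log excerpt above, so they share one clock. The crash timestamp from `ceph crash info` is deliberately left out because it is recorded in UTC and may come from a different time zone than the cluster log.

```python
from datetime import datetime

# Timestamps copied verbatim from the ceph.log excerpt above.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
EVENTS = {
    "election_called": "2021-03-14 18:00:27.513970",   # mon.ctrl-03 calls election
    "mon_down_raised": "2021-03-14 18:00:39.423504",   # MON_DOWN health check failed
    "full_quorum": "2021-03-14 18:00:51.960175",       # ctrl-01,ctrl-02,ctrl-03 in quorum
    "mon_down_cleared": "2021-03-14 18:00:51.970642",  # MON_DOWN health check cleared
}


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two labelled events in EVENTS."""
    t0 = datetime.strptime(EVENTS[start], LOG_FORMAT)
    t1 = datetime.strptime(EVENTS[end], LOG_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    down = seconds_between("mon_down_raised", "mon_down_cleared")
    recovery = seconds_between("election_called", "full_quorum")
    print(f"MON_DOWN active: {down:.3f} s")
    print(f"first election call to full quorum: {recovery:.3f} s")
```

Run as a script, it prints 12.547 s for the MON_DOWN window and 24.446 s from the first election call to full quorum.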

Files

ceph-mon.log.gz (220 KB), added by Christian Rohmann, 03/15/2021 05:14 PM