Bug #49809

1 out of 3 mon crashed in MonitorDBStore::get_synchronizer

Added by Christian Rohmann about 3 years ago. Updated almost 3 years ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): Monitor
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We experienced a single mon crash (out of 3 mons); we observed no other issues on the machine or the cluster.

I attached the ceph-mon.log to this ticket.

ceph crash info:

{
    "os_version_id": "18.04", 
    "utsname_release": "4.15.0-136-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "mon.ctrl-01", 
    "timestamp": "2021-03-14 18:00:28.686018Z", 
    "process_name": "ceph-mon", 
    "utsname_machine": "x86_64", 
    "utsname_sysname": "Linux", 
    "os_version": "18.04.5 LTS (Bionic Beaver)", 
    "os_id": "ubuntu", 
    "utsname_version": "#140-Ubuntu SMP Thu Jan 28 05:20:47 UTC 2021", 
    "backtrace": [
        "(()+0x12980) [0x7f1c0cfbe980]", 
        "(MonitorDBStore::get_synchronizer(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+0x3f) [0x555e110c756f]", 
        "(Monitor::_scrub(ScrubResult*, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, int*)+0xe9) [0x555e11089379]", 
        "(Monitor::scrub()+0x298) [0x555e11092658]", 
        "(Monitor::scrub_start()+0x137) [0x555e11092847]", 
        "(C_MonContext::finish(int)+0x39) [0x555e1106a899]", 
        "(Context::complete(int)+0x9) [0x555e110adc69]", 
        "(SafeTimer::timer_thread()+0x190) [0x7f1c0e1dbd20]", 
        "(SafeTimerThread::entry()+0xd) [0x7f1c0e1dd5ed]", 
        "(()+0x76db) [0x7f1c0cfb36db]", 
        "(clone()+0x3f) [0x7f1c0c19971f]" 
    ], 
    "utsname_hostname": "ctrl-01", 
    "crash_id": "2021-03-14_18:00:28.686018Z_274d6160-d49e-4395-9d8c-64d4dcad53f5", 
    "ceph_version": "14.2.16" 
}

ceph.log reported:

2021-03-14 18:00:00.000161 mon.ctrl-01 (mon.0) 224915 : cluster [INF] overall HEALTH_OK
2021-03-14 18:00:27.513970 mon.ctrl-03 (mon.2) 146401 : cluster [INF] mon.ctrl-03 calling monitor election
2021-03-14 18:00:27.516169 mon.ctrl-02 (mon.1) 122690 : cluster [INF] mon.ctrl-02 calling monitor election
2021-03-14 18:00:39.412761 mon.ctrl-02 (mon.1) 122691 : cluster [INF] mon.ctrl-02 is new leader, mons ctrl-02,ctrl-03 in quorum (ranks 1,2)
2021-03-14 18:00:39.423504 mon.ctrl-02 (mon.1) 122696 : cluster [WRN] Health check failed: 1/3 mons down, quorum ctrl-02,ctrl-03 (MON_DOWN)
2021-03-14 18:00:39.431667 mon.ctrl-02 (mon.1) 122697 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ctrl-02,ctrl-03
2021-03-14 18:00:40.733430 mon.ctrl-01 (mon.0) 1 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:40.742270 mon.ctrl-01 (mon.0) 2 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:45.745709 mon.ctrl-01 (mon.0) 3 : cluster [INF] mon.ctrl-01 is new leader, mons ctrl-01,ctrl-03 in quorum (ranks 0,2)
2021-03-14 18:00:51.930256 mon.ctrl-02 (mon.1) 122710 : cluster [INF] mon.ctrl-02 calling monitor election
2021-03-14 18:00:51.934337 mon.ctrl-01 (mon.0) 4 : cluster [WRN] overall HEALTH_WARN 1/3 mons down, quorum ctrl-02,ctrl-03
2021-03-14 18:00:51.934422 mon.ctrl-01 (mon.0) 5 : cluster [INF] mon.ctrl-01 calling monitor election
2021-03-14 18:00:51.960175 mon.ctrl-01 (mon.0) 6 : cluster [INF] mon.ctrl-01 is new leader, mons ctrl-01,ctrl-02,ctrl-03 in quorum (ranks 0,1,2)
2021-03-14 18:00:51.970642 mon.ctrl-01 (mon.0) 11 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum ctrl-02,ctrl-03)
2021-03-14 18:00:51.970670 mon.ctrl-01 (mon.0) 12 : cluster [INF] Cluster is now healthy
2021-03-14 18:00:51.979150 mon.ctrl-01 (mon.0) 13 : cluster [INF] overall HEALTH_OK
2021-03-14 18:03:47.551678 mon.ctrl-01 (mon.0) 49 : cluster [WRN] Health check failed: 1 daemons have recently crashed (RECENT_CRASH)

Files

ceph-mon.log.gz (220 KB) - Christian Rohmann, 03/15/2021 05:14 PM
#1

Updated by Neha Ojha about 3 years ago

  • Status changed from New to Need More Info

Do you happen to have a coredump for this or can you provide a copy of the mon store db?

#2

Updated by Christian Rohmann about 3 years ago

Neha Ojha wrote:

Do you happen to have a coredump for this or can you provide a copy of the mon store db?

Unfortunately there is no coredump (unless Ceph has stored one somewhere I did not look).
I'll gladly provide a copy of the mon store db - just let me know how to create / dump what you need.

#3

Updated by Brad Hubbard about 3 years ago

Hi Christian,

From the log this was a segfault in MonitorDBStore::get_synchronizer.

2021-03-14 18:00:28.679 7f1c0236f700 -1 *** Caught signal (Segmentation fault) **

If you could archive the store.db directory of the mon and upload that with
ceph-post-file that might be helpful. Please also include the dmesg and syslog
files from the time the crash occurred. Let us know the id returned by
ceph-post-file in a comment so we can find it on our side.

In the meantime, I'm working on narrowing down the exact line where the segfault
took place, which will give us a better idea of what specific data caused the
issue.

#4

Updated by Brad Hubbard about 3 years ago

  • Assignee set to Brad Hubbard
#5

Updated by Brad Hubbard about 3 years ago

So the offset into MonitorDBStore::get_synchronizer where we segfaulted was
0x3f.

If we set up a binary compatible environment we can work out what line and
instruction that was in gdb.

(gdb) p/d 0x3f
$2 = 63
(gdb) disass /m MonitorDBStore::get_synchronizer
...
520 if (!key.first.empty() && !key.second.empty())
0x000055555566a56f <+63>: cmpq $0x0,0x8(%rbx)

So that is line 520 in src/mon/MonitorDBStore.h

(gdb) l src/mon/MonitorDBStore.h:520
515       Synchronizer get_synchronizer(pair<string,string> &key,
516                                     set<string> &prefixes) {
517         KeyValueDB::WholeSpaceIterator iter;
518         iter = db->get_wholespace_iterator();
519
520         if (!key.first.empty() && !key.second.empty())
521           iter->upper_bound(key.first, key.second);
522         else
523           iter->seek_to_first();
524

So this looks like there is something wrong with the 'key' pair passed in to
this function from Monitor::_scrub. An analysis of the database might tell us
which key/value pair that was and how it is corrupt.

I'd expect to see this every day on this mon, since mon_scrub_interval is by
default one day, but perhaps the offending kv pair was successfully
overwritten or deleted, or perhaps this was a memory glitch on the machine
itself. If that's the case, in the absence of a core dump, we are unlikely to be
able to determine what the issue was.
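
For anyone following along, here is a minimal standalone sketch (not Ceph code; the type and function names are illustrative) of why a bad 'key' reference would fault exactly on that comparison: std::string::empty() boils down to a read of the string's length member, which appears to be what the cmpq $0x0,0x8(%rbx) at +63 is doing, so the first dereference of an invalid pair reference would be exactly where a segfault surfaces.

#include <string>
#include <utility>

// Illustrative stand-in for the pair<string,string> key handed to
// MonitorDBStore::get_synchronizer (sketch only, not the real class).
using KeyPair = std::pair<std::string, std::string>;

bool resume_mid_sync(KeyPair &key) {
  // Equivalent shape to MonitorDBStore.h:520: !key.first.empty() is a load
  // of key.first's length member through the reference held in %rbx. If
  // 'key' referred to freed or garbage memory, this load is where the
  // crash would show up.
  return !key.first.empty() && !key.second.empty();
}

int main() {
  KeyPair key{"paxos", "123"};  // example values only
  return resume_mid_sync(key) ? 0 : 1;
}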

#6

Updated by Christian Rohmann about 3 years ago

Brad Hubbard wrote:

Hi Christian,

From the log this was a segfault in MonitorDBStore::get_synchronizer.

2021-03-14 18:00:28.679 7f1c0236f700 -1 *** Caught signal (Segmentation fault) **

If you could archive the store.db directory of the mon and upload that with
ceph-post-file that might be helpful. Please also include the dmesg and syslog
files from the time the crash occurred. Let us know the id returned by
ceph-post-file in a comment so we can find it on our side.

In the meantime, I'm working on narrowing down the exact line where the segfault
took place, which will give us a better idea of what specific data caused the
issue.

Thanks Brad for looking into this issue! I tarred the store.db and also added the ceph-mon log output.
But there is no more info beyond what I shared in the initial post. Nothing in dmesg or elsewhere.

Here you go with the ID:

ceph-post-file: 7a666273-a0c5-43b4-8764-e023a50edead

#7

Updated by Brad Hubbard about 3 years ago

Hi Christian,

Is there no message in syslog or dmesg about the segfault at all? That seems odd. Maybe you should check why your system is not creating coredumps and is not even logging application segfaults?

I'm beginning to think the previous line makes more sense in the context of what we are seeing, and therefore I'm leaning away from database corruption.

I'll explain.

./src/mon/MonitorDBStore.h:
518         iter = db->get_wholespace_iterator();
   0x000000000026a55b <+43>:    mov    (%rsi),%rax
   0x000000000026a55e <+46>:    mov    %rdi,%r12
   0x000000000026a561 <+49>:    mov    %rdx,%rbx
   0x000000000026a564 <+52>:    mov    %rcx,%r15
   0x000000000026a567 <+55>:    lea    0x20(%rsp),%rdi
   0x000000000026a56c <+60>:    callq  *0x60(%rax) <<--------------- HERE

519
520         if (!key.first.empty() && !key.second.empty())
   0x000000000026a56f <+63>:    cmpq   $0x0,0x8(%rbx)

The position counter (the instruction pointer, if you like) is pointing to +63, but it only moves there as we execute +60, and that appears to be the call to db->get_wholespace_iterator().

If I run the binary under gdb and set a breakpoint on get_wholespace_iterator() I see the following.

(gdb) bt 3
#0  RocksDBStore::get_wholespace_iterator (this=0x555556754300) at ./src/kv/RocksDBStore.cc:1525
#1  0x000055555566a56f in MonitorDBStore::get_synchronizer (this=<optimized out>, key={...}, prefixes=std::set with 17 elements = {...}) at ./src/mon/MonitorDBStore.h:518
#2  0x000055555562c379 in Monitor::_scrub (this=0x55555677a800, r=0x5555576d0028, start=0x5555573facd8, num_keys=0x7fffe1423c8c) at ./src/mon/Monitor.cc:5539
(gdb) f 1
#1  0x000055555566a56f in MonitorDBStore::get_synchronizer (this=<optimized out>, key={...}, prefixes=std::set with 17 elements = {...}) at ./src/mon/MonitorDBStore.h:518
518         iter = db->get_wholespace_iterator();
(gdb) x/i $pc
=> 0x55555566a56f <MonitorDBStore::get_synchronizer(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+63>:     cmpq   $0x0,0x8(%rbx)
(gdb) x/a 0x000055555566a56f
0x55555566a56f <MonitorDBStore::get_synchronizer(std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)+63>:        0x748b4c00087b8348
(gdb) info reg rax
rax            0x555555eacdc8   93825002032584
(gdb) p/x 0x555555eacdc8+0x60
$10 = 0x555555eace28                            
(gdb) x/a 0x555555eace28
0x555555eace28 <_ZTV12RocksDBStore+112>:        0x55555585c820 <RocksDBStore::get_wholespace_iterator()>

Without going into ridiculous detail, this is enough to convince me that the pointer to the RocksDBStore instance (db) in MonitorDBStore::get_synchronizer was invalid, and that's why we end up with the weird frame 0 on the stack.

"(()+0x12980) [0x7f1c0cfbe980]",

It looks to me like we are actually trying to call a function at 0x12980, indicating that 'db' is uninitialised or has been, or is in the process of being, deleted. This is a better explanation of why you have only seen this once than the idea that a corrupted database entry was later deleted/changed/fixed. I think there wasn't any corruption; what we are seeing here is a rare race condition which leaves the db structure invalid at that point in the code under exactly the right circumstances. That also explains why I can find no corruption in the database you uploaded and can list/dump all keys and values without issue.

I need to look further into how this might come about.
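
To make that failure mode concrete, here is a minimal standalone sketch (not Ceph code; KeyValueDB below is only a stand-in and MonStoreSketch is a made-up name): a virtual call such as db->get_wholespace_iterator() is dispatched indirectly through the object's vtable, so with an uninitialised or stale 'db' the indirect call can land on a garbage address, which is consistent with an unnamed frame like "(()+0x12980)" while the backtrace still shows the return address just past the call site (+63).

// Stand-in for Ceph's KeyValueDB interface; the shape is illustrative only.
struct KeyValueDB {
  virtual ~KeyValueDB() = default;
  virtual void *get_wholespace_iterator() = 0;  // dispatched via a vtable slot
};

// Stand-in for MonitorDBStore (made-up name).
struct MonStoreSketch {
  KeyValueDB *db = nullptr;  // never initialised, or already freed, in the bad case

  void *get_synchronizer() {
    // Compiles to roughly: load db, load db's vtable pointer, then an
    // indirect "callq *offset(%rax)" -- the +60 instruction in the real
    // binary. A null 'db' faults on the vtable load; a stale/garbage 'db'
    // can instead jump to a bogus address, matching the unnamed frame 0.
    return db->get_wholespace_iterator();
  }
};

int main() {
  MonStoreSketch store;        // store.db was never set up
  store.get_synchronizer();    // expected to SIGSEGV, by design
  return 0;
}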

#8

Updated by Brad Hubbard about 3 years ago

I think I've found an issue related to this message in the logs but I'll need to test that theory.

2021-03-14 18:00:28.399 7f1bffb6a700  1 mon.ctrl-01@0(electing) e6 collect_metadata :  no unique device id for : fallback method has no model nor serial'

Will update in the next couple of days.

#9

Updated by Christian Rohmann about 3 years ago

Brad Hubbard wrote:

I think I've found an issue related to this message in the logs but I'll need to test that theory.

[...]

Will update in the next couple of days.

Thank you very much Brad for really digging into this!
I really appreciate that such a report of a "single" and apparently "random" crash is taken this seriously!

#10

Updated by Christian Rohmann almost 3 years ago

Brad, were you able to find out more about the root cause of this crash?

#11

Updated by Brad Hubbard almost 3 years ago

Hi Christian,

No, unfortunately I hit a dead end on this as the log message issue was a red herring.

I'm afraid all we know is that the db pointer somehow ended up corrupted but we would need more information to work out why.
