Bug #41025: 2/3 mon process crash - complete cluster failure - Ceph - Ceph

Bug #41025

open

2/3 mon process crash - complete cluster failure

Added by Anonymous over 4 years ago. Updated over 4 years ago.

Status: New
Priority: Normal
Assignee:
Category: Monitor
Target version: v14.2.2
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description

We upgraded to Nautilus 14.2.2 (Ubuntu 18.04) a few days ago and today had some issues with MDS (I'll open another report for that).
First our MDS processes died and were unable to come back up. Then, while we were trying to fix the MDS (decreasing ranks, etc.), 2 of the 3 mon processes died as well.

Logs show:

-2> 2019-07-31 15:01:31.581 7f12a96fb700 10 log_client will send 2019-07-31 15:01:31.582687 mon.km-fsn-1-dc4-m1-797679 (mon.1) 4 : cluster [DBG] monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
-1> 2019-07-31 15:01:31.581 7f12a96fb700 5 mon.km-fsn-1-dc4-m1-797679@1(leader).paxos(paxos active c 38758417..38759000) is_readable = 1 - now=2019-07-31 15:01:31.582708 lease_expire=2019-07-31 15:01:36.581412 has v0 lc 38759000
0> 2019-07-31 15:01:31.585 7f12a96fb700 -1 *** Caught signal (Aborted) **
in thread 7f12a96fb700 thread_name:ms_dispatch

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
 1: (()+0x11390) [0x7f12b56ef390]
 2: (gsignal()+0x38) [0x7f12b4e3c428]
 3: (abort()+0x16a) [0x7f12b4e3e02a]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f12b6ec1155]
 5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f12b6eb5136]
 6: (()+0x8ad181) [0x7f12b6eb5181]
 7: (()+0x8b9394) [0x7f12b6ec1394]
 8: (std::__throw_out_of_range(char const*)+0x3f) [0x7f12b6eccabf]
 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x86db50]
 10: (MDSMonitor::tick()+0xc9) [0x86f3b9]
 11: (MDSMonitor::on_active()+0x28) [0x858a88]
 12: (PaxosService::_active()+0xdd) [0x7ac0dd]
 13: (Context::complete(int)+0x9) [0x6af0b9]
 14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6d7e68]
 15: (Paxos::finish_round()+0x76) [0x7a2826]
 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x7a3a2f]
 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x7a44db]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x6a9025]
 19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x6a9672]
 20: (Monitor::ms_dispatch(Message*)+0x26) [0x6d9506]
 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6d5576]
 22: (DispatchQueue::entry()+0x1219) [0x7f12b6b092f9]
 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f12b6bb9b9d]
 24: (()+0x76ba) [0x7f12b56e56ba]
 25: (clone()+0x6d) [0x7f12b4f0e41d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.km-fsn-1-dc4-m1-797679.log

Files

cephfail.log (157 KB)

Anonymous, 07/31/2019 01:29 PM


Updated by Anonymous over 4 years ago

File cephfail.log added

Attached the crash dump log.


Updated by Anonymous over 4 years ago

We're unable to recover because starting a second mon causes the first mon process to crash.


Updated by Anonymous over 4 years ago

ing c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685770 lease_expire=0.000000 has v0 lc 38759000
-26> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685779 lease_expire=0.000000 has v0 lc 38759000
-25> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685783 lease_expire=0.000000 has v0 lc 38759000
-24> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685790 lease_expire=0.000000 has v0 lc 38759000
-23> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685798 lease_expire=0.000000 has v0 lc 38759000
-22> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685808 lease_expire=0.000000 has v0 lc 38759000
-21> 2019-07-31 15:32:55.683 7f39c38c3700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685813 lease_expire=0.000000 has v0 lc 38759000
-20> 2019-07-31 15:32:55.683 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685902 lease_expire=0.000000 has v0 lc 38759000
-19> 2019-07-31 15:32:55.683 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685924 lease_expire=0.000000 has v0 lc 38759000
-18> 2019-07-31 15:32:55.683 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685949 lease_expire=0.000000 has v0 lc 38759000
-17> 2019-07-31 15:32:55.683 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.685959 lease_expire=0.000000 has v0 lc 38759000
-16> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699565 lease_expire=0.000000 has v0 lc 38759000
-15> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699576 lease_expire=0.000000 has v0 lc 38759000
-14> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699599 lease_expire=0.000000 has v0 lc 38759000
-13> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699607 lease_expire=0.000000 has v0 lc 38759000
-12> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699624 lease_expire=0.000000 has v0 lc 38759000
-11> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699631 lease_expire=0.000000 has v0 lc 38759000
-10> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699649 lease_expire=0.000000 has v0 lc 38759000
-9> 2019-07-31 15:32:55.695 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos recovering c 38758417..38759000) is_readable = 0 - now=2019-07-31 15:32:55.699657 lease_expire=0.000000 has v0 lc 38759000
-8> 2019-07-31 15:32:55.703 7f39c10be700 4 set_mon_vals no callback set
-7> 2019-07-31 15:32:55.707 7f39c10be700 4 mgrc handle_mgr_map Got map version 275
-6> 2019-07-31 15:32:55.707 7f39c10be700 4 mgrc handle_mgr_map Active mgr is now [v2:10.3.0.1:6801/1321146,v1:10.3.0.1:6802/1321146]
-5> 2019-07-31 15:32:55.707 7f39c10be700 0 log_channel(cluster) log [DBG] : monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
-4> 2019-07-31 15:32:55.707 7f39c10be700 10 log_client _send_to_mon log to self
-3> 2019-07-31 15:32:55.707 7f39c10be700 10 log_client log_queue is 3 last_log 3 sent 2 num 3 unsent 1 sending 1
-2> 2019-07-31 15:32:55.707 7f39c10be700 10 log_client will send 2019-07-31 15:32:55.710898 mon.km-fsn-1-dc4-m1-797678 (mon.0) 3 : cluster [DBG] monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
-1> 2019-07-31 15:32:55.707 7f39c10be700 5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 38758417..38759000) is_readable = 1 - now=2019-07-31 15:32:55.710975 lease_expire=2019-07-31 15:33:00.707365 has v0 lc 38759000
0> 2019-07-31 15:32:55.711 7f39c10be700 -1 *** Caught signal (Aborted) **
in thread 7f39c10be700 thread_name:ms_dispatch

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
 1: (()+0x11390) [0x7f39cd0b2390]
 2: (gsignal()+0x38) [0x7f39cc7ff428]
 3: (abort()+0x16a) [0x7f39cc80102a]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f39ce884155]
 5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f39ce878136]
 6: (()+0x8ad181) [0x7f39ce878181]
 7: (()+0x8b9394) [0x7f39ce884394]
 8: (std::__throw_out_of_range(char const*)+0x3f) [0x7f39ce88fabf]
 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x86db50]
 10: (MDSMonitor::tick()+0xc9) [0x86f3b9]
 11: (MDSMonitor::on_active()+0x28) [0x858a88]
 12: (PaxosService::_active()+0xdd) [0x7ac0dd]
 13: (Context::complete(int)+0x9) [0x6af0b9]
 14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6d7e68]
 15: (Paxos::finish_round()+0x76) [0x7a2826]
 16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x7a3a2f]
 17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x7a44db]
 18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x6a9025]
 19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x6a9672]
 20: (Monitor::ms_dispatch(Message*)+0x26) [0x6d9506]
 21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6d5576]
 22: (DispatchQueue::entry()+0x1219) [0x7f39ce4cc2f9]
 23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f39ce57cb9d]
 24: (()+0x76ba) [0x7f39cd0a86ba]
 25: (clone()+0x6d) [0x7f39cc8d141d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.km-fsn-1-dc4-m1-797678.log
--- end dump of recent events ---


Updated by Anonymous over 4 years ago

2019-07-31 15:31:41.535 7f752359f700 1 mon.km-fsn-1-dc4-m1-797678@0(probing) e4 handle_auth_request failed to assign global_id
2019-07-31 15:31:41.747 7f7525da4700 0 log_channel(cluster) log [INF] : mon.km-fsn-1-dc4-m1-797678 calling monitor election
2019-07-31 15:31:41.747 7f7525da4700 1 mon.km-fsn-1-dc4-m1-797678@0(electing).elector(924) init, last seen epoch 924
2019-07-31 15:31:41.791 7f7525da4700 -1 mon.km-fsn-1-dc4-m1-797678@0(electing) e4 failed to get devid for : fallback method has serial ''but no model
2019-07-31 15:31:43.139 7f7522d9e700 1 mon.km-fsn-1-dc4-m1-797678@0(electing) e4 handle_auth_request failed to assign global_id
2019-07-31 15:31:46.343 7f7523da0700 1 mon.km-fsn-1-dc4-m1-797678@0(electing) e4 handle_auth_request failed to assign global_id
2019-07-31 15:31:46.815 7f75285a9700 0 log_channel(cluster) log [INF] : mon.km-fsn-1-dc4-m1-797678 is new leader, mons km-fsn-1-dc4-m1-797678,km-fsn-1-dc4-m1-797679 in quorum (ranks 0,1)
2019-07-31 15:31:46.847 7f7525da4700 0 log_channel(cluster) log [DBG] : monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
2019-07-31 15:31:46.851 7f7525da4700 -1 *** Caught signal (Aborted) **
in thread 7f7525da4700 thread_name:ms_dispatch


Updated by Anonymous over 4 years ago

After attempting to inject a single-mon monmap:

   -18> 2019-07-31 16:11:24.530 7fbb0c3b9340  2 auth: KeyRing::load: loaded key file /var/lib/ceph/mon/ceph-km-fsn-1-dc4-m1-797678/keyring
   -17> 2019-07-31 16:11:24.530 7fbb0c3b9340  2 mon.km-fsn-1-dc4-m1-797678@-1(???) e5 init
   -16> 2019-07-31 16:11:24.530 7fbb0c3b9340  4 mgrc handle_mgr_map Got map version 275
   -15> 2019-07-31 16:11:24.530 7fbb0c3b9340  4 mgrc handle_mgr_map Active mgr is now [v2:10.3.0.1:6801/1321146,v1:10.3.0.1:6802/1321146]
   -14> 2019-07-31 16:11:24.530 7fbb0c3b9340  4 mgrc reconnect Starting new session with [v2:10.3.0.1:6801/1321146,v1:10.3.0.1:6802/1321146]
   -13> 2019-07-31 16:11:24.534 7fbb0c3b9340  0 mon.km-fsn-1-dc4-m1-797678@-1(probing) e5  my rank is now 0 (was -1)
   -12> 2019-07-31 16:11:24.534 7fbb0c3b9340  1 mon.km-fsn-1-dc4-m1-797678@0(probing) e5 win_standalone_election
   -11> 2019-07-31 16:11:24.534 7fbb0c3b9340  1 mon.km-fsn-1-dc4-m1-797678@0(probing).elector(981) init, last seen epoch 981, mid-election, bumping
   -10> 2019-07-31 16:11:24.542 7fbb0c3b9340 -1 mon.km-fsn-1-dc4-m1-797678@0(electing) e5 failed to get devid for : fallback method has serial ''but no model
    -9> 2019-07-31 16:11:24.542 7fbb0c3b9340  0 log_channel(cluster) log [INF] : mon.km-fsn-1-dc4-m1-797678 is new leader, mons km-fsn-1-dc4-m1-797678 in quorum (ranks 0)
    -8> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client _send_to_mon log to self
    -7> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client  log_queue is 1 last_log 1 sent 0 num 1 unsent 1 sending 1
    -6> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client  will send 2019-07-31 16:11:24.548759 mon.km-fsn-1-dc4-m1-797678 (mon.0) 1 : cluster [INF] mon.km-fsn-1-dc4-m1-797678 is new leader, mons km-fsn-1-dc4-m1-797678 in quorum (ranks 0)
    -5> 2019-07-31 16:11:24.542 7fbb0c3b9340  0 log_channel(cluster) log [DBG] : monmap e5: 1 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0]}
    -4> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client _send_to_mon log to self
    -3> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client  log_queue is 2 last_log 2 sent 1 num 2 unsent 1 sending 1
    -2> 2019-07-31 16:11:24.542 7fbb0c3b9340 10 log_client  will send 2019-07-31 16:11:24.548795 mon.km-fsn-1-dc4-m1-797678 (mon.0) 2 : cluster [DBG] monmap e5: 1 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0]}
    -1> 2019-07-31 16:11:24.542 7fbb0c3b9340  5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 38758417..38759000) is_readable = 1 - now=2019-07-31 16:11:24.548821 lease_expire=0.000000 has v0 lc 38759000
     0> 2019-07-31 16:11:24.546 7fbb0c3b9340 -1 *** Caught signal (Aborted) **
 in thread 7fbb0c3b9340 thread_name:ceph-mon

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
 1: (()+0x11390) [0x7fbb022e8390]
 2: (gsignal()+0x38) [0x7fbb01a35428]
 3: (abort()+0x16a) [0x7fbb01a3702a]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fbb03aba155]
 5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fbb03aae136]
 6: (()+0x8ad181) [0x7fbb03aae181]
 7: (()+0x8b9394) [0x7fbb03aba394]
 8: (std::__throw_out_of_range(char const*)+0x3f) [0x7fbb03ac5abf]
 9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x86db50]
 10: (MDSMonitor::tick()+0xc9) [0x86f3b9]
 11: (MDSMonitor::on_active()+0x28) [0x858a88]
 12: (PaxosService::_active()+0xdd) [0x7ac0dd]
 13: (PaxosService::election_finished()+0x4e) [0x7ac88e]
 14: (Monitor::_finish_svc_election()+0x50) [0x66c080]
 15: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, mon_feature_t const&, int, std::map<int, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::less<int>, std::allocator<std::pair<int const, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > > > > const&)+0x2a1) [0x68e2e1]
 16: (Monitor::win_standalone_election()+0x21d) [0x68ed4d]
 17: (Monitor::bootstrap()+0x5c5) [0x68f425]
 18: (Monitor::init()+0x220) [0x68fe90]
 19: (main()+0x27ad) [0x5792ad]
 20: (__libc_start_main()+0xf0) [0x7fbb01a20830]
 21: (_start()+0x29) [0x65d759]


Updated by Anonymous over 4 years ago

The cause seems to be MDS map related. I remember decreasing the MDS ranks to 1, and then the cluster died.
The out-of-range error thrown in MDSMonitor::maybe_resize_cluster probably points to a database inconsistency between active MDS ranks 2 and 1, which causes the mon to crash.
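
For anyone reading the backtraces: frame 8 is std::__throw_out_of_range, called from MDSMonitor::maybe_resize_cluster, and nothing catches the exception, so the runtime goes through terminate and abort. Below is a minimal sketch of that general failure mode in plain C++. It is not the Ceph code: it assumes (unconfirmed) that a rank lookup uses a checked accessor such as std::map::at(), and RankTable, daemon_for_rank_unchecked and daemon_for_rank are hypothetical names made up for illustration.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for per-rank MDS state; NOT Ceph's actual FSMap layout.
using RankTable = std::map<int, std::string>;

// Unchecked caller: std::map::at() throws std::out_of_range when `rank` is
// absent. If no caller catches it, std::terminate() runs and the process
// aborts with SIGABRT, which matches the shape of the backtraces above.
inline const std::string& daemon_for_rank_unchecked(const RankTable& up, int rank) {
    return up.at(rank);
}

// Defensive variant: report a missing rank to the caller instead of throwing.
inline bool daemon_for_rank(const RankTable& up, int rank, std::string* out) {
    auto it = up.find(rank);
    if (it == up.end()) {
        return false;  // rank not present; caller decides what to do
    }
    *out = it->second;
    return true;
}
```

With a table that only holds rank 0, daemon_for_rank(up, 1, &name) returns false, while daemon_for_rank_unchecked(up, 1) throws. If the mon hits something like the second case on every tick, that would explain why it aborts again right after each restart.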


Updated by Anonymous over 4 years ago

Temporary fix: reduce the monmap to 1 mon and start a single-mon cluster.

