Bug #41025
open
2/3 mon process crash - complete cluster failure
Description
We upgraded to Nautilus 14.2.2 (Ubuntu 18.04) a few days ago and today had some issues with MDS (I'll open another report for those).
First our MDS processes died and were unable to come back up, and while we were trying to fix the MDS (decreasing ranks etc.), 2/3 mon processes died as well.
Logs show:
-2> 2019-07-31 15:01:31.581 7f12a96fb700 10 log_client will send 2019-07-31 15:01:31.582687 mon.km-fsn-1-dc4-m1-797679 (mon.1) 4 : cluster [DBG] monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
-1> 2019-07-31 15:01:31.581 7f12a96fb700 5 mon.km-fsn-1-dc4-m1-797679@1(leader).paxos(paxos active c 38758417..38759000) is_readable = 1 - now=2019-07-31 15:01:31.582708 lease_expire=2019-07-31 15:01:36.581412 has v0 lc 38759000
0> 2019-07-31 15:01:31.585 7f12a96fb700 -1 *** Caught signal (Aborted) **
in thread 7f12a96fb700 thread_name:ms_dispatch
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (()+0x11390) [0x7f12b56ef390]
2: (gsignal()+0x38) [0x7f12b4e3c428]
3: (abort()+0x16a) [0x7f12b4e3e02a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f12b6ec1155]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f12b6eb5136]
6: (()+0x8ad181) [0x7f12b6eb5181]
7: (()+0x8b9394) [0x7f12b6ec1394]
8: (std::__throw_out_of_range(char const*)+0x3f) [0x7f12b6eccabf]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x86db50]
10: (MDSMonitor::tick()+0xc9) [0x86f3b9]
11: (MDSMonitor::on_active()+0x28) [0x858a88]
12: (PaxosService::_active()+0xdd) [0x7ac0dd]
13: (Context::complete(int)+0x9) [0x6af0b9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6d7e68]
15: (Paxos::finish_round()+0x76) [0x7a2826]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x7a3a2f]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x7a44db]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x6a9025]
19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x6a9672]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x6d9506]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6d5576]
22: (DispatchQueue::entry()+0x1219) [0x7f12b6b092f9]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f12b6bb9b9d]
24: (()+0x76ba) [0x7f12b56e56ba]
25: (clone()+0x6d) [0x7f12b4f0e41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.km-fsn-1-dc4-m1-797679.log
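
Frames 4 to 9 of the backtrace show std::__throw_out_of_range being reached from MDSMonitor::maybe_resize_cluster and ending in abort(), which is what an uncaught std::out_of_range looks like. One common way to hit that exception in C++ is a bounds-checked lookup such as std::map::at() on a key that is not present; whether that is the lookup involved here is an assumption, the log does not confirm it. A minimal sketch of that failure mode, where RankTable and lookup_throws are hypothetical names for illustration and not Ceph code:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a rank -> daemon name table.
// This is NOT Ceph's real data structure, only an illustration.
using RankTable = std::map<int, std::string>;

// Returns true if a bounds-checked lookup of `rank` throws
// std::out_of_range, i.e. the key is missing from the table.
bool lookup_throws(const RankTable& table, int rank) {
  try {
    (void)table.at(rank);  // at() throws std::out_of_range on a missing key
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

In this sketch the exception is caught so the result can be inspected. If it were left uncaught, as the trace suggests happened in the mon, the C++ runtime would call std::terminate, which by default aborts the process.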