Bug #41025

open

2/3 mon process crash - complete cluster failure

Added by Anonymous almost 5 years ago. Updated almost 5 years ago.

Status: New
Priority: Normal
Assignee: -
Category: Monitor
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We upgraded to Nautilus 14.2.2 (Ubuntu 18.04) a few days ago, and today we had some issues with MDS (I'll open a separate report for that).
First our MDS processes died and were unable to come back up. While we were trying to fix MDS (decreasing ranks etc.), 2 of our 3 mon processes died as well.

Logs show:

-2> 2019-07-31 15:01:31.581 7f12a96fb700 10 log_client will send 2019-07-31 15:01:31.582687 mon.km-fsn-1-dc4-m1-797679 (mon.1) 4 : cluster [DBG] monmap e4: 3 mons at {km-fsn-1-dc4-m1-797678=[v2:10.3.0.1:3300/0,v1:10.3.0.1:6789/0],km-fsn-1-dc4-m1-797679=[v2:10.3.0.2:3300/0,v1:10.3.0.2:6789/0],km-fsn-1-dc4-m1-797680=[v2:10.3.0.3:3300/0,v1:10.3.0.3:6789/0]}
-1> 2019-07-31 15:01:31.581 7f12a96fb700 5 mon.km-fsn-1-dc4-m1-797679@1(leader).paxos(paxos active c 38758417..38759000) is_readable = 1 - now=2019-07-31 15:01:31.582708 lease_expire=2019-07-31 15:01:36.581412 has v0 lc 38759000
0> 2019-07-31 15:01:31.585 7f12a96fb700 -1 *** Caught signal (Aborted) **
in thread 7f12a96fb700 thread_name:ms_dispatch

ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (()+0x11390) [0x7f12b56ef390]
2: (gsignal()+0x38) [0x7f12b4e3c428]
3: (abort()+0x16a) [0x7f12b4e3e02a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f12b6ec1155]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f12b6eb5136]
6: (()+0x8ad181) [0x7f12b6eb5181]
7: (()+0x8b9394) [0x7f12b6ec1394]
8: (std::__throw_out_of_range(char const*)+0x3f) [0x7f12b6eccabf]
9: (MDSMonitor::maybe_resize_cluster(FSMap&, int)+0xcf0) [0x86db50]
10: (MDSMonitor::tick()+0xc9) [0x86f3b9]
11: (MDSMonitor::on_active()+0x28) [0x858a88]
12: (PaxosService::_active()+0xdd) [0x7ac0dd]
13: (Context::complete(int)+0x9) [0x6af0b9]
14: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa8) [0x6d7e68]
15: (Paxos::finish_round()+0x76) [0x7a2826]
16: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xbff) [0x7a3a2f]
17: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x24b) [0x7a44db]
18: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x6a9025]
19: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x6a9672]
20: (Monitor::ms_dispatch(Message*)+0x26) [0x6d9506]
21: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x6d5576]
22: (DispatchQueue::entry()+0x1219) [0x7f12b6b092f9]
23: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f12b6bb9b9d]
24: (()+0x76ba) [0x7f12b56e56ba]
25: (clone()+0x6d) [0x7f12b4f0e41d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
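
Frames 8 down to 3 of the trace above (std::__throw_out_of_range called from MDSMonitor::maybe_resize_cluster, then the terminate handler, then abort()) are the usual signature of a std::out_of_range exception that nobody catches. One common source of that exception is a checked container lookup such as at() on a key that is not present. The trace alone does not show which container or key was involved, so the sketch below is only an illustration of the mechanism. RankTable, lookup_throws and find_daemon are hypothetical names and are not taken from the Ceph source.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a "rank -> daemon name" table.
// This is NOT Ceph's FSMap or MDSMap, just an illustration.
using RankTable = std::map<int, std::string>;

// Returns true if a checked lookup of `rank` throws std::out_of_range.
// Without the try/catch the exception would propagate, reach
// std::terminate, and the process would abort as in the trace above.
bool lookup_throws(const RankTable& table, int rank) {
    try {
        (void)table.at(rank);  // at() throws when the key is missing
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}

// Non-throwing alternative: find() reports a missing key through the
// end iterator, so the caller can handle it without an exception.
const std::string* find_daemon(const RankTable& table, int rank) {
    auto it = table.find(rank);
    return it == table.end() ? nullptr : &it->second;
}
```

With a table that holds only rank 0, lookup_throws reports an exception for rank 1 and find_daemon returns a null pointer for it, while both succeed for rank 0.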

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mon.km-fsn-1-dc4-m1-797679.log


Files

cephfail.log (157 KB), added by Anonymous, 07/31/2019 01:29 PM