Bug #41026

MDS process crashes on 14.2.2

Added by super xor 20 days ago. Updated 20 days ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
Start date:
07/31/2019
Due date:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:

Description

The MDS processes on Ubuntu 18.04 running Nautilus 14.2.2 are crashing and unable to recover.

-7> 2019-07-31 13:29:46.888 7fb36a61a700 -1 --2 [v2:10.3.0.1:6800/2730552661,v1:10.3.0.1:6803/2730552661] >> [v2:10.3.0.242:$
-6> 2019-07-31 13:29:46.888 7fb367465700 1 mds.1.objecter ms_handle_reset 0x4b4df80 session 0xd91c840 osd.179
-5> 2019-07-31 13:29:46.888 7fb36a61a700 10 monclient: get_auth_request con 0x5666f600 auth_method 0
-4> 2019-07-31 13:29:46.888 7fb36ae1b700 -1 --2 [v2:10.3.0.1:6800/2730552661,v1:10.3.0.1:6803/2730552661] >> [v2:10.3.0.242:$
-3> 2019-07-31 13:29:46.888 7fb367465700 1 mds.1.objecter ms_handle_reset 0x5666e880 session 0xd91cf20 osd.0
-2> 2019-07-31 13:29:46.888 7fb367465700 4 mds.1.server handle_client_request client_request(client.25568318:505 lookup #0x2$
-1> 2019-07-31 13:29:46.888 7fb36ae1b700 10 monclient: get_auth_request con 0x5666fa80 auth_method 0
0> 2019-07-31 13:29:46.888 7fb36b61c700 -1 *** Caught signal (Aborted) **
in thread 7fb36b61c700 thread_name:msgr-worker-0
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (()+0x11390) [0x7fb36f571390]
2: (gsignal()+0x38) [0x7fb36ecbe428]
3: (abort()+0x16a) [0x7fb36ecc002a]
4: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7fb3702a7155]
5: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7fb37029b136]
6: (()+0x8ad181) [0x7fb37029b181]
7: (()+0x91568e) [0x7fb37030368e]
8: (()+0x76ba) [0x7fb36f5676ba]
9: (clone()+0x6d) [0x7fb36ed9041d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by super xor 20 days ago

After trying to fix the server, it is now running a single-MDS setup:
   -18> 2019-07-31 17:59:21.339 7f11df8bf700  4 mds.0.purge_queue operator(): open complete
   -17> 2019-07-31 17:59:21.339 7f11df8bf700  1 mds.0.journaler.pq(ro) set_writeable
   -16> 2019-07-31 17:59:21.343 7f11e8a80700 10 monclient: get_auth_request con 0x387a400 auth_method 0
   -15> 2019-07-31 17:59:21.343 7f11de8bd700  1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 27370815750144, expire 27370817825265, write 27370928226441, stream_format 1).  probing for end of log (from 27370928226441)...
   -14> 2019-07-31 17:59:21.343 7f11de8bd700  1 mds.0.journaler.mdlog(ro) probing for end of the log
   -13> 2019-07-31 17:59:21.427 7f11de8bd700  1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 27370928240437 (header had 27370928226441). recovered.
   -12> 2019-07-31 17:59:21.427 7f11de0bc700  4 mds.0.log Journal 0x200 recovered.
   -11> 2019-07-31 17:59:21.427 7f11de0bc700  4 mds.0.log Recovered journal 0x200 in format 1
   -10> 2019-07-31 17:59:21.427 7f11de0bc700  2 mds.0.78866 Booting: 1: loading/discovering base inodes
    -9> 2019-07-31 17:59:21.427 7f11de0bc700  0 mds.0.cache creating system inode with ino:0x100
    -8> 2019-07-31 17:59:21.427 7f11de0bc700  0 mds.0.cache creating system inode with ino:0x1
    -7> 2019-07-31 17:59:21.427 7f11de8bd700  2 mds.0.78866 Booting: 2: replaying mds log
    -6> 2019-07-31 17:59:21.427 7f11de8bd700  2 mds.0.78866 Booting: 2: waiting for purge queue recovered
    -5> 2019-07-31 17:59:21.479 7f11e48c9700  4 mds.0.78866 handle_osd_map epoch 208882, 0 new blacklist entries
    -4> 2019-07-31 17:59:21.479 7f11e48c9700 10 monclient: _renew_subs
    -3> 2019-07-31 17:59:21.479 7f11e48c9700 10 monclient: _send_mon_message to mon.km-fsn-1-dc4-m1-797678 at v2:10.3.0.1:3300/0
    -2> 2019-07-31 17:59:21.663 7f11dd0ba700 -1 log_channel(cluster) log [ERR] : ESession.replay sessionmap v 825175264 - 1 > table 0
    -1> 2019-07-31 17:59:21.663 7f11dd0ba700 -1 /build/ceph-14.2.2/src/mds/journal.cc: In function 'virtual void ESession::replay(MDSRank*)' thread 7f11dd0ba700 time 2019-07-31 17:59:21.666728
/build/ceph-14.2.2/src/mds/journal.cc: 1655: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f11ed133bb2]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f11ed133d8d]
 3: (ESession::replay(MDSRank*)+0xfa0) [0x809030]
 4: (MDLog::_replay_thread()+0x892) [0x7a7432]
 5: (MDLog::ReplayThread::entry()+0xd) [0x50ab6d]
 6: (()+0x76ba) [0x7f11ec9cb6ba]
 7: (clone()+0x6d) [0x7f11ec1f441d]

     0> 2019-07-31 17:59:21.663 7f11dd0ba700 -1 *** Caught signal (Aborted) **
 in thread 7f11dd0ba700 thread_name:md_log_replay

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
 1: (()+0x11390) [0x7f11ec9d5390]
 2: (gsignal()+0x38) [0x7f11ec122428]
 3: (abort()+0x16a) [0x7f11ec12402a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a3) [0x7f11ed133c03]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f11ed133d8d]
 6: (ESession::replay(MDSRank*)+0xfa0) [0x809030]
 7: (MDLog::_replay_thread()+0x892) [0x7a7432]
 8: (MDLog::ReplayThread::entry()+0xd) [0x50ab6d]
 9: (()+0x76ba) [0x7f11ec9cb6ba]
 10: (clone()+0x6d) [0x7f11ec1f441d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.km-fsn-1-dc4-m1-797678.log
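
The decisive failure in the log above is `FAILED ceph_assert(g_conf()->mds_wipe_sessions)` raised from `ESession::replay` at journal.cc:1655, immediately after `ESession.replay sessionmap v 825175264 - 1 > table 0`: during journal replay, an ESession event's sessionmap version must follow on from the version loaded from the on-disk session table, and when it runs ahead the MDS aborts unless the operator has set `mds_wipe_sessions`. A minimal sketch of that invariant (hypothetical Python for illustration, not Ceph source; the function name `esession_replay_ok` is invented):

```python
def esession_replay_ok(event_cmapv: int, table_version: int,
                       mds_wipe_sessions: bool = False) -> bool:
    """Sketch of the ESession replay version check implied by the log.

    Returns True if replay may proceed normally. If the event's sessionmap
    version (cmapv) has run ahead of the session table, raise AssertionError
    unless mds_wipe_sessions is set, mirroring
    FAILED ceph_assert(g_conf()->mds_wipe_sessions).
    """
    if event_cmapv - 1 > table_version:
        # corresponds to the log line:
        # ESession.replay sessionmap v <cmapv> - 1 > table <table_version>
        assert mds_wipe_sessions, (
            f"ESession.replay sessionmap v {event_cmapv} - 1 "
            f"> table {table_version}"
        )
        return False  # operator opted to wipe sessions; event not applied
    return True

# Values from this crash: cmapv 825175264 against an empty (v0) session
# table, which trips the assert with the default configuration.
```

This is only a model of the condition, not a recommendation to set `mds_wipe_sessions`; that option discards client session state and should only be used following the CephFS disaster-recovery documentation.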

#2 Updated by Kefu Chai 20 days ago

  • Project changed from Ceph to fs

#3 Updated by Patrick Donnelly 20 days ago

  • Status changed from New to Rejected

Please seek help on ceph-users. Provide more information about your cluster and how the error came about.
