Support #16043

MDS crashed

Added by Andrey Matyashov almost 8 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

I updated Ceph from Hammer to Jewel. After restarting the Ceph daemons, 2 MDS daemons (out of 5) did not start.

I executed these commands on all nodes:
chown -R ceph:ceph /var/lib/ceph /var/log/ceph
systemctl restart ceph

This did not help.

I ran ceph-mds with debug options and got this message:

-2> 2016-05-26 14:48:16.493179 7fc2ea13d700 10 mds.2.cache   |__ 2    auth [dir 10000040ef4 ~mds1/stray4/10000040ef4/ [2,head] auth v=43969 cv=0/0 dir_auth=2 state=1073741824 f(v15 m2015-05-05 17:18:02.723429) n(v4595 rc2015-05-05 17:18:02.723429) hs=0+0,ss=0+0 | subtree=1 0x555ef41a2900]
-1> 2016-05-26 14:48:16.494021 7fc2ea13d700 10 mds.2.journal EImportStart.replay sessionmap 19084 < 19088
0> 2016-05-26 14:48:16.499302 7fc2ea13d700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fc2ea13d700 time 2016-05-26 14:48:16.494038
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

See attached log for more debug messages.
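
For reference, a sketch of one way to run an MDS in the foreground with extra debug output; the daemon id virt-node-03 is taken from the attached log and the output path is purely illustrative:

ceph-mds -f -i virt-node-03 --debug_mds 20 --debug_journaler 10 2>&1 | tee /tmp/mds-debug.log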


Files

mds.virt-node-03.log (453 KB) mds.virt-node-03.log Andrey Matyashov, 05/26/2016 12:56 PM
mds.virt-master.log (72.3 KB) mds.virt-master.log Andrey Matyashov, 05/27/2016 09:43 AM
Actions #1

Updated by Andrey Matyashov almost 8 years ago

Today all the MDS daemons in my cluster died.

Actions #2

Updated by Greg Farnum almost 8 years ago

  • Status changed from New to Need More Info

This probably isn't an issue any more, but if it is, upgrade to 10.2.2 and report back if it's still happening.

Actions #3

Updated by Andrey Matyashov almost 8 years ago

I upgraded my cluster to 10.2.2; the situation has not changed.

Actions #4

Updated by Andrey Matyashov almost 8 years ago

-3> 2016-06-16 16:52:51.903066 7fe964937700  1 -- 10.100.23.2:6812/29528 <== osd.0 10.100.23.2:6808/2753 2 ==== osd_op_reply(18 202.000001e4 [read 0~4194304 [fadvise_dontneed]] v0'0 uv101572 ondisk = 0) v7 ==== 132+0+4194304 (2530244843 0 4128880160) 0x559fdc863340 con 0x559fdc724400
    -2> 2016-06-16 16:52:51.978692 7fe965940700  1 -- 10.100.23.2:6812/29528 <== osd.10 10.100.23.8:6802/3440 3 ==== osd_op_reply(20 202.000001e6 [read 0~4194304 [fadvise_dontneed]] v0'0 uv67580 ondisk = 0) v7 ==== 132+0+4194304 (3588191005 0 1415447401) 0x559fdc7b5180 con 0x559fdc723980
    -1> 2016-06-16 16:52:52.016872 7fe96563d700 -1 log_channel(cluster) log [ERR] : replayed stray Session close event for client.16675885 10.100.23.8:0/38038 from time 2016-04-22 13:21:53.575796, ignoring
     0> 2016-06-16 16:52:52.017777 7fe96563d700 -1 mds/journal.cc: In function 'virtual void ESession::replay(MDSRank*)' thread 7fe96563d700 time 2016-06-16 16:52:52.016885
mds/journal.cc: 1705: FAILED assert(mds->sessionmap.get_version() == cmapv)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x559fd2066452]
 2: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 3: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 4: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 5: (()+0x80a4) [0x7fe9727610a4]
 6: (clone()+0x6d) [0x7fe970ca987d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   5/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   5/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file 
--- end dump of recent events ---
*** Caught signal (Aborted) **
 in thread 7fe96563d700 thread_name:md_log_replay
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x4f8b87) [0x559fd1f61b87]
 2: (()+0xf8d0) [0x7fe9727688d0]
 3: (gsignal()+0x37) [0x7fe970bf6067]
 4: (abort()+0x148) [0x7fe970bf7448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x559fd2066626]
 6: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 7: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 8: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 9: (()+0x80a4) [0x7fe9727610a4]
 10: (clone()+0x6d) [0x7fe970ca987d]
2016-06-16 16:52:52.019817 7fe96563d700 -1 *** Caught signal (Aborted) **
 in thread 7fe96563d700 thread_name:md_log_replay

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x4f8b87) [0x559fd1f61b87]
 2: (()+0xf8d0) [0x7fe9727688d0]
 3: (gsignal()+0x37) [0x7fe970bf6067]
 4: (abort()+0x148) [0x7fe970bf7448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x559fd2066626]
 6: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 7: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 8: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 9: (()+0x80a4) [0x7fe9727610a4]
 10: (clone()+0x6d) [0x7fe970ca987d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-06-16 16:52:52.019817 7fe96563d700 -1 *** Caught signal (Aborted) **
 in thread 7fe96563d700 thread_name:md_log_replay

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x4f8b87) [0x559fd1f61b87]
 2: (()+0xf8d0) [0x7fe9727688d0]
 3: (gsignal()+0x37) [0x7fe970bf6067]
 4: (abort()+0x148) [0x7fe970bf7448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x559fd2066626]
 6: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 7: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 8: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 9: (()+0x80a4) [0x7fe9727610a4]
 10: (clone()+0x6d) [0x7fe970ca987d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   5/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   5/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  99/99 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file 
--- end dump of recent events ---
[????????????????] +++ killed by SIGABRT +++
Actions #5

Updated by Greg Farnum almost 8 years ago

Please set "debug mds = 20" and "debug mds log = 20" in your ceph.conf, turn it on, and then upload the mds log file using ceph-post-file.

Actions #6

Updated by Andrey Matyashov almost 8 years ago

Greg, I sent a message with a link to my debug log to your email, because the ceph-post-file service had become unstable.
Today I successfully uploaded my logfile to ceph-post-file.
File name: ceph-mds.virt-master.log.bz2
Description: Debug log for BUG#16043
Upload tag: d5bd960a-a176-407e-b09f-6374c3e3cc4b

Thanks!

Actions #7

Updated by Greg Farnum almost 8 years ago

Yep. So looking through the log, I now see

mds.2.journal ESession.replay sessionmap 0 < 18884 close client.16675885 10.100.23.8:0/38038

Did you try and reset your SessionMap or something?

Also, mds.2... you appear to be running with multiple active MDSes? Which makes sense, since I just noticed your initial error was on EImportStart (i.e., a cross-MDS metadata migration).

At this point your best bet is probably to flush out all the journals using cephfs-journal-tool, reduce down to a single MDS using the debug tools, reset the sessionmap, and start again in a more stable single-MDS configuration.
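
A hedged sketch of those steps for one rank, using the Jewel-era recovery tools; the backup filename, the filesystem name cephfs, and repeating the steps per rank are assumptions, and exporting the journal first is only a precaution:

cephfs-journal-tool --rank=0 journal export backup.rank0.bin
cephfs-journal-tool --rank=0 event recover_dentries summary
cephfs-journal-tool --rank=0 journal reset
cephfs-table-tool all reset session
ceph fs set cephfs max_mds 1

The same cephfs-journal-tool steps would be repeated for each active rank before settling on a single-MDS layout.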

Actions #8

Updated by Andrey Matyashov almost 8 years ago

Yes, I tried resetting the journal and sessions.

I ran:

cephfs-journal-tool journal reset --force
cephfs-table-tool all reset session

And tried to start the MDS. The message

-2> 2016-06-21 08:55:04.870978 7fb8a6b8b700 10 mds.2.journal ESession.replay sessionmap 0 < 18884 close client.16675885 10.100.23.8:0/38038

is still there.

Maybe I did something wrong?

Thanks!

Actions #9

Updated by Andrey Matyashov almost 8 years ago

I executed:

cephfs-journal-tool --rank=0 journal reset
cephfs-journal-tool --rank=1 journal reset
cephfs-journal-tool --rank=2 journal reset
cephfs-journal-tool --rank=3 journal reset

and the MDS started successfully!

root@virt-master:~# ceph -s
    cluster f53d4a19-b2c0-4a92-9620-bc6e3bfc27d6
     health HEALTH_WARN
            mds cluster is degraded
     monmap e17: 5 mons at {virt-master=10.100.23.2:6789/0,virt-node-02=10.100.23.3:6789/0,virt-node-03=10.100.23.4:6789/0,virt-node-05=10.100.23.7:6789/0,virt-node-06=10.100.23.8:6789/0}
            election epoch 3936, quorum 0,1,2,3,4 virt-master,virt-node-02,virt-node-03,virt-node-05,virt-node-06
      fsmap e143745: 1/4/1 up {2=virt-master=up:resolve}, 4 up:standby
     osdmap e141148: 15 osds: 15 up, 15 in
      pgmap v44800918: 712 pgs, 12 pools, 7187 GB data, 1803 kobjects
            14390 GB used, 13465 GB / 27856 GB avail
                 712 active+clean

Thanks for your help!

Actions #10

Updated by Greg Farnum almost 8 years ago

  • Tracker changed from Bug to Support
  • Status changed from Need More Info to Closed
Actions #11

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
