Support #16043
MDS is crashed
Status: Closed
Description
I updated Ceph from Hammer to Jewel. After restarting the Ceph daemons, 2 of the 5 MDS daemons did not start.
I executed these commands on all nodes:
chown -R ceph:ceph /var/lib/ceph /var/log/ceph
systemctl restart ceph
This did not help.
I ran ceph-mds with debug options and got this message:
-2> 2016-05-26 14:48:16.493179 7fc2ea13d700 10 mds.2.cache |__ 2 auth [dir 10000040ef4 ~mds1/stray4/10000040ef4/ [2,head] auth v=43969 cv=0/0 dir_auth=2 state=1073741824 f(v15 m2015-05-05 17:18:02.723429) n(v4595 rc2015-05-05 17:18:02.723429) hs=0+0,ss=0+0 | subtree=1 0x555ef41a2900]
-1> 2016-05-26 14:48:16.494021 7fc2ea13d700 10 mds.2.journal EImportStart.replay sessionmap 19084 < 19088
 0> 2016-05-26 14:48:16.499302 7fc2ea13d700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7fc2ea13d700 time 2016-05-26 14:48:16.494038
mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
See attached log for more debug messages.
Files
Updated by Andrey Matyashov almost 8 years ago
- File mds.virt-master.log mds.virt-master.log added
Today all MDS daemons in my cluster died.
Updated by Greg Farnum almost 8 years ago
- Status changed from New to Need More Info
This probably isn't an issue any more, but if it is, upgrade to 10.2.2 and report back if it's still a problem.
Updated by Andrey Matyashov almost 8 years ago
I upgraded my cluster to 10.2.2; the situation has not changed.
Updated by Andrey Matyashov almost 8 years ago
-3> 2016-06-16 16:52:51.903066 7fe964937700 1 -- 10.100.23.2:6812/29528 <== osd.0 10.100.23.2:6808/2753 2 ==== osd_op_reply(18 202.000001e4 [read 0~4194304 [fadvise_dontneed]] v0'0 uv101572 ondisk = 0) v7 ==== 132+0+4194304 (2530244843 0 4128880160) 0x559fdc863340 con 0x559fdc724400
-2> 2016-06-16 16:52:51.978692 7fe965940700 1 -- 10.100.23.2:6812/29528 <== osd.10 10.100.23.8:6802/3440 3 ==== osd_op_reply(20 202.000001e6 [read 0~4194304 [fadvise_dontneed]] v0'0 uv67580 ondisk = 0) v7 ==== 132+0+4194304 (3588191005 0 1415447401) 0x559fdc7b5180 con 0x559fdc723980
-1> 2016-06-16 16:52:52.016872 7fe96563d700 -1 log_channel(cluster) log [ERR] : replayed stray Session close event for client.16675885 10.100.23.8:0/38038 from time 2016-04-22 13:21:53.575796, ignoring
 0> 2016-06-16 16:52:52.017777 7fe96563d700 -1 mds/journal.cc: In function 'virtual void ESession::replay(MDSRank*)' thread 7fe96563d700 time 2016-06-16 16:52:52.016885
mds/journal.cc: 1705: FAILED assert(mds->sessionmap.get_version() == cmapv)
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x559fd2066452]
 2: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 3: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 4: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 5: (()+0x80a4) [0x7fe9727610a4]
 6: (clone()+0x6d) [0x7fe970ca987d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 5/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 journaler 0/ 5 objectcacher 0/ 5 client 0/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 5/ 5 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/10 civetweb 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 xio 1/ 5 compressor 1/ 5 newstore 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 kinetic 1/ 5 fuse -2/-2 (syslog threshold) 99/99 (stderr threshold) max_recent 10000 max_new 1000 log_file --- end dump of recent events ---
*** Caught signal (Aborted) **
 in thread 7fe96563d700 thread_name:md_log_replay
 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x4f8b87) [0x559fd1f61b87]
 2: (()+0xf8d0) [0x7fe9727688d0]
 3: (gsignal()+0x37) [0x7fe970bf6067]
 4: (abort()+0x148) [0x7fe970bf7448]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x559fd2066626]
 6: (ESession::replay(MDSRank*)+0x1ec) [0x559fd1f37dbc]
 7: (MDLog::_replay_thread()+0x4f4) [0x559fd1ecf974]
 8: (MDLog::ReplayThread::entry()+0xd) [0x559fd1c924ed]
 9: (()+0x80a4) [0x7fe9727610a4]
 10: (clone()+0x6d) [0x7fe970ca987d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
+++ killed by SIGABRT +++
Updated by Greg Farnum almost 8 years ago
Please set "debug mds = 20" and "debug mds log = 20" in your ceph.conf, turn it on, and then upload the mds log file using ceph-post-file.
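For reference, those two settings belong in the MDS section of ceph.conf on the MDS host (the section placement shown here is an assumption; the two debug keys are the ones named above):

```ini
[mds]
    ; verbose MDS and MDS-journal debugging, as requested above;
    ; restart the MDS daemon after changing these
    debug mds = 20
    debug mds log = 20
```

These levels are very chatty, so remember to revert them once the log has been captured.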
Updated by Andrey Matyashov almost 8 years ago
Greg, I sent you an email with a link to my debug log; the ceph-post-file service had become unstable.
Today I successfully uploaded my log file via ceph-post-file.
File name: ceph-mds.virt-master.log.bz2
Description: Debug log for BUG#16043
Upload tag: d5bd960a-a176-407e-b09f-6374c3e3cc4b
Thanks!
Updated by Greg Farnum almost 8 years ago
Yep. So looking through the log, I now see
mds.2.journal ESession.replay sessionmap 0 < 18884 close client.16675885 10.100.23.8:0/38038
Did you try and reset your SessionMap or something?
Also, mds.2... you appear to be running multiple active MDSes? Which makes sense, since I just noticed your initial error was in EImportStart (i.e., a cross-MDS metadata migration).
At this point your best bet is probably to flush out all the journals using cephfs-journal-tool, reduce down to a single MDS using the debug tools, reset the sessionmap, and start again in a more stable single-MDS configuration.
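As a rough sketch, that sequence might look like the following on a Jewel cluster. The filesystem name (`cephfs`) and the rank range are assumptions, and the exact subcommand forms should be checked against `cephfs-journal-tool --help` and `cephfs-table-tool --help` before running anything against a real cluster:

```shell
# Take a backup of the journal first, so nothing is lost irreversibly
cephfs-journal-tool journal export backup.bin

# Reduce to a single active MDS (filesystem name "cephfs" is assumed)
ceph fs set cephfs max_mds 1

# Reset the journal for each active rank (ranks 0-3 assumed here)
for rank in 0 1 2 3; do
    cephfs-journal-tool --rank=$rank journal reset
done

# Reset the session table, then restart the MDS daemons
cephfs-table-tool all reset session
systemctl restart ceph-mds.target
```

Note that `journal reset` discards unflushed metadata updates, which is why the export comes first.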
Updated by Andrey Matyashov almost 8 years ago
Yes, I tried resetting the journal and sessions.
I ran:
cephfs-journal-tool journal reset --force
cephfs-table-tool all reset session
And tried to start the MDS, but the message
-2> 2016-06-21 08:55:04.870978 7fb8a6b8b700 10 mds.2.journal ESession.replay sessionmap 0 < 18884 close client.16675885 10.100.23.8:0/38038
is still there.
Maybe I do something wrong?
Thanks!
Updated by Andrey Matyashov almost 8 years ago
I executed
cephfs-journal-tool --rank=0 journal reset
cephfs-journal-tool --rank=1 journal reset
cephfs-journal-tool --rank=2 journal reset
cephfs-journal-tool --rank=3 journal reset
and the MDS started successfully!
root@virt-master:~# ceph -s
    cluster f53d4a19-b2c0-4a92-9620-bc6e3bfc27d6
     health HEALTH_WARN
            mds cluster is degraded
     monmap e17: 5 mons at {virt-master=10.100.23.2:6789/0,virt-node-02=10.100.23.3:6789/0,virt-node-03=10.100.23.4:6789/0,virt-node-05=10.100.23.7:6789/0,virt-node-06=10.100.23.8:6789/0}
            election epoch 3936, quorum 0,1,2,3,4 virt-master,virt-node-02,virt-node-03,virt-node-05,virt-node-06
      fsmap e143745: 1/4/1 up {2=virt-master=up:resolve}, 4 up:standby
     osdmap e141148: 15 osds: 15 up, 15 in
      pgmap v44800918: 712 pgs, 12 pools, 7187 GB data, 1803 kobjects
            14390 GB used, 13465 GB / 27856 GB avail
                 712 active+clean
Thanks for your help!
Updated by Greg Farnum almost 8 years ago
- Tracker changed from Bug to Support
- Status changed from Need More Info to Closed
Updated by John Spray over 7 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.