Bug #21337
luminous: MDS is not getting past up:replay on Luminous cluster
Description
On my Luminous 12.2.0 test cluster, after it has been running for the last few days, the MDS process is not getting past up:replay before the log shows:
2017-09-10 23:08:49.109923 7ffabcee8700 -1 *** Caught signal (Aborted) **
 in thread 7ffabcee8700 thread_name:md_log_replay
Then the process is restarted by systemd.
The ceph-mds process runs at ~120% CPU for several minutes and appears to be reading lots of data from the OSDs, as shown by network stats and client I/O stats; then the MDS process exits with the Caught signal (Aborted) message.
This was happening on all three of my MDS servers: as one would fail, the next would try to replay, and then it would fail too. To make debugging simpler, I have gone to a single MDS by disabling my other 2 MDS servers, set debug_mds = 9/20, and attached the log file.
From this log:
2017-09-11 02:01:55.730578 7f40e5611700  5 mds.ede-c2-mon03 handle_mds_map epoch 29027 from mon.1
2017-09-11 02:02:00.086795 7f40e5611700  5 mds.ede-c2-mon03 handle_mds_map epoch 29028 from mon.1
2017-09-11 02:02:04.889403 7f40dde02700  0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-11 02:02:04.909457 7f40dde02700 -1 /build/ceph-12.2.0/src/mds/journal.cc: In function 'virtual void EOpen::replay(MDSRank*)' thread 7f40dde02700 time 2017-09-11 02:02:04.889408
/build/ceph-12.2.0/src/mds/journal.cc: 2214: FAILED assert(in)
 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56184cb6d132]
 2: (EOpen::replay(MDSRank*)+0x776) [0x56184cb0d796]
 3: (MDLog::_replay_thread()+0xc6b) [0x56184cac016b]
 4: (MDLog::ReplayThread::entry()+0xd) [0x56184c84007d]
 5: (()+0x76ba) [0x7f40eb1016ba]
 6: (clone()+0x6d) [0x7f40ea16d82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This problem is very repeatable on the system right now. Using grep on the log file with default logging shows:
grep -C1 "Caught signal" ceph-mds.ede-c2-mon03.log
--- end dump of recent events ---
2017-09-10 20:47:35.512059 7f7701f53700 -1 *** Caught signal (Aborted) ** in thread 7f7701f53700 thread_name:md_log_replay
--
    -1> 2017-09-10 20:47:35.174509 7f770bf02700  1 -- 10.15.2.123:6800/4203097469 <== osd.0 10.15.2.132:6802/1957 940 ==== osd_op_reply(16961 200.0003150d [read 0~4194304 [fadvise_dontneed]] v0'0 uv50294 ondisk = 0) v8 ==== 156+0+4194304 (3089052591 0 939397831) 0x55e9be794380 con 0x55e9be95a000
     0> 2017-09-10 20:47:35.512059 7f7701f53700 -1 *** Caught signal (Aborted) ** in thread 7f7701f53700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:16:03.722304 7f6334254700 -1 *** Caught signal (Aborted) ** in thread 7f6334254700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:16:03.345563 7f633ea04700  1 -- 10.15.2.123:6800/3477605165 <== osd.2 10.15.2.131:6802/1842 757 ==== osd_op_reply(16962 200.0003150e [read 0~4194304 [fadvise_dontneed]] v0'0 uv50000 ondisk = 0) v8 ==== 156+0+4194304 (2529611885 0 3069046118) 0x55eabccebc00 con 0x55eabcd68000
     0> 2017-09-10 21:16:03.722304 7f6334254700 -1 *** Caught signal (Aborted) ** in thread 7f6334254700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:44:36.999431 7f9921cfe700 -1 *** Caught signal (Aborted) ** in thread 7f9921cfe700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:44:36.827626 7f992bcad700  5 -- 10.15.2.123:6800/2182054169 >> 10.15.2.123:6789/0 conn(0x559f1bcab000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=87 cs=1 l=1). rx mon.2 seq 786 0x559ff1d0f500 mdsbeacon(344336/ede-c2-mon03 up:replay seq 429 v25639) v7
     0> 2017-09-10 21:44:36.999431 7f9921cfe700 -1 *** Caught signal (Aborted) ** in thread 7f9921cfe700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:59:13.121796 7f2b6a7ee700 -1 *** Caught signal (Aborted) ** in thread 7f2b6a7ee700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:59:12.838076 7f2b71ffd700  1 -- 10.15.2.123:6800/1508149898 <== mon.2 10.15.2.123:6789/0 556 ==== mdsmap(e 25973) v1 ==== 721+0+0 (2764755316 0 0) 0x55f77ade9400 con 0x55f6a8342800
     0> 2017-09-10 21:59:13.121796 7f2b6a7ee700 -1 *** Caught signal (Aborted) ** in thread 7f2b6a7ee700 thread_name:md_log_replay
--
System information
OS: Ubuntu 16.04 with a 4.13.0 kernel

uname -a
Linux ede-c2-mon03 4.13.0-041300-generic #201709031731 SMP Sun Sep 3 21:33:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ceph -s
  cluster:
    id:     6d8aae1e-1125-11e5-a708-001b78e265be
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 MDSs behind on trimming
  services:
    mon: 3 daemons, quorum ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
    mgr: ede-c2-mon01(active), standbys: ede-c2-mon02, ede-c2-mon03
    mds: cephfs-1/1/1 up {0=ede-c2-mon03=up:replay}
    osd: 19 osds: 19 up, 19 in
  data:
    pools:   4 pools, 1792 pgs
    objects: 95597 objects, 261 GB
    usage:   799 GB used, 496 GB / 1296 GB avail
    pgs:     1792 active+clean
  io:
    client: 74267 kB/s rd, 18 op/s rd, 0 op/s wr

ceph -v
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

free
              total        used        free      shared  buff/cache   available
Mem:       16427028      722084      201700      118848    15503244    15226820
Swap:        999420           0      999420
NOTE: CephFS snapshots are turned on
History
#1 Updated by Zheng Yan over 6 years ago
Please set debug_mds=10, restart the MDS, and upload the full log. To recover the situation, just replace the 'assert(in)' with 'continue'.
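The suggested workaround can be sketched with a toy replay loop (the names below are illustrative, not the actual code of EOpen::replay in src/mds/journal.cc): instead of asserting that every open inode recorded in the event is present in the metablob, the missing entry is logged and skipped so replay can proceed.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a CInode looked up during journal replay.
struct Inode { std::string path; };

// Replays a list of open inodes against a metablob map. The original
// journal.cc code did `assert(in)` when a lookup failed, aborting the
// MDS; the workaround logs a warning and `continue`s instead.
// Returns the number of entries skipped.
int replay_opens(const std::vector<long>& open_inos,
                 const std::map<long, Inode>& metablob,
                 std::vector<long>* replayed) {
  int skipped = 0;
  for (long ino : open_inos) {
    auto it = metablob.find(ino);
    const Inode* in = (it == metablob.end()) ? nullptr : &it->second;
    if (!in) {
      // was: assert(in);  // FAILED assert(in) aborts the replay thread
      std::fprintf(stderr, "EOpen.replay ino 0x%lx not in metablob\n", ino);
      ++skipped;
      continue;  // workaround: skip this entry, keep replaying the journal
    }
    replayed->push_back(ino);
  }
  return skipped;
}
```

The trade-off is the one discussed below in this thread: skipped entries mean some replayed state may be incomplete, which is why the question of data loss comes up.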
#2 Updated by Eric Eastman over 6 years ago
- File ceph-mds.ede-c2-mon03.log-11Sep17.50K.bz2 added
The log file with debug_mds=10 from MDS startup to reaching the assert is 110GB. I am attaching the last 50K lines. If you want the whole file, let me know, and I will compress it and put it on a public ftp server.
#3 Updated by Eric Eastman over 6 years ago
Replacing the 'assert(in)' with 'continue' did get the Ceph file system working again. Looking at the log, it hit the new continue statement 430 times. Was there any data loss caused by replacing the 'assert(in)' with 'continue'? The errors looked like:
2017-09-12 12:25:06.736870 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-12 12:38:59.158432 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x100045cb414.170 not in metablob
#4 Updated by Zheng Yan over 6 years ago
Eric Eastman wrote:
Replacing the 'assert(in)' with 'continue' did get the Ceph file system working again. Looking at the log, it hit the new continue statement 430 times. Was there any data loss caused by replacing the 'assert(in)' with 'continue'? The errors looked like:
2017-09-12 12:25:06.736870 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-12 12:38:59.158432 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x100045cb414.170 not in metablob
The bug affects snapshotted metadata: the metadata of a snapshotted inode can be slightly outdated (it may not reflect the inode's metadata at the time the snapshot was created).
You mentioned 3 MDS daemons. Do you mean running 3 active MDSs? Snapshot support on a multi-MDS cluster is buggy in Luminous (see PR https://github.com/ceph/ceph/pull/16779).
#5 Updated by Eric Eastman over 6 years ago
One active, one standby-replay, one standby as shown:
mds: cephfs-1/1/1 up {0=ede-c2-mon02=up:active}, 1 up:standby-replay, 1 up:standby
As snapshots are more important to our workflow than either multiple MDSs or multiple file systems, we are running with 1 active MDS and only 1 file system. We are looking forward to the day when snapshots are fully supported.
#6 Updated by Zheng Yan over 6 years ago
Eric Eastman wrote:
The log file with debug_mds=10 from MDS startup to reaching the assert is 110GB. I am attaching the last 50K lines. If you want the whole file, let me know, and I will compress it and put it on a public ftp server.
How large is the log from 'replay start' to the first appearance of "EOpen.replay ino xxx not in metablob"? Could you upload that log?
#7 Updated by Eric Eastman over 6 years ago
The entire log, compressed with bzip2, came to 4.6G. You can download it from:
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/ceph-mds.ede-c2-mon03.log-11Sep17.bz2
#8 Updated by Patrick Donnelly over 6 years ago
- Assignee set to Zheng Yan
#9 Updated by Zheng Yan over 6 years ago
- Subject changed from MDS is not getting past up:replay on Luminous cluster to luminous: MDS is not getting past up:replay on Luminous cluster
- Status changed from New to Fix Under Review
- Backport set to luminous
- Affected Versions v12.0.0 added
https://github.com/ceph/ceph/pull/17994
The patch is only for Luminous. This bug has been fixed in master in another way (the change is much bigger).
#10 Updated by Zheng Yan over 6 years ago
- Release set to luminous
#11 Updated by Nathan Cutler over 6 years ago
- Tracker changed from Bug to Backport
- Status changed from Fix Under Review to Resolved
- Target version set to v12.2.2
#12 Updated by Nathan Cutler over 6 years ago
- Tracker changed from Backport to Bug
- % Done set to 0
- Backport set to luminous
- Regression set to No
- Severity set to 3 - minor