Bug #21337

luminous: MDS is not getting past up:replay on Luminous cluster

Added by Eric Eastman over 6 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On my Luminous 12.2.0 test cluster, which has been running for the last few days, the MDS process does not get past up:replay before the log shows:

2017-09-10 23:08:49.109923 7ffabcee8700 -1 *** Caught signal (Aborted) **
in thread 7ffabcee8700 thread_name:md_log_replay

Then the process is restarted by systemd.

The ceph-mds process runs at ~120% CPU for several minutes and appears to be reading a large amount of data from the OSDs, as shown by network stats and client I/O stats; then the MDS process exits with the Caught signal (Aborted) message.

This was happening on all three of my MDS servers: one would fail, the next would try to replay, and then it would fail too. To make debugging simpler, I have gone to a single MDS by disabling my other 2 MDS servers, set debug_mds = 9/20, and attached the log file.

From this log:

2017-09-11 02:01:55.730578 7f40e5611700  5 mds.ede-c2-mon03 handle_mds_map epoch 29027 from mon.1
2017-09-11 02:02:00.086795 7f40e5611700  5 mds.ede-c2-mon03 handle_mds_map epoch 29028 from mon.1
2017-09-11 02:02:04.889403 7f40dde02700  0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-11 02:02:04.909457 7f40dde02700 -1 /build/ceph-12.2.0/src/mds/journal.cc: In function 'virtual void EOpen::replay(MDSRank*)' thread 7f40dde02700 time 2017-09-11 02:02:04.889408
/build/ceph-12.2.0/src/mds/journal.cc: 2214: FAILED assert(in)

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56184cb6d132]
 2: (EOpen::replay(MDSRank*)+0x776) [0x56184cb0d796]
 3: (MDLog::_replay_thread()+0xc6b) [0x56184cac016b]
 4: (MDLog::ReplayThread::entry()+0xd) [0x56184c84007d]
 5: (()+0x76ba) [0x7f40eb1016ba]
 6: (clone()+0x6d) [0x7f40ea16d82d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This problem is very repeatable on the system right now. Grepping the log file (captured with default logging) shows:

grep -C1 "Caught signal" ceph-mds.ede-c2-mon03.log
--- end dump of recent events ---
2017-09-10 20:47:35.512059 7f7701f53700 -1 *** Caught signal (Aborted) **
 in thread 7f7701f53700 thread_name:md_log_replay
--
    -1> 2017-09-10 20:47:35.174509 7f770bf02700  1 -- 10.15.2.123:6800/4203097469 <== osd.0 10.15.2.132:6802/1957 940 ==== osd_op_reply(16961 200.0003150d [read 0~4194304 [fadvise_dontneed]] v0'0 uv50294 ondisk = 0) v8 ==== 156+0+4194304 (3089052591 0 939397831) 0x55e9be794380 con 0x55e9be95a000
     0> 2017-09-10 20:47:35.512059 7f7701f53700 -1 *** Caught signal (Aborted) **
 in thread 7f7701f53700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:16:03.722304 7f6334254700 -1 *** Caught signal (Aborted) **
 in thread 7f6334254700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:16:03.345563 7f633ea04700  1 -- 10.15.2.123:6800/3477605165 <== osd.2 10.15.2.131:6802/1842 757 ==== osd_op_reply(16962 200.0003150e [read 0~4194304 [fadvise_dontneed]] v0'0 uv50000 ondisk = 0) v8 ==== 156+0+4194304 (2529611885 0 3069046118) 0x55eabccebc00 con 0x55eabcd68000
     0> 2017-09-10 21:16:03.722304 7f6334254700 -1 *** Caught signal (Aborted) **
 in thread 7f6334254700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:44:36.999431 7f9921cfe700 -1 *** Caught signal (Aborted) **
 in thread 7f9921cfe700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:44:36.827626 7f992bcad700  5 -- 10.15.2.123:6800/2182054169 >> 10.15.2.123:6789/0 conn(0x559f1bcab000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=87 cs=1 l=1). rx mon.2 seq 786 0x559ff1d0f500 mdsbeacon(344336/ede-c2-mon03 up:replay seq 429 v25639) v7
     0> 2017-09-10 21:44:36.999431 7f9921cfe700 -1 *** Caught signal (Aborted) **
 in thread 7f9921cfe700 thread_name:md_log_replay
--
--- end dump of recent events ---
2017-09-10 21:59:13.121796 7f2b6a7ee700 -1 *** Caught signal (Aborted) **
 in thread 7f2b6a7ee700 thread_name:md_log_replay
--
    -1> 2017-09-10 21:59:12.838076 7f2b71ffd700  1 -- 10.15.2.123:6800/1508149898 <== mon.2 10.15.2.123:6789/0 556 ==== mdsmap(e 25973) v1 ==== 721+0+0 (2764755316 0 0) 0x55f77ade9400 con 0x55f6a8342800
     0> 2017-09-10 21:59:13.121796 7f2b6a7ee700 -1 *** Caught signal (Aborted) **
 in thread 7f2b6a7ee700 thread_name:md_log_replay
--

System information
OS: Ubuntu 16.04 with a 4.13.0 kernel
uname -a
Linux ede-c2-mon03 4.13.0-041300-generic #201709031731 SMP Sun Sep 3 21:33:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ceph -s
  cluster:
    id:     6d8aae1e-1125-11e5-a708-001b78e265be
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
    mgr: ede-c2-mon01(active), standbys: ede-c2-mon02, ede-c2-mon03
    mds: cephfs-1/1/1 up  {0=ede-c2-mon03=up:replay}
    osd: 19 osds: 19 up, 19 in

  data:
    pools:   4 pools, 1792 pgs
    objects: 95597 objects, 261 GB
    usage:   799 GB used, 496 GB / 1296 GB avail
    pgs:     1792 active+clean

  io:
    client:   74267 kB/s rd, 18 op/s rd, 0 op/s wr

ceph -v
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)

free
              total        used        free      shared  buff/cache   available
Mem:       16427028      722084      201700      118848    15503244    15226820
Swap:        999420           0      999420

NOTE: CephFS snapshots are turned on

ceph-mds.ede-c2-mon03.log-d9-20.gz (201 KB) Eric Eastman, 09/11/2017 06:38 AM

ceph-mds.ede-c2-mon03.log-11Sep17.50K.bz2 (575 KB) Eric Eastman, 09/11/2017 07:31 PM

History

#1 Updated by Zheng Yan over 6 years ago

Please set debug_mds=10, restart the MDS, and upload the full log. To recover the situation, just replace the 'assert(in)' with 'continue'.
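
For reference, here is a minimal sketch of what that workaround looks like in src/mds/journal.cc, inside EOpen::replay() near the failing assert at line 2214. The loop shape and member names below (snap_inos, get_inode(), item_open_file) are reconstructed from the backtrace and the "not in metablob" log line above rather than copied from the 12.2.0 source, so treat them as approximate:

// Approximate shape of the affected loop in EOpen::replay(); names are
// reconstructed from the log and backtrace above and may not match 12.2.0 exactly.
for (const auto &vino : snap_inos) {
  CInode *in = mds->mdcache->get_inode(vino);
  if (!in) {
    dout(0) << "EOpen.replay ino " << vino << " not in metablob" << dendl;
    continue;  // workaround: was 'assert(in);' -- skip inodes missing from the metablob
  }
  _segment->open_files.push_back(&in->item_open_file);
}

With that one-line change, replay logs the missing inode and keeps going instead of aborting the MDS.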

#2 Updated by Eric Eastman over 6 years ago

The log file with debug_mds=10 from MDS startup to reaching the assert is 110GB. I am attaching the last 50K lines. If you want the whole file, let me know, and I will compress it and put it on a public ftp server.

#3 Updated by Eric Eastman over 6 years ago

Replacing the 'assert(in)' with 'continue' did get the Ceph file system working again. Looking at the log, the new continue statement was hit 430 times. Was there any data loss caused by replacing the 'assert(in)' with 'continue'? The errors looked like:

2017-09-12 12:25:06.736870 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-12 12:38:59.158432 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x100045cb414.170 not in metablob

#4 Updated by Zheng Yan over 6 years ago

Eric Eastman wrote:

Replacing the 'assert(in)' with 'continue' did get the Ceph file system working again. Looking at the log, the new continue statement was hit 430 times. Was there any data loss caused by replacing the 'assert(in)' with 'continue'? The errors looked like:

2017-09-12 12:25:06.736870 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x1000411038c.156 not in metablob
2017-09-12 12:38:59.158432 7f2bb735e700 0 mds.0.journal EOpen.replay ino 0x100045cb414.170 not in metablob

The bug affects snapshotted metadata: the metadata of a snapshotted inode can be slightly out of date (it may not reflect the inode's metadata at the time the snapshot was created).

You mentioned 3 MDSs. Do you mean running 3 active MDSs? Snapshot support on a multi-MDS cluster is buggy in Luminous (see PR https://github.com/ceph/ceph/pull/16779).

#5 Updated by Eric Eastman over 6 years ago

One active, one standby-replay, and one standby, as shown:

mds: cephfs-1/1/1 up {0=ede-c2-mon02=up:active}, 1 up:standby-replay, 1 up:standby

As snapshots are more important to our workflow than either multiple MDSs or multiple file systems, we are running with 1 active MDS and only 1 file system. We are looking forward to the day when snapshots are fully supported.

#6 Updated by Zheng Yan over 6 years ago

Eric Eastman wrote:

The log file with debug_mds=10 from MDS startup to reaching the assert is 110GB. I am attaching the last 50K lines. If you want the whole file, let me know, and I will compress it and put it on a public ftp server.

How large is the log from 'replay start' to the first appearance of "EOpen.replay ino xxx not in metablob"? Could you upload the log?

#7 Updated by Eric Eastman over 6 years ago

The entire log, compressed with bzip2, is 4.6G. You can download it from:
ftp://ftp.keepertech.com/outgoing/eric/ceph_logs/ceph-mds.ede-c2-mon03.log-11Sep17.bz2

#8 Updated by Patrick Donnelly over 6 years ago

  • Assignee set to Zheng Yan

#9 Updated by Zheng Yan over 6 years ago

  • Subject changed from MDS is not getting past up:replay on Luminous cluster to luminous: MDS is not getting past up:replay on Luminous cluster
  • Status changed from New to Fix Under Review
  • Backport set to luminous
  • Affected Versions v12.0.0 added

https://github.com/ceph/ceph/pull/17994

The patch is only for luminous. This bug has been fixed in master in a different way (the change is much bigger).

#10 Updated by Zheng Yan over 6 years ago

  • Release set to luminous

#11 Updated by Nathan Cutler over 6 years ago

  • Tracker changed from Bug to Backport
  • Status changed from Fix Under Review to Resolved
  • Target version set to v12.2.2

#12 Updated by Nathan Cutler over 6 years ago

  • Tracker changed from Backport to Bug
  • % Done set to 0
  • Backport set to luminous
  • Regression set to No
  • Severity set to 3 - minor
