Bug #22548

mds: crash during recovery

Added by wei jin about 6 years ago. Updated about 5 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(FS): MDS
Labels (FS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

2017-12-27 23:27:05.919710 7f08483d0700 -1 *** Caught signal (Aborted) **
in thread 7f08483d0700 thread_name:ms_dispatch

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (()+0x508527) [0x56125f945527]
2: (()+0xf890) [0x7f084ddb1890]
3: (gsignal()+0x37) [0x7f084c23c067]
4: (abort()+0x148) [0x7f084c23d448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x56125fa4eda6]
6: (Locker::file_recover(ScatterLock*)+0x1f9) [0x56125f7b6829]
7: (MDCache::start_files_to_recover()+0xbb) [0x56125f6f0c9b]
8: (MDSRank::clientreplay_start()+0x76) [0x56125f654ea6]
9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x1c5a) [0x56125f666e8a]
10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xd16) [0x56125f63f196]
11: (MDSDaemon::handle_core_message(Message*)+0x783) [0x56125f640863]
12: (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x56125f640afb]
13: (DispatchQueue::entry()+0x7ba) [0x56125fb5d0ca]
14: (DispatchQueue::DispatchThread::entry()+0xd) [0x56125fa334cd]
15: (()+0x8064) [0x7f084ddaa064]
16: (clone()+0x6d) [0x7f084c2ef62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

History

#1 Updated by Zheng Yan about 6 years ago

Which line triggers the assertion?

#2 Updated by wei jin about 6 years ago

Zheng Yan wrote:

Which line triggers the assertion?

Hi, Yan,

This line:
0> 2017-12-27 23:27:05.892112 7f08483d0700 -1 mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7f08483d0700 time 2017-12-27 23:27:05.890326
mds/Locker.cc: 4924: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
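
For reference, a minimal, hypothetical C++ sketch of what that assertion expresses (simplified stand-ins, not the actual Ceph source): Locker::file_recover() expects the file lock to still be in the LOCK_PRE_SCAN state it was given when the inode was queued for recovery, and the MDS aborts if anything has already moved the lock to another state.

#include <cassert>

// Simplified stand-ins for the lock states named in the log; the real
// ScatterLock state machine has many more states and transitions.
enum LockState { LOCK_SYNC, LOCK_PRE_SCAN, LOCK_SCAN };

struct ScatterLockSketch {
  LockState state = LOCK_SYNC;
  LockState get_state() const { return state; }
  void set_state(LockState s) { state = s; }
};

// Rough analogue of the failing check: a recovery scan is only started on a
// lock that is still waiting in LOCK_PRE_SCAN.
void file_recover_sketch(ScatterLockSketch *lock) {
  assert(lock->get_state() == LOCK_PRE_SCAN);  // the check that failed here
  lock->set_state(LOCK_SCAN);                  // the recovery scan would start here
}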

#3 Updated by Zheng Yan about 6 years ago

This can probably be fixed by the patch below. How many times have you encountered this issue?

diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index bc0dcecb7b..4ff613bc86 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1432,8 +1432,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }
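
For context, a hedged reading of why the reordering matters (my interpretation of the patch, not an authoritative description): the waiting_for_replay contexts can begin processing replayed client requests, and such a request may transition a file lock out of LOCK_PRE_SCAN before start_files_to_recover() reaches it, which trips the assertion above. Calling start_files_to_recover() first handles every queued lock while it is still in LOCK_PRE_SCAN. A self-contained C++ sketch of the ordering, with all names hypothetical:

#include <cassert>
#include <functional>
#include <vector>

enum LockState { LOCK_SYNC, LOCK_PRE_SCAN, LOCK_SCAN };

struct FileLockSketch {
  LockState state = LOCK_PRE_SCAN;   // set when the file was queued for recovery
};

void file_recover_sketch(FileLockSketch &l) {
  assert(l.state == LOCK_PRE_SCAN);  // same expectation as the failed assert
  l.state = LOCK_SCAN;
}

int main() {
  FileLockSketch lock;

  // Stand-in for waiting_for_replay: a replayed client request that changes
  // the lock's state once it is processed.
  std::vector<std::function<void()>> waiting_for_replay = {
    [&lock] { lock.state = LOCK_SYNC; }
  };

  // Patched order: recover files first, while the lock is still LOCK_PRE_SCAN...
  file_recover_sketch(lock);
  // ...then kick the replay waiters.
  for (auto &ctx : waiting_for_replay) ctx();

  // With the original order (waiters kicked first), the lambda above would
  // have moved the lock to LOCK_SYNC and file_recover_sketch() would abort.
  return 0;
}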

#4 Updated by wei jin about 6 years ago

Just once.

Recovery took quite a long time and then the MDS crashed. There are about 10M files in the file system.
Is it easy to reproduce? I cannot try to reproduce it myself because this is a production environment.

Another question: is standby-replay stable enough to enable, and will it affect performance? I notice the reply latency of the mds daemon is already about 10-20 ms. I wonder whether it is worth enabling to reduce recovery overhead during failover.
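
For reference on the standby-replay question: in jewel-era releases (10.2.x) standby-replay was typically enabled per daemon in ceph.conf, roughly as in the sketch below. The daemon name "b" and the followed MDS name "a" are placeholders, and the option names are as I recall them for that release line, so verify against the jewel documentation. A standby-replay daemon continuously replays the active MDS's journal, which shortens failover time at the cost of running an extra daemon.

[mds.b]
    # hypothetical standby daemon that follows the active MDS named "a"
    mds standby replay = true
    mds standby for name = a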

#5 Updated by Patrick Donnelly about 6 years ago

  • Subject changed from crash during mds recovery to mds: crash during recovery
  • Status changed from New to Need More Info
  • Assignee set to Zheng Yan
  • Release set to jewel
  • Component(FS) MDS added

#6 Updated by Patrick Donnelly about 5 years ago

  • Assignee deleted (Zheng Yan)
