Bug #22548
mds: crash during recovery
Description
2017-12-27 23:27:05.919710 7f08483d0700 -1 *** Caught signal (Aborted) **
in thread 7f08483d0700 thread_name:ms_dispatch
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
1: (()+0x508527) [0x56125f945527]
2: (()+0xf890) [0x7f084ddb1890]
3: (gsignal()+0x37) [0x7f084c23c067]
4: (abort()+0x148) [0x7f084c23d448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x56125fa4eda6]
6: (Locker::file_recover(ScatterLock*)+0x1f9) [0x56125f7b6829]
7: (MDCache::start_files_to_recover()+0xbb) [0x56125f6f0c9b]
8: (MDSRank::clientreplay_start()+0x76) [0x56125f654ea6]
9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x1c5a) [0x56125f666e8a]
10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xd16) [0x56125f63f196]
11: (MDSDaemon::handle_core_message(Message*)+0x783) [0x56125f640863]
12: (MDSDaemon::ms_dispatch(Message*)+0xdb) [0x56125f640afb]
13: (DispatchQueue::entry()+0x7ba) [0x56125fb5d0ca]
14: (DispatchQueue::DispatchThread::entry()+0xd) [0x56125fa334cd]
15: (()+0x8064) [0x7f084ddaa064]
16: (clone()+0x6d) [0x7f084c2ef62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
History
#1 Updated by Zheng Yan about 6 years ago
Which line triggers the assertion?
#2 Updated by wei jin about 6 years ago
Zheng Yan wrote:
Which line triggers the assertion?
Hi Yan,
this line:
0> 2017-12-27 23:27:05.892112 7f08483d0700 -1 mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7f08483d0700 time 2017-12-27 23:27:05.890326
mds/Locker.cc: 4924: FAILED assert(lock->get_state() == LOCK_PRE_SCAN)
#3 Updated by Zheng Yan about 6 years ago
This can probably be fixed by the patch below. How many times have you encountered this issue?
diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc
index bc0dcecb7b..4ff613bc86 100644
--- a/src/mds/MDSRank.cc
+++ b/src/mds/MDSRank.cc
@@ -1432,8 +1432,8 @@ void MDSRank::rejoin_done()
 void MDSRank::clientreplay_start()
 {
   dout(1) << "clientreplay_start" << dendl;
-  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   mdcache->start_files_to_recover();
+  finish_contexts(g_ceph_context, waiting_for_replay);  // kick waiters
   queue_one_replay();
 }
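The patch swaps two calls so that file recovery is kicked off while the scatter locks are still in the PRE_SCAN state, before any queued replay waiters get a chance to transition them. The minimal sketch below models that ordering hazard; all names here (LockState, ScatterLockModel, the callbacks) are simplified stand-ins for illustration, not the real Ceph MDS code:

```cpp
#include <functional>
#include <vector>

// Hypothetical minimal model of the ordering bug. A replay waiter,
// once finished, can move a lock out of PRE_SCAN; if that happens
// before recovery starts, the PRE_SCAN precondition no longer holds.
enum class LockState { PRE_SCAN, SCAN, LOCK };

struct ScatterLockModel {
    LockState state = LockState::PRE_SCAN;
};

// Stand-in for Locker::file_recover(): requires the lock to still be
// in PRE_SCAN, mirroring the assertion that failed in this report.
bool file_recover(ScatterLockModel& lock) {
    if (lock.state != LockState::PRE_SCAN)
        return false;  // real code: FAILED assert(... == LOCK_PRE_SCAN)
    lock.state = LockState::SCAN;
    return true;
}

// Simulate clientreplay_start() with either ordering of the two calls.
bool clientreplay_start(bool kick_waiters_first) {
    ScatterLockModel lock;
    // A queued replay waiter that, when run, moves the lock out of
    // PRE_SCAN (e.g. by processing a client request on that inode).
    std::vector<std::function<void()>> waiting_for_replay = {
        [&lock] { lock.state = LockState::LOCK; }
    };
    auto finish_contexts = [&] {
        for (auto& c : waiting_for_replay) c();
        waiting_for_replay.clear();
    };

    if (kick_waiters_first) {
        finish_contexts();             // pre-patch order: waiters run first...
        return file_recover(lock);     // ...so the precondition is violated
    } else {
        bool ok = file_recover(lock);  // patched order: recover first
        finish_contexts();             // then kick the waiters
        return ok;
    }
}
```

In this model, `clientreplay_start(true)` (the pre-patch order) fails the precondition, while `clientreplay_start(false)` (the patched order) succeeds, which is the intuition behind moving `start_files_to_recover()` above `finish_contexts()`.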
#4 Updated by wei jin about 6 years ago
Just once.
Recovery took quite a long time and then the daemon crashed. There are about 10M files in the file system.
Is it easy to reproduce? I cannot try, because it is a production environment.
Another question: is standby-replay stable enough to enable? And will it affect performance? I notice the reply latency of the mds daemon is already about 10-20 ms. I wonder whether it is worth enabling to reduce recovery overhead during failover.
#5 Updated by Patrick Donnelly about 6 years ago
- Subject changed from crash during mds recovery to mds: crash during recovery
- Status changed from New to Need More Info
- Assignee set to Zheng Yan
- Release set to jewel
- Component(FS) MDS added
#6 Updated by Patrick Donnelly about 5 years ago
- Assignee deleted (Zheng Yan)