Tasks #51341


Steps to recover file system(s) after recovering the Ceph monitor store

Added by Ramana Raja almost 3 years ago. Updated over 2 years ago.

Status: In Progress
Priority: High
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Tags:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS): multifs, multimds
Pull request ID:

Description

In certain rare cases, all the Ceph Monitors might end up with corrupted Monitor stores. The Monitor stores can be rebuilt from the OSDs using the monitor store rebuild tooling, and the Monitors can be brought back online; the MDSMaps, however, are lost. Additional steps are then required to bring back the file system(s) and the MDSs. The recovery steps differ between single-active-MDS and multi-active-MDS file systems, and between clusters with a single file system and clusters with multiple file systems.
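For context, a hedged sketch of the monitor store rebuild that is assumed to have already been performed before the steps below. The OSD data path, temporary store path, and keyring location are placeholders, and the exact flags should be checked against the documented recovery-using-OSDs procedure:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --no-mon-config --op update-mon-db --mon-store-path /tmp/mon-store    # run for every OSD, accumulating into the same temporary store
ceph-monstore-tool /tmp/mon-store rebuild -- --keyring /path/to/admin.keyring    # rebuild the monitor store from the accumulated data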

The steps identified to bring back a single-active-MDS file system after recovering the Monitor stores and bringing the Monitors back online are as follows (a consolidated example follows the list):

- Ensure all MDSs are stopped on the cluster
systemctl stop ceph-mds@<mds-id>

- Force-create the Ceph file system using the existing file system pools
ceph fs new <fs-name> <cephfs-metadata-poolname> <cephfs-data-poolname> --force

- Reset the file system
ceph fs reset <fs-name> --yes-i-really-mean-it

- Restart MDSs
systemctl start ceph-mds@<mds-id>
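Putting those steps together, a minimal sketch with placeholder names (the file system name cephfs, the pools cephfs_metadata and cephfs_data, and the MDS id a are assumptions for illustration only):

systemctl stop ceph-mds@a                                 # repeat for every MDS daemon in the cluster
ceph fs new cephfs cephfs_metadata cephfs_data --force    # reuse the existing pools
ceph fs reset cephfs --yes-i-really-mean-it
systemctl start ceph-mds@a                                # restart the MDS daemons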

For a multi-active-MDS file system, it may be possible to recover by marking the newly created file system unjoinable, setting max_mds, and then marking it joinable again; see the sketch below.
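A hedged sketch of that sequence, again using the placeholder file system name cephfs; whether this is sufficient for multi-MDS recovery is what this task is investigating (see comment #5 below):

ceph fs set cephfs joinable false    # keep MDSs from joining while the map is adjusted
ceph fs set cephfs max_mds 1
ceph fs set cephfs joinable true     # allow MDSs to join again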


Related issues 1 (0 open, 1 closed)

Related to CephFS - Feature #51716: Add option in `fs new` command to start rank 0 in failed state (Resolved, Ramana Raja)

Actions #1

Updated by Ramana Raja almost 3 years ago

  • Description updated (diff)
Actions #2

Updated by Ramana Raja almost 3 years ago

  • Description updated (diff)
Actions #3

Updated by Ramana Raja almost 3 years ago

  • Description updated (diff)
Actions #4

Updated by Ramana Raja almost 3 years ago

  • Status changed from New to In Progress

Steps to recover a single active MDS file system: https://github.com/ceph/ceph/pull/42295

Actions #5

Updated by Ramana Raja almost 3 years ago

Testing out steps to recover a multiple-active-MDS file system after recovering the monitor store from the OSDs:

- Stop all MDSs

- Create an FSMap with defaults
fs new <fs> <metadata pool> <data pool> --force

- Set MDS rank 0 to the failed state so that it reads the in-RADOS metadata when starting up
fs reset <fs> --yes-i-really-mean-it

- Set the number of active MDSs of the recovered file system to 1
fs set <fs> joinable false
fs set <fs> max_mds 1
fs set <fs> joinable true

- Restart MDSs

- Observed that each MDS gets to the rejoin state and then dies, and the next MDS does the same. Expected one of the MDSs to become active, the other MDSs to become standby, and the file system to be healthy.

   -23> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.d my gid is 4183
   -22> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.d map says I am mds.0.20 state up:rejoin
   -21> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.d msgr says I am [v2:192.168.0.10:6810/120552584,v1:192.168.0.10:6811/120552584]
   -20> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.d handle_mds_map: handling map as rank 0
   -19> 2021-07-19T17:41:20.203-0400 7f924ed5c640  1 mds.0.20 handle_mds_map i am now mds.0.20
   -18> 2021-07-19T17:41:20.203-0400 7f924ed5c640  1 mds.0.20 handle_mds_map state change up:reconnect --> up:rejoin
   -17> 2021-07-19T17:41:20.203-0400 7f924ed5c640  1 mds.0.20 rejoin_start
   -16> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.cache rejoin_start
   -15> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.cache process_imported_caps
   -14> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.openfiles prefetch_inodes
   -13> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.openfiles _prefetch_inodes state 1
   -12> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.openfiles _prefetch_dirfrags
   -11> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.openfiles _prefetch_inodes state 3
   -10> 2021-07-19T17:41:20.203-0400 7f924ed5c640  7 mds.0.cache trim_non_auth
    -9> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.cache number of subtrees = 12; not printing subtrees
    -8> 2021-07-19T17:41:20.203-0400 7f924ed5c640  1 mds.0.20 rejoin_joint_start
    -7> 2021-07-19T17:41:20.203-0400 7f924ed5c640 10 mds.0.cache rejoin_send_rejoins with recovery_set
    -6> 2021-07-19T17:41:20.217-0400 7f9252d64640  1 -- [v2:192.168.0.10:6810/120552584,v1:192.168.0.10:6811/120552584] <== mon.0 v2:192.168.0.10:40360/0 21 ==== mdsbeacon(4183/d up:rejoin seq=5 v22) v8 ==== 130+0+0 (secure 0 0 0) 0x55781989b600 con 0x557814b83600
    -5> 2021-07-19T17:41:20.217-0400 7f9252d64640  5 mds.beacon.d received beacon reply up:rejoin seq 5 rtt 1.046
    -4> 2021-07-19T17:41:20.220-0400 7f924ed5c640 -1 ../src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f924ed5c640 time 2021-07-19T17:41:20.204686-0400
../src/mds/MDCache.cc: 4074: FAILED ceph_assert(auth >= 0)

 ceph version 17.0.0-5874-ga0a8ba5087f (a0a8ba5087f0b82588860cda188dfdb48a964771) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x7f925563ce17]
 2: /home/rraja/ceph/build/lib/libceph-common.so.2(+0x1641099) [0x7f925563d099]
 3: (MDCache::rejoin_send_rejoins()+0xd0f) [0x5578119b409b]
 4: (MDSRank::rejoin_joint_start()+0x133) [0x5578117e6827]
 5: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x19a7) [0x5578117eb395]
 6: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0x1d3b) [0x5578117b9209]
 7: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x50d) [0x5578117bb0d7]
 8: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x19a) [0x5578117ba9f4]
 9: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xe9) [0x7f92558236cf]
 10: (DispatchQueue::entry()+0x626) [0x7f9255822116]
 11: (DispatchQueue::DispatchThread::entry()+0x1c) [0x7f92559a3a74]
 12: (Thread::entry_wrapper()+0x83) [0x7f92555fd551]
 13: (Thread::_entry_func(void*)+0x18) [0x7f92555fd4c4]
 14: /lib64/libpthread.so.0(+0x93f9) [0x7f9253c443f9]
 15: clone()

    -3> 2021-07-19T17:41:20.226-0400 7f924c557640 20 mgrc operator() sending 126 counters (of possible 379), 0 new, 0 removed
    -2> 2021-07-19T17:41:20.233-0400 7f924c557640 20 mgrc _send_report encoded 1494 bytes
    -1> 2021-07-19T17:41:20.233-0400 7f924c557640  1 -- [v2:192.168.0.10:6810/120552584,v1:192.168.0.10:6811/120552584] --> [v2:192.168.0.10:6800/1709313,v1:192.168.0.10:6801/1709313] -- mgrreport(unknown.d +0-0 packed 1494 task_status=0) v9 -- 0x557814b71c00 con 0x557814c74480
     0> 2021-07-19T17:41:20.233-0400 7f924ed5c640 -1 *** Caught signal (Aborted) **
 in thread 7f924ed5c640 thread_name:ms_dispatch

 ceph version 17.0.0-5874-ga0a8ba5087f (a0a8ba5087f0b82588860cda188dfdb48a964771) quincy (dev)
 1: /home/rraja/ceph/build/bin/ceph-mds(+0x1216fd2) [0x557811e2bfd2]
 2: /lib64/libpthread.so.0(+0x141e0) [0x7f9253c4f1e0]
 3: gsignal()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x36c) [0x7f925563cfe6]
 6: /home/rraja/ceph/build/lib/libceph-common.so.2(+0x1641099) [0x7f925563d099]
 7: (MDCache::rejoin_send_rejoins()+0xd0f) [0x5578119b409b]
 8: (MDSRank::rejoin_joint_start()+0x133) [0x5578117e6827]
 9: (MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x19a7) [0x5578117eb395]
 10: (MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0x1d3b) [0x5578117b9209]
 11: (MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x50d) [0x5578117bb0d7]
 12: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x19a) [0x5578117ba9f4]
 13: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0xe9) [0x7f92558236cf]
 14: (DispatchQueue::entry()+0x626) [0x7f9255822116]
 15: (DispatchQueue::DispatchThread::entry()+0x1c) [0x7f92559a3a74]
 16: (Thread::entry_wrapper()+0x83) [0x7f92555fd551]
 17: (Thread::_entry_func(void*)+0x18) [0x7f92555fd4c4]
 18: /lib64/libpthread.so.0(+0x93f9) [0x7f9253c443f9]
 19: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

- Also, on Patrick's suggestion, made sure that the file system was not mounted before stopping all the MDSs, removing the file system/FSMap, and testing the above recovery steps. But I still end up with the same MDS crash. The full command sequence tested is consolidated below for reference.
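For reference, the whole sequence from this comment written out with placeholder names (file system cephfs, pools cephfs_metadata and cephfs_data, and MDS id a are assumptions for illustration):

systemctl stop ceph-mds@a                                  # stop every MDS daemon
ceph fs new cephfs cephfs_metadata cephfs_data --force     # recreate the FSMap with defaults
ceph fs reset cephfs --yes-i-really-mean-it                # put rank 0 in the failed state so it reads in-RADOS metadata
ceph fs set cephfs joinable false
ceph fs set cephfs max_mds 1                               # set the number of active MDSs to 1
ceph fs set cephfs joinable true
systemctl start ceph-mds@a                                 # restart the MDS daemons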

Actions #6

Updated by Ramana Raja almost 3 years ago

  • Related to Feature #51716: Add option in `fs new` command to start rank 0 in failed state added
Actions #7

Updated by Ramana Raja over 2 years ago

  • Tracker changed from Documentation to Tasks