Feature #8690

MDS: Allow some kind of recovery when pools are deleted out from underneath us

Added by Dmitry Smirnov almost 10 years ago. Updated almost 8 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:

Description

I had a secondary (cache) pool that was once attached to a CephFS directory as follows:

ceph mds add_data_pool data_cache
cephfs /mnt/ceph/cached set_layout --pool {NN}

The pool got badly broken, and due to all those uncaught OSD crashes the only viable option was to delete it in order to keep the cluster operational. Prior to removing the pool I issued the following command:

ceph mds remove_data_pool data_cache

Now, even though the cluster is healthy, the MDSes are in an endless crash-replay-active loop, staying active only for several seconds before yet another crash:

mds/CInode.cc: In function 'virtual void C_Inode_StoredBacktrace::finish(int)' thread 7f4788d5c700 time 2014-06-28 14:08:29.922256
mds/CInode.cc: 1041: FAILED assert(r == 0)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x3a5725) [0x7f478e2e5725]
 2: (Context::complete(int)+0x9) [0x7f478e0c65c9]
 3: (C_Gather::sub_finish(Context*, int)+0x217) [0x7f478e0c7dc7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x7f478e0c7f02]
 5: (Context::complete(int)+0x9) [0x7f478e0c65c9]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f478e34aa5e]
 7: (MDS::handle_core_message(Message*)+0xb3f) [0x7f478e0e8b7f]
 8: (MDS::_dispatch(Message*)+0x32) [0x7f478e0e8d72]
 9: (MDS::ms_dispatch(Message*)+0xab) [0x7f478e0ea75b]
 10: (DispatchQueue::entry()+0x58a) [0x7f478e51fc1a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f478e43c07d]
 12: (()+0x80ca) [0x7f478d8960ca]
 13: (clone()+0x6d) [0x7f478c20bffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

See full log attached (all cluster components are 0.80.1).

At the moment I'm trying to remove all files that are still visible under "/mnt/ceph/cached", but it goes very slowly, as only a few dozen files get removed before yet another MDS crash.


Files

ceph-mds.debstor.log.xz (120 KB) Dmitry Smirnov, 06/27/2014 09:35 PM
Actions #1

Updated by Dmitry Smirnov almost 10 years ago

CephFS was practically unusable until I applied the following patch to the MDS:

Description: prevent crash with directory mapped to removed pool.
     0> -1 mds/CInode.cc: In function 'virtual void C_Inode_StoredBacktrace::finish(int)'
 mds/CInode.cc: 1041: FAILED assert(r == 0)

--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -1037,9 +1037,9 @@
   version_t version;
   Context *fin;
   C_Inode_StoredBacktrace(CInode *i, version_t v, Context *f) : in(i), version(v), fin(f) {}
   void finish(int r) {
-    assert(r == 0);
+    //assert(r == 0);
     in->_stored_backtrace(version, fin);
   }
 };

I think I wouldn't have had to do this if the MDS were able to recognise such a situation, or if there were any option to drop the old directory. As a general note, I wish all asserts like this were caught in order to avoid very unpleasant and time-consuming crashes. Perhaps a command-line option could be added to start the MDS in a "maintenance" mode?
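
(For reference, a more targeted variant of the same workaround could keep the assert for genuine bugs and only tolerate the "object/pool is gone" case. The following is just a sketch of that idea against the same finish() method; it assumes the deleted pool surfaces here as -ENOENT, which may not hold in every case.)

    void finish(int r) {
      if (r != -ENOENT) {
        // Any failure other than "the backtrace object/pool no longer exists"
        // is still treated as fatal, exactly as before.
        assert(r == 0);
      }
      // Otherwise mark the backtrace as stored so the MDS can make progress
      // and the administrator can delete the affected files.
      in->_stored_backtrace(version, fin);
    }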

Actions #2

Updated by Sage Weil almost 10 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Priority changed from High to Urgent
Actions #3

Updated by John Spray almost 10 years ago

Hmm, so to recover from this case I guess we could catch the case where we're writing to a data pool that no longer exists, and re-write the layout for affected inodes to refer to the primary data pool so that the MDS has somewhere to write the backtrace? Or we could avoid modifying the layout at all and just transparently redirect all I/O aimed at non-existent pools to the primary data pool, in order to limp along long enough for the user to delete the affected directories.
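
To make the second option concrete: the write path would consult the MDSMap's data-pool list and silently substitute the primary data pool whenever the layout's pool is gone. A minimal standalone sketch of that selection logic (a hypothetical helper for illustration, not existing MDS code):

    #include <cstdint>
    #include <set>

    // Pick the pool for a backtrace/data write, falling back to the primary
    // data pool when the inode's layout points at a pool the MDSMap no longer lists.
    int64_t choose_write_pool(int64_t layout_pool,
                              const std::set<int64_t>& mdsmap_data_pools,
                              int64_t primary_data_pool)
    {
      if (mdsmap_data_pools.count(layout_pool))
        return layout_pool;        // pool still exists: behave as today
      return primary_data_pool;    // pool was deleted: limp along on the primary pool
    }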

Actions #4

Updated by Greg Farnum almost 10 years ago

  • Priority changed from Urgent to Normal

Except that's not really sufficient; we'd need to identify it as a non-existent pool and deal with cases where the pool somehow reappears later on. Recovering from this case seems to me to require some kind of repair, and I'm not sure how we could best go about it.

Did this also involve the "data_cache" pool being a cache pool? Because I don't even know what would have happened given that.

Mostly, I think having the MDS crash here is the right thing (it's even worse than "I lost some of my data", it's "you took away my hard drive!"), but it does illustrate that we might want to have some other controls on removing data pools. Perhaps we could add per-pool information to the rstats (eww) and use that to prevent removal of pools that still have data in them? Although that would get awkward under other sorts of failures, and reading the inode data from the monitor isn't very feasible.

If we want to do something, I think the extent of what's reasonable here is some kind of startup mode or config option that doesn't even try to write any updates to pools which aren't members of the data pools list. :/
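
Roughly, such a config-gated mode could look like the following (the option name and structure are made up for illustration, not real Ceph code or configuration): when the option is enabled, the MDS skips writes whose target pool is no longer in the data-pool list instead of hitting the assert.

    #include <cstdint>
    #include <set>

    struct MDSConfig {
      // Off by default, so today's strict behaviour is unchanged unless the
      // administrator explicitly opts in.
      bool skip_writes_to_missing_pools = false;
    };

    // Decide whether to attempt a write to target_pool at all.
    bool should_attempt_write(const MDSConfig& conf,
                              int64_t target_pool,
                              const std::set<int64_t>& data_pools)
    {
      if (data_pools.count(target_pool))
        return true;                               // pool is known: write as usual
      return !conf.skip_writes_to_missing_pools;   // unknown pool: skip only if opted in
    }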

Actions #5

Updated by Dmitry Smirnov almost 10 years ago

Yes, "data_cache" was a tiered cache pool but EC pool behind it was dropped as well.

IMHO recovery shouldn't be too sophisticated or intelligent.
Most importantly, we should allow the cluster administrator to recover without patching and rebuilding Ceph from source.
It might be sufficient to introduce an option such as "--skip-missing-pools" for the MDS, letting the admin (who presumably knows whether the pool was dropped intentionally) start the MDS and delete the corresponding files.

I can say that I was able to start unpatched MDSes once I removed all files associated with the removed pool.
When I started a patched MDS I could access CephFS and see all the directories and files from the removed pool -- it was very convenient to know exactly what was lost. (Of course, all those files were empty.)

It was crucial to restore access to CephFS because there were unaffected files on the replicated "data" pool.

Actions #6

Updated by Greg Farnum almost 10 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from MDS_0.80.1 crash (removed pool): mds/CInode.cc: 1041: FAILED assert(r == 0) to MDS: Allow some kind of recovery when pools are deleted out from underneath us
Actions #7

Updated by Greg Farnum almost 8 years ago

  • Category set to Correctness/Safety