Feature #8690
mds: Allow some kind of recovery when pools are deleted out from underneath us
Status:
New
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Reviewed:
Affected Versions:
Component(FS):
Labels (FS):
Pull request ID:
Description
I once had a secondary (cache) pool connected to a CephFS directory as follows:
ceph mds add_data_pool data_cache
cephfs /mnt/ceph/cached set_layout --pool {NN}
The pool became badly broken, and given all of those uncaught OSD crashes, the only viable option was to delete it in order to keep the cluster operational. Prior to removing the pool, I issued the following command:
ceph mds remove_data_pool data_cache
Now, although the cluster reports as healthy, the MDSes are stuck in an endless crash-replay-active loop, staying active for only a few seconds before yet another crash:
mds/CInode.cc: In function 'virtual void C_Inode_StoredBacktrace::finish(int)' thread 7f4788d5c700 time 2014-06-28 14:08:29.922256
mds/CInode.cc: 1041: FAILED assert(r == 0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x3a5725) [0x7f478e2e5725]
 2: (Context::complete(int)+0x9) [0x7f478e0c65c9]
 3: (C_Gather::sub_finish(Context*, int)+0x217) [0x7f478e0c7dc7]
 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x7f478e0c7f02]
 5: (Context::complete(int)+0x9) [0x7f478e0c65c9]
 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f478e34aa5e]
 7: (MDS::handle_core_message(Message*)+0xb3f) [0x7f478e0e8b7f]
 8: (MDS::_dispatch(Message*)+0x32) [0x7f478e0e8d72]
 9: (MDS::ms_dispatch(Message*)+0xab) [0x7f478e0ea75b]
 10: (DispatchQueue::entry()+0x58a) [0x7f478e51fc1a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f478e43c07d]
 12: (()+0x80ca) [0x7f478d8960ca]
 13: (clone()+0x6d) [0x7f478c20bffd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
See full log attached (all cluster components are 0.80.1).
At the moment I'm trying to remove all files that are still visible under "/mnt/ceph/cached", but it is going very slowly, as only a few dozen files get removed before yet another MDS crash.