Bug #6755
mds assert soon after startup (while recovering)
Status: Closed
Description
Since I debugged this one a bit, I'll try to summarize what I could gather. I was in the process of upgrading from 0.64.4 to 0.72. The mon and mds were already upgraded, as were some of the OSDs. Then the mds crashed, showing a strange error message:
2013-11-12 14:21:08.880249 7f2ae8ef6700 0 mds.0.cache recovery error! -23
2013-11-12 14:21:08.887905 7f2ae8ef6700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7f2ae8ef6700 time 2013-11-12 14:21:08.880291
mds/MDCache.cc: 5808: FAILED assert(0 == "unexpected error from osd during recovery")
After this one, the mds always asserts soon after it is active. From what I can see from the logs, there are files somehow marked "lost" even though the mon cluster is healthy.
mds tries to recover a file:
  void MDCache::do_file_recover()
osd processes the request and rejects reads on objects marked lost:
  void ReplicatedPG::do_op(OpRequestRef op)
    if ((op->may_read()) && (obc->obs.oi.is_lost()))
      osd->reply_op_error(op, -ENFILE);
mds recovery callback sees the error and asserts:
  void MDCache::_recovered(CInode *in, int r, uint64_t size, utime_t mtime)
    if (r != 0) {
      assert(0 == "unexpected error from osd during recovery");
I attached osd and mds logs for one startup. Look at everything related to inode 100001329e3. From the rest of the logs, I can see that many more inodes are affected. I suppose that only files which were being written to are affected, but it is unclear why they are marked as "lost", since the PGs are all healthy.
Updated by Markus Blank-Burian over 10 years ago
Correction to the description: upgrading from 0.67.4 to 0.72.
Updated by Zheng Yan over 10 years ago
I think the mds will function again after deleting all the lost objects.