Bug #6755
closed
mds assert soon after startup (while recovering)
0%
Description
I debugged this one a bit, so I'll try to summarize what I found. I was in the process of upgrading from 0.64.4 to 0.72. The mon and mds had already been upgraded, along with some of the OSDs. Then the mds crashed with a strange error message:
2013-11-12 14:21:08.880249 7f2ae8ef6700 0 mds.0.cache recovery error! -23
2013-11-12 14:21:08.887905 7f2ae8ef6700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7f2ae8ef6700 time 2013-11-12 14:21:08.880291
mds/MDCache.cc: 5808: FAILED assert(0 == "unexpected error from osd during recovery")
After this, the mds always asserts soon after it becomes active. From what I can see in the logs, some files are somehow marked "lost" even though the mon cluster is healthy.
mds tries to recover a file:
    void MDCache::do_file_recover()
osd processes request:
    void ReplicatedPG::do_op(OpRequestRef op)
        if ((op->may_read()) && (obc->obs.oi.is_lost()))
            osd->reply_op_error(op, -ENFILE);
server sees error:
    void MDCache::_recovered(CInode *in, int r, uint64_t size, utime_t mtime)
        if (r != 0) {
            assert(0 == "unexpected error from osd during recovery");
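To make the chain above easier to follow, here is a minimal standalone sketch of it. It is not Ceph code: osd_read_result, recovery_would_assert and describe_result are hypothetical names made up for illustration, and they only model the two checks quoted above (lost object on read gives -ENFILE, any non-zero result in _recovered is fatal).

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical stand-in for the OSD read path: a read on an object
// flagged "lost" is answered with -ENFILE, as in the do_op excerpt.
int osd_read_result(bool object_is_lost) {
    return object_is_lost ? -ENFILE : 0;
}

// Hypothetical stand-in for the check in MDCache::_recovered:
// any non-zero result is treated as fatal.
bool recovery_would_assert(int r) {
    return r != 0;
}

// Turn a negative errno-style result into the platform's
// human-readable message for that errno value.
std::string describe_result(int r) {
    if (r == 0) return "ok";
    return std::strerror(-r);
}
```

On Linux, ENFILE has the value 23, which matches the "recovery error! -23" line in the mds log. So osd_read_result(true) yields the same code the mds reported, and recovery_would_assert returns true for it.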
I attached osd and mds logs for one startup. Look at everything related to inode 100001329e3. From the rest of the logs, I can see that many more inodes are affected. I suspect that only files that were being written to are affected, but it is unclear why they are marked as "lost", since the PGs are all healthy.
Files