Bug #6755

closed

mds assert soon after startup (while recovering)

Added by Markus Blank-Burian over 10 years ago. Updated over 10 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since I debugged this one a bit, I will try to summarize what I found. I was in the process of upgrading from 0.67.4 to 0.72. The mon and mds had already been upgraded, along with some of the OSDs. Then the mds crashed with a strange error message:

2013-11-12 14:21:08.880249 7f2ae8ef6700 0 mds.0.cache recovery error! -23
2013-11-12 14:21:08.887905 7f2ae8ef6700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7f2ae8ef6700 time 2013-11-12 14:21:08.880291
mds/MDCache.cc: 5808: FAILED assert(0 == "unexpected error from osd during recovery")
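
The -23 in the first log line looks like a negated errno value, and on Linux errno 23 is ENFILE. The following is a minimal sketch for decoding such a code, not Ceph source: the helper names `describe_recovery_error` and `is_lost_object_reply` are hypothetical and exist only for illustration.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper (not part of Ceph): turn a negated errno, such as the
// -23 logged by "recovery error! -23", into a readable description.
std::string describe_recovery_error(int r) {
    if (r >= 0) {
        return "no error";
    }
    // The logged value is negative, so flip the sign before the lookup.
    return std::strerror(-r);
}

// True when the code matches the -ENFILE reply that the quoted excerpt of
// ReplicatedPG::do_op sends for a read on an object marked lost.
bool is_lost_object_reply(int r) {
    return r == -ENFILE;
}
```

Passing the logged value to `is_lost_object_reply` is a quick way to check whether a given "recovery error!" line matches the lost-object path rather than some other OSD failure.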

Since then, the mds always asserts shortly after it becomes active. From what I can see in the logs, some files are marked "lost" even though the mon cluster is healthy.

mds tries to recover a file:
  void MDCache::do_file_recover()

osd processes request:
  void ReplicatedPG::do_op(OpRequestRef op)
    if ((op->may_read()) && (obc->obs.oi.is_lost()))
      osd->reply_op_error(op, -ENFILE);

server sees error:
  void MDCache::_recovered(CInode *in, int r, uint64_t size, utime_t mtime)
    if (r != 0) {
      assert(0 == "unexpected error from osd during recovery");
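
The chain above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not Ceph source: `ObjectState`, `osd_do_read`, `mds_on_recovered`, and `recover_file` are hypothetical stand-ins for the quoted functions, and the MDS side returns a bool instead of asserting so that the failing path can be observed without aborting.

```cpp
#include <cerrno>

// Hypothetical stand-in for the per-object state that the excerpt reads
// through obc->obs.oi.
struct ObjectState {
    bool lost;
};

// Models the check quoted from ReplicatedPG::do_op: a read on a lost object
// is answered with -ENFILE. Every other read succeeds in this toy model.
int osd_do_read(const ObjectState& obj) {
    if (obj.lost) {
        return -ENFILE;
    }
    return 0;
}

// Models the check quoted from MDCache::_recovered: the quoted code asserts
// on any nonzero result. Here that case returns false instead of aborting.
bool mds_on_recovered(int r) {
    return r == 0;
}

// One recovery attempt end to end. A true result means the MDS would
// get past the recovery step for this object.
bool recover_file(const ObjectState& obj) {
    return mds_on_recovered(osd_do_read(obj));
}
```

In this model, `recover_file(ObjectState{true})` returns false for as long as the object stays marked lost, which matches the observation that the mds asserts again on every startup.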

I attached osd and mds logs for one startup. Look at everything related to inode 100001329e3. The rest of the logs show that many more inodes are affected. I suspect that only files that were being written to are affected, but it is unclear why they are marked as "lost", since the PGs are all healthy.


Files

cephbug.tar.bz2 (839 KB) Markus Blank-Burian, 11/12/2013 10:39 AM

Actions #1

Updated by Markus Blank-Burian over 10 years ago

.. upgrading from 0.67.4 to 0.72 ..

Actions #2

Updated by Zheng Yan over 10 years ago

I think the mds will function again after all lost objects are deleted.

Actions #3

Updated by Zheng Yan over 10 years ago

  • Status changed from New to Duplicate

Duplicate of #6761.
