Bug #6755: mds assert soon after startup (while recovering) - Ceph - Ceph

closed

Added by Markus Blank-Burian over 10 years ago. Updated over 10 years ago.

Status:

Duplicate

Priority:

High

Source:

Community (user)

Severity:

2 - major

Description

I debugged this one a bit, so I'll try to summarize what I could gather. I was in the process of upgrading from 0.64.4 to 0.72. The mon and mds had already been upgraded, along with some of the OSDs. Then the mds crashed with a strange error message:

2013-11-12 14:21:08.880249 7f2ae8ef6700 0 mds.0.cache recovery error! -23
2013-11-12 14:21:08.887905 7f2ae8ef6700 -1 mds/MDCache.cc: In function 'void MDCache::_recovered(CInode*, int, uint64_t, utime_t)' thread 7f2ae8ef6700 time 2013-11-12 14:21:08.880291
mds/MDCache.cc: 5808: FAILED assert(0 == "unexpected error from osd during recovery")
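
The "-23" in the first log line looks like a negated errno value; on Linux, errno 23 is ENFILE, which matches the -ENFILE returned by the OSD code path quoted further down. A minimal sketch for decoding such a value follows. `decode_recovery_error` is a hypothetical helper for illustration, not part of Ceph, and the 23 == ENFILE mapping assumes a Linux system.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper (not part of Ceph): turn a negative return code, such
// as the -23 in "recovery error! -23", into a readable description.
std::string decode_recovery_error(int r) {
    if (r >= 0) {
        return "no error";
    }
    const int err = -r;  // the code is assumed to be a negated errno value
    const std::string name =
        (err == ENFILE) ? std::string("ENFILE")
                        : "errno " + std::to_string(err);
    // std::strerror gives the platform's message text for the errno value.
    return name + ": " + std::strerror(err);
}
```

Calling `decode_recovery_error(-23)` on Linux should yield a string starting with "ENFILE:", followed by the platform's message for that errno.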

Since then, the mds always asserts soon after it becomes active. From what I can see in the logs, some files are somehow marked "lost" even though the mon cluster is healthy.

mds tries to recover a file:
    void MDCache::do_file_recover()

osd processes request:
    void ReplicatedPG::do_op(OpRequestRef op)
        if ((op->may_read()) && (obc->obs.oi.is_lost()))
            osd->reply_op_error(op, -ENFILE);

server sees error:
    void MDCache::_recovered(CInode *in, int r, uint64_t size, utime_t mtime)
        if (r != 0) {
            assert(0 == "unexpected error from osd during recovery");
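
The three steps above can be condensed into a toy model. This is only a sketch of the control flow as described in this report: `ObjectInfo`, `osd_read`, and `mds_recovered_ok` are hypothetical stand-ins, not Ceph's actual types or functions, and the real MDS aborts via assert where this sketch returns false.

```cpp
#include <cerrno>

// Hypothetical stand-in for the OSD's per-object state.
struct ObjectInfo {
    bool lost = false;
};

// Models the OSD check quoted above: a read on an object flagged "lost"
// is answered with -ENFILE instead of data.
int osd_read(const ObjectInfo& oi) {
    if (oi.lost) {
        return -ENFILE;
    }
    return 0;
}

// Models the MDS side: any non-zero result from the OSD is treated as
// unexpected. The real code asserts at this point; the sketch returns
// false so the outcome can be inspected.
bool mds_recovered_ok(int r) {
    return r == 0;
}
```

Under these assumptions, `mds_recovered_ok(osd_read(oi))` is false for any object with `lost` set, which would be consistent with the mds asserting again on every startup for as long as the flag remains.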

I attached osd and mds logs for one startup. Look at everything related to inode 100001329e3. From the rest of the logs, I can see that many more inodes are affected. I suspect that only files which were being written to are affected, but it is unclear why they are marked as "lost", since the PGs are all healthy.

Files

cephbug.tar.bz2 (839 KB)

Markus Blank-Burian, 11/12/2013 10:39 AM