Bug #1775: mds startup: _replay journaler got error -22, aborting, possible regression? - CephFS - Ceph

closed

Added by Szymon Szypulski over 12 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal

Description

Ubuntu Natty, kernel 3.2-rc2, Ceph 0.38 (stable from git) with the patch from #1756 and the workaround for #1757.

setup
s1: mds, osd, mon
s2: mds, osd, mon
s3: mon

In the middle of copying data (Sage had suggested wiping out the cluster, see #1757), both MDS daemons crashed as shown in the attached logs. It looks similar to #805 and #873, but those were fixed.

Files

mds.backup1.log (3.21 KB): short log. Szymon Szypulski, 12/01/2011 12:24 AM
mds.backup1.log.bz2 (11 MB): full log (debug mds = 20, debug journaler = 20). Szymon Szypulski, 12/01/2011 12:24 AM
journal.mds0.bz2 (4.2 MB): backup0 journal. Szymon Szypulski, 12/01/2011 08:41 AM
mds.backup1.log.1.gz (9.33 KB). Szymon Szypulski, 12/01/2011 09:07 AM
mds.backup2.log.1.gz (339 KB). Szymon Szypulski, 12/01/2011 09:07 AM

Updated by Sage Weil over 12 years ago

Category set to 1
Assignee set to Sage Weil
Target version set to v0.40

Can you dump the mds journal so we can get a closer look at the corruption? Something like

ceph-mds -i foo --dump-journal 0 /tmp/journal.mds0

Also, did you have any OSD logging enabled at the time of the crash?

Updated by Szymon Szypulski over 12 years ago

No, I didn't have OSD logging enabled. I'll provide you with the journal in a few minutes.

Updated by Szymon Szypulski over 12 years ago

File journal.mds0.bz2 added

Updated by Szymon Szypulski over 12 years ago

File mds.backup1.log.1.gz added
File mds.backup2.log.1.gz added

Updated by Sage Weil over 12 years ago

Stick a

continue;

after the set_read_pos() call to avoid the second crash.
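
To illustrate the suggestion, here is a minimal, self-contained sketch of why a `continue;` after repositioning the read pointer can matter. It is not the Ceph code: `ToyEntry`, `ToyJournal`, and `replay()` are hypothetical names, and only `set_read_pos()` and `continue;` come from the comment above. The sketch assumes the replay loop hits an entry it cannot decode, moves the read position past it, and must then skip the rest of the loop body for that iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; not Ceph's Journaler or MDLog.
struct ToyEntry {
    bool decodable;       // false models a corrupt or missing journal entry
    std::string payload;
};

struct ToyJournal {
    std::vector<ToyEntry> entries;
    std::size_t read_pos = 0;

    bool readable() const { return read_pos < entries.size(); }

    void set_read_pos(std::size_t pos) {
        assert(pos <= entries.size());
        read_pos = pos;
    }
};

// Replays the journal and returns the payloads that were applied, in order.
std::vector<std::string> replay(ToyJournal& j) {
    std::vector<std::string> applied;
    while (j.readable()) {
        const ToyEntry& e = j.entries[j.read_pos];
        if (!e.decodable) {
            // Skip past the bad entry, then restart the loop.
            j.set_read_pos(j.read_pos + 1);
            continue;  // without this, the bad entry would be applied below
        }
        applied.push_back(e.payload);
        j.set_read_pos(j.read_pos + 1);
    }
    return applied;
}
```

In this toy model, a journal holding "a", an undecodable entry, and "b" replays as "a" then "b", and the read position ends at 3. Whether the real replay loop has the same shape is an assumption.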

Updated by Sage Weil over 12 years ago

Status changed from New to Need More Info

Without OSD logs, it's hard to say, but it looks like something caused the OSD to drop a write (or a series of writes). There are no msgr failures in the MDS log.

Improving msgr QA coverage will help eliminate that possible cause.

Updated by Sage Weil over 12 years ago

Assignee deleted (Sage Weil)

Updated by Sage Weil over 12 years ago

Target version deleted (v0.40)

Updated by Sage Weil over 11 years ago

Status changed from Need More Info to Resolved

Chalking this up to a msgr failure caused by one of the zillions of bugs we've fixed in the last few months.

Updated by John Spray over 7 years ago

Project changed from Ceph to CephFS
Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
