Bug #1775

closed

mds startup: _replay journaler got error -22, aborting, possible regression?

Added by Szymon Szypulski over 12 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%


Description

Ubuntu Natty, kernel 3.2-rc2, Ceph 0.38 (stable from git) with the patch from #1756 and a workaround for #1757.

Setup:
s1: mds, osd, mon
s2: mds, osd, mon
s3: mon

In the middle of copying (Sage suggested wiping out the cluster, see #1757), both MDS daemons crashed as shown in the logs. This looks similar to #805 and #873, but those were fixed.
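
For anyone decoding the error in the title: Ceph reports failures as negated errno values, so -22 corresponds to errno 22, which is EINVAL on Linux. A minimal sketch for looking up such a code follows; the helper name `describe_return_code` is made up for illustration and is not part of Ceph.

```python
import errno
import os


def describe_return_code(rc):
    """Map a negated-errno return code (e.g. -22) to its symbolic name and message.

    Hypothetical helper for illustration only; not part of Ceph.
    """
    code = -rc  # return codes are negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_return_code(-22)
print(name, "-", message)
```

This only identifies the error class. It does not say which journal entry or field the journaler rejected; that has to come from the debug logs or the journal dump.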


Files

mds.backup1.log (3.21 KB): short log. Szymon Szypulski, 12/01/2011 12:24 AM
mds.backup1.log.bz2 (11 MB): full log (debug mds = 20, debug journaler = 20). Szymon Szypulski, 12/01/2011 12:24 AM
journal.mds0.bz2 (4.2 MB): backup0 journal. Szymon Szypulski, 12/01/2011 08:41 AM
mds.backup1.log.1.gz (9.33 KB). Szymon Szypulski, 12/01/2011 09:07 AM
mds.backup2.log.1.gz (339 KB). Szymon Szypulski, 12/01/2011 09:07 AM
Actions #1

Updated by Sage Weil over 12 years ago

  • Category set to 1
  • Assignee set to Sage Weil
  • Target version set to v0.40

Can you dump the mds journal so we can get a closer look at the corruption? Something like

ceph-mds -i foo --dump-journal 0 /tmp/journal.mds0

Also, did you have any OSD logging enabled at the time of the crash?

Actions #2

Updated by Szymon Szypulski over 12 years ago

No, I didn't have OSD logging enabled. I'll provide the journal in a few minutes.

Actions #5

Updated by Sage Weil over 12 years ago

stick a

continue;

after the set_read_pos() call to avoid the second crash.
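
One plausible reading of that suggestion, as a toy model rather than the actual MDS replay code: once the read position has been moved past a bad entry, the loop has to skip the rest of its body, otherwise it goes on to process an entry it never decoded. Everything below is hypothetical (the `replay` function, the `resync_pos` callback, and using `None` to mark an unreadable entry); only the idea of repositioning and then continuing comes from the comment.

```python
def replay(entries, resync_pos):
    """Toy replay loop; None marks an entry that could not be read.

    Hypothetical illustration only, not Ceph code.
    """
    applied = []
    pos = 0
    while pos < len(entries):
        entry = entries[pos]
        if entry is None:
            # Stand-in for the set_read_pos() call: jump to a new read position.
            pos = resync_pos(pos)
            # Without this, the code below would run on the unreadable entry.
            continue
        applied.append(entry.upper())  # stand-in for applying the entry
        pos += 1
    return applied


# Skip exactly one slot whenever an unreadable entry is hit.
print(replay(["a", None, "b"], lambda pos: pos + 1))
```

Removing the `continue` in this sketch makes the loop call `.upper()` on `None` and raise, which is the toy equivalent of a second crash after the first error was already handled.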

Actions #6

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info

Without logs, it's hard to say, but it looks like something caused the OSD to drop a write (or series of writes). No msgr failures in the log.

Improving msgr qa coverage will help eliminate that possible cause.

Actions #7

Updated by Sage Weil over 12 years ago

  • Assignee deleted (Sage Weil)
Actions #8

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
Actions #9

Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to Resolved

Chalking this up to a msgr failure caused by one of the zillions of bugs we've fixed in the last few months.

Actions #10

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
