Bug #1775

mds startup: _replay journaler got error -22, aborting, possible regression?

Added by Szymon Szypulski over 12 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Description

Ubuntu Natty, kernel 3.2-rc2, ceph 0.38 (stable from git) with the patch from #1756 and the workaround for #1757

setup
s1: mds, osd, mon
s2: mds, osd, mon
s3: mon

In the middle of copying (Sage suggested wiping out the cluster, see #1757), both mds daemons crashed as shown in the logs. This looks similar to #805 and #873, but those were fixed.

mds.backup1.log - short log (3.21 KB) Szymon Szypulski, 12/01/2011 12:24 AM

mds.backup1.log.bz2 - full log (debug mds = 20, debug journaler = 20) (11 MB) Szymon Szypulski, 12/01/2011 12:24 AM

journal.mds0.bz2 - backup0 journal (4.2 MB) Szymon Szypulski, 12/01/2011 08:41 AM

mds.backup1.log.1.gz (9.33 KB) Szymon Szypulski, 12/01/2011 09:07 AM

mds.backup2.log.1.gz (339 KB) Szymon Szypulski, 12/01/2011 09:07 AM

History

#1 Updated by Sage Weil over 12 years ago

  • Category set to 1
  • Assignee set to Sage Weil
  • Target version set to v0.40

Can you dump the mds journal so we can get a closer look at the corruption? Something like

ceph-mds -i foo --dump-journal 0 /tmp/journal.mds0

Also, did you have any OSD logging enabled at the time of the crash?

#2 Updated by Szymon Szypulski over 12 years ago

No, I didn't have OSD logging enabled. I'll provide you with the journal in a few minutes.

#5 Updated by Sage Weil over 12 years ago

stick a

continue;

after the set_read_pos() call to avoid the second crash.
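
To illustrate why a `continue;` after repositioning the read pointer matters, here is a toy model of a replay loop. It is not the Ceph source: `ToyJournal`, `has_error()`, `read_entry()` and `replay()` are hypothetical names made up for this sketch, and only `set_read_pos()` is borrowed from the comment above. The assumption is that, without the `continue;`, the loop body would fall through and try to process an entry that was never read successfully.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a journal reader; NOT the Ceph Journaler API.
struct ToyJournal {
    // std::nullopt marks a corrupt (unreadable) entry.
    std::vector<std::optional<std::string>> entries;
    std::size_t read_pos = 0;

    bool at_end() const { return read_pos >= entries.size(); }
    bool has_error() const {
        return !at_end() && !entries[read_pos].has_value();
    }
    void set_read_pos(std::size_t pos) { read_pos = pos; }
    // Precondition: the entry at read_pos exists and is readable.
    std::string read_entry() { return *entries[read_pos++]; }
};

// Replays every readable entry and skips corrupt ones.
std::vector<std::string> replay(ToyJournal& j) {
    std::vector<std::string> applied;
    while (!j.at_end()) {
        if (j.has_error()) {
            // Step past the bad entry...
            j.set_read_pos(j.read_pos + 1);
            // ...then restart the loop so the end-of-journal and error
            // checks run again. Falling through here would call
            // read_entry() on a position that was never validated.
            continue;
        }
        applied.push_back(j.read_entry());
    }
    return applied;
}
```

In this sketch, dropping the `continue;` makes `read_entry()` run right after the skip, which is undefined behavior when the next entry is also corrupt or the journal has ended. Whether that matches the exact failure seen in this report is an assumption.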

#6 Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info

Without logs, it's hard to say, but it looks like something caused the OSD to drop a write (or series of writes). No msgr failures in the log.

Improving msgr qa coverage will help eliminate that possible cause.

#7 Updated by Sage Weil over 12 years ago

  • Assignee deleted (Sage Weil)

#8 Updated by Sage Weil about 12 years ago

  • Target version deleted (v0.40)

#9 Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to Resolved

Chalking this up to a msgr failure caused by one of the zillions of bugs we've fixed in the last few months.

#10 Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
