Bug #451

closed

mds: replay error

Added by Sage Weil over 13 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description


2010-10-01 03:30:59.816017 7f991dcf7710 mds0.journaler try_read_entry at 822192580 reading 822192580~4 (have 160437)
2010-10-01 03:30:59.816022 7f991dcf7710 mds0.journaler try_read_entry got 0 len entry at offset 822192580
2010-10-01 03:30:59.816026 7f991dcf7710 mds0.log _replay journaler got error -22, aborting
2010-10-01 03:30:59.816030 7f991dcf7710 mds0.log _replay_thread kicking waiters
2010-10-01 03:30:59.816034 7f991dcf7710 mds0.18 boot_start encountered an error, failing
2010-10-01 03:30:59.816038 7f991dcf7710 mds0.18 suicide.  wanted up:replay, now down:dne
2010-10-01 03:30:59.816047 7f991dcf7710 mds0.cache WARNING: mdcache shutdown with non-empty cache
2010-10-01 03:30:59.816052 7f991dcf7710 mds0.cache show_subtrees
2010-10-01 03:30:59.816059 7f991dcf7710 mds0.cache |__ 0    auth [dir 1 / [2,head] auth v=327474 cv=0/0 dir_auth=0 state=1610612736 f(v5 m2010-09-30 10:42:48.462366 5=0+5) n(v5922 rc2010-10-01 02:32:18.156019 b129532745664 784=23+761)/n(v5922 rc2010-10-01 02:32:18.156019 b129521095616 784=23+761) hs=1+0,ss=0+0 dirty=1 | child subtree dirty 0x7f98e8006ae0]
2010-10-01 03:30:59.816078 7f991dcf7710 mds0.cache |__ 0    auth [dir 100 ~mds0/ [2,head] auth v=795 cv=0/0 dir_auth=0 state=1610612736 f(v0 2=1+1) n(v54 rc2010-10-01 02:07:56.696257 b1883380490265 372=365+7) hs=1+0,ss=0+0 dirty=1 | child subtree dirty 0x7f98e80070f0]
2010-10-01 03:30:59.816094 7f991dcf7710 mds0.log _replay_thread finish
2010-10-01 03:30:59.816129 7f992d903710 mds0.18 handle_mds_beacon up:replay seq 3 rtt 7.483578
2010-10-01 03:30:59.816571 7f9930405720 7f9930405720 stopped.

Henry, can you post the full mds log somewhere? I'm curious what the mds log/journal bounds are in the header, and how close we are to that.

IIRC at one point I was concerned about the journal replay's ability to detect an incomplete write at the end of the journal when an entry crosses an object boundary. Not sure if that's what's going on here...
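To make the concern concrete, here is a minimal sketch in Python (not the actual Ceph journaler code; the 4 MB object size and the end-of-journal heuristic are assumptions) of the decision the reader faces when the 4-byte length prefix it fetches comes back as zeros:

import struct

OBJECT_SIZE = 4 * 1024 * 1024   # hypothetical journal object size
PREFIX_LEN = 4                  # 4-byte length prefix, as in "reading 822192580~4" above

def try_read_entry(journal, pos, write_pos):
    # `journal` stands in for the backing objects as a single bytes buffer;
    # an unwritten tail reads back as zeros, like a partially written object.
    prefix = journal[pos:pos + PREFIX_LEN]
    if len(prefix) < PREFIX_LEN:
        return None, "short read"
    (entry_len,) = struct.unpack("<I", prefix)
    if entry_len == 0:
        # This is what the log shows at offset 822192580: a zero length prefix.
        # If it happens in the last (possibly half-written) object before
        # write_pos, treating it as end-of-journal lets replay finish cleanly;
        # anywhere else it looks like corruption, hence the -22 abort above.
        if write_pos - pos < OBJECT_SIZE:
            return None, "end of journal"
        return None, "error -22"
    return journal[pos + PREFIX_LEN:pos + PREFIX_LEN + entry_len], None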

Actions #2

Updated by Sage Weil over 13 years ago

Henry Chang wrote:

I put the full log on:

http://veqrya.bay.livefilestore.com/y1p7N29j9-l0ihB4N3FXU2n9__9LyGY8Keb_yXs1KQFQD4zRyGPRL8GTukp2hI2nHFxy92YRBcT3hru1_XxsdY7wIzmc8tKiDd-/mds.1.log.gz?download&psid=1

Let me know if you cannot access it. Thanks.

Hmm... yeah the link doesn't work :(

You could also stick it on your gateway machine and I can grab it from there?

Actions #3

Updated by Henry Chang over 13 years ago

OK.. I've put it on the gateway machine: /tmp/ceph_logs/mds.1.log.gz

Actions #4

Updated by Sage Weil over 13 years ago

Henry Chang wrote:

OK.. I've put it on the gateway machine: /tmp/ceph_logs/mds.1.log.gz

Got it, thanks.

Okay, it looks like it's near the end of the mds log, so most likely we can just declare end-of-journal at that point instead of throwing an error and it'll recover, but I want to figure out how/why it got into this situation. This isn't the asgc cluster, I take it? Is it one I can access?

I'd like to look at the actual contents of the mds journal to see where the zeros came from. You can grab individual objects with the 'rados' tool (rados -p metadata get 200.0000000c4 /tmp/foo or similar). If you add 'debug ms = 1' to the [mds] section and restart the mds, the log will show which objects it's reading from. 'debug journaler = 20' might also be helpful.
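For example, one rough way to do that (the object name below is just the one mentioned above, and the 4 KB zero-run threshold is arbitrary) is to fetch an object and scan it for long runs of zero bytes:

import subprocess

obj = "200.0000000c4"            # substitute whichever objects cover the replay offset
out = "/tmp/" + obj

# 'rados -p metadata get <object> <file>' as suggested above
subprocess.run(["rados", "-p", "metadata", "get", obj, out], check=True)

with open(out, "rb") as f:
    data = f.read()

start, run = None, 0
for i, b in enumerate(data):
    if b == 0:
        if start is None:
            start = i
        run += 1
    else:
        if run >= 4096:
            print("zero run at offset %d, length %d" % (start, run))
        start, run = None, 0
if run >= 4096:
    print("zero run at offset %d, length %d" % (start, run))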

Actions #5

Updated by Henry Chang over 13 years ago

Uhh... sorry, I thought the log would be enough, so I re-deployed the cluster and destroyed everything...

Actions #6

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce
Actions #7

Updated by Henry Chang over 13 years ago

<removed sensitive info>

Actions #8

Updated by Sage Weil over 13 years ago

  • Status changed from Can't reproduce to In Progress
  • Assignee set to Sage Weil
  • Target version set to v0.24
Actions #9

Updated by Sage Weil over 13 years ago

  • Status changed from In Progress to Closed
  • Target version deleted (v0.24)

It looks like the cluster is running pretty old code (0.22~rc). This particular problem was fixed by 1c934ebd (0.23).

Actions #10

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
