Bug #451

closed

mds: replay error

Added by Sage Weil over 13 years ago. Updated over 7 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description


2010-10-01 03:30:59.816017 7f991dcf7710 mds0.journaler try_read_entry at 822192580 reading 822192580~4 (have 160437)
2010-10-01 03:30:59.816022 7f991dcf7710 mds0.journaler try_read_entry got 0 len entry at offset 822192580
2010-10-01 03:30:59.816026 7f991dcf7710 mds0.log _replay journaler got error -22, aborting
2010-10-01 03:30:59.816030 7f991dcf7710 mds0.log _replay_thread kicking waiters
2010-10-01 03:30:59.816034 7f991dcf7710 mds0.18 boot_start encountered an error, failing
2010-10-01 03:30:59.816038 7f991dcf7710 mds0.18 suicide.  wanted up:replay, now down:dne
2010-10-01 03:30:59.816047 7f991dcf7710 mds0.cache WARNING: mdcache shutdown with non-empty cache
2010-10-01 03:30:59.816052 7f991dcf7710 mds0.cache show_subtrees
2010-10-01 03:30:59.816059 7f991dcf7710 mds0.cache |__ 0    auth [dir 1 / [2,head] auth v=327474 cv=0/0 dir_auth=0 state=1610612736 f(v5 m2010-09-30 10:42:48.462366 5=0+5) n(v5922 rc2010-10-01 02:32:18.156019 b129532745664 784=23+761)/n(v5922 rc2010-10-01 02:32:18.156019 b129521095616 784=23+761) hs=1+0,ss=0+0 dirty=1 | child subtree dirty 0x7f98e8006ae0]
2010-10-01 03:30:59.816078 7f991dcf7710 mds0.cache |__ 0    auth [dir 100 ~mds0/ [2,head] auth v=795 cv=0/0 dir_auth=0 state=1610612736 f(v0 2=1+1) n(v54 rc2010-10-01 02:07:56.696257 b1883380490265 372=365+7) hs=1+0,ss=0+0 dirty=1 | child subtree dirty 0x7f98e80070f0]
2010-10-01 03:30:59.816094 7f991dcf7710 mds0.log _replay_thread finish
2010-10-01 03:30:59.816129 7f992d903710 mds0.18 handle_mds_beacon up:replay seq 3 rtt 7.483578
2010-10-01 03:30:59.816571 7f9930405720 7f9930405720 stopped.

Henry, can you post the full mds log somewhere? I'm curious what the mds log/journal bounds are in the header, and how close we are to that.

IIRC at one point I was concerned about the journal replay's ability to detect an incomplete write at the end of the journal when an entry crosses an object boundary. Not sure if that's what's going on here...
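To make the concern concrete, here is a minimal sketch in Python (not the actual Ceph journaler code; the 4 MB object size and the end-of-journal heuristic are assumptions) of the decision the reader faces when the 4-byte length prefix it fetches comes back as zeros:

import struct

OBJECT_SIZE = 4 * 1024 * 1024   # hypothetical journal object size
PREFIX_LEN = 4                  # 4-byte length prefix, as in "reading 822192580~4" above

def try_read_entry(journal, pos, write_pos):
    # `journal` stands in for the backing objects as a single bytes buffer;
    # an unwritten tail reads back as zeros, like a partially written object.
    prefix = journal[pos:pos + PREFIX_LEN]
    if len(prefix) < PREFIX_LEN:
        return None, "short read"
    (entry_len,) = struct.unpack("<I", prefix)
    if entry_len == 0:
        # This is what the log shows at offset 822192580: a zero length prefix.
        # If it happens in the last (possibly half-written) object before
        # write_pos, treating it as end-of-journal lets replay finish cleanly;
        # anywhere else it looks like corruption, hence the -22 abort above.
        if write_pos - pos < OBJECT_SIZE:
            return None, "end of journal"
        return None, "error -22"
    return journal[pos + PREFIX_LEN:pos + PREFIX_LEN + entry_len], None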

Actions #2

Updated by Sage Weil over 13 years ago

Henry Chang wrote:

I put the full log on:

http://veqrya.bay.livefilestore.com/y1p7N29j9-l0ihB4N3FXU2n9__9LyGY8Keb_yXs1KQFQD4zRyGPRL8GTukp2hI2nHFxy92YRBcT3hru1_XxsdY7wIzmc8tKiDd-/mds.1.log.gz?download&psid=1

Let me know if you cannot access it. Thanks.

Hmm... yeah the link doesn't work :(

You could also stick it on your gateway machine and I can grab it from there?

Actions #3

Updated by Henry Chang over 13 years ago

OK.. I've put it on the gateway machine: /tmp/ceph_logs/mds.1.log.gz

Actions #4

Updated by Sage Weil over 13 years ago

Henry Chang wrote:

OK.. I've put it on the gateway machine: /tmp/ceph_logs/mds.1.log.gz

Got it, thanks.

Okay, it looks like it's near the end of the mds log, so most likely we can just declare end-of-journal at that point instead of throwing an error and it'll recover, but I want to figure out how/why it got into this situation. This isn't the asgc cluster, I take it? Is it one I can access?

I'd like to look at the actual contents of the mds journal to see where the zeros came from. You can grab individual objects with the 'rados' tool (rados -p metadata get 200.0000000c4 /tmp/foo or similar). If you add 'debug ms = 1' to the [mds] section and restart the mds, the log will show which objects it's reading from. 'debug journaler = 20' might also be helpful.
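For example, one rough way to do that (the object name below is just the one mentioned above, and the 4 KB zero-run threshold is arbitrary) is to fetch an object and scan it for long runs of zero bytes:

import subprocess

obj = "200.0000000c4"            # substitute whichever objects cover the replay offset
out = "/tmp/" + obj

# 'rados -p metadata get <object> <file>' as suggested above
subprocess.run(["rados", "-p", "metadata", "get", obj, out], check=True)

with open(out, "rb") as f:
    data = f.read()

start, run = None, 0
for i, b in enumerate(data):
    if b == 0:
        if start is None:
            start = i
        run += 1
    else:
        if run >= 4096:
            print("zero run at offset %d, length %d" % (start, run))
        start, run = None, 0
if run >= 4096:
    print("zero run at offset %d, length %d" % (start, run))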

Actions #5

Updated by Henry Chang over 13 years ago

Uhh... sorry, I thought the log would be enough, so I re-deployed the cluster and destroyed everything...

Actions #6

Updated by Sage Weil over 13 years ago

  • Status changed from New to Can't reproduce
Actions #7

Updated by Henry Chang over 13 years ago

<removed sensitive info>

Actions #8

Updated by Sage Weil over 13 years ago

  • Status changed from Can't reproduce to In Progress
  • Assignee set to Sage Weil
  • Target version set to v0.24
Actions #9

Updated by Sage Weil over 13 years ago

  • Status changed from In Progress to Closed
  • Target version deleted (v0.24)

It looks like the cluster is running pretty old code (0.22~rc). This particular problem was fixed by 1c934ebd (0.23).

Actions #10

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
