Bug #4618

closed

Journaler: _is_readable() and _prefetch() don't communicate correctly

Added by Andras Elso about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Severity:
3 - minor

Description

The Journaler has mechanisms to try to read extra data when an event is large enough to exceed the current prefetch window, but they don't work properly. _is_readable() sets the temp_fetch_len member based on how much more data it needs, but _prefetch() interprets it as an offset from read_pos. Hitting this bug requires some pretty large Events in the log, and there's an easy workaround (increase the prefetch period config setting), but it's hard to diagnose and a bit embarrassing. ;)
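
A minimal illustration of that mismatch (a self-contained sketch, not the actual Journaler code; the figures are taken from the comments below, and only read_pos, read_buf and temp_fetch_len are real member names):

#include <cstdint>
#include <iostream>

int main() {
  // Rough figures from comment #24: ~38MB buffered, next entry >40MB.
  std::uint64_t read_pos       = 67811200485ULL;  // replay position (comment #19)
  std::uint64_t buffered       = 38ULL << 20;     // read_buf.length(): bytes already buffered
  std::uint64_t next_entry_len = 40ULL << 20;     // size of the next journal entry

  // _is_readable(): records how much MORE data is needed, relative to the buffer.
  std::uint64_t temp_fetch_len = sizeof(std::uint64_t) + next_entry_len - buffered;

  // _prefetch(): treats temp_fetch_len as if it were measured from read_pos,
  // so it compares against data that has already been fetched.
  std::uint64_t target    = read_pos + temp_fetch_len;
  std::uint64_t requested = read_pos + buffered;

  if (target <= requested)
    std::cout << "no new read issued -> replay waits forever" << std::endl;
  else
    std::cout << "would fetch " << (target - requested) << " more bytes" << std::endl;
  return 0;
}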

I haven't got any patches to fix this issue, yet.

Original bug:

Our MDS replays the journal forever and doesn't want to become active.
There are 5 mons and 5 OSDs.
max_mds is 2.

'ceph-mds2' mds.0.272 up:resolve seq 12
'ceph-mds1' mds.1.241 up:replay seq 25

The logs are here: http://elbandi.net/ceph/
Version everywhere: 0.56.4

Actions #1

Updated by Ian Colle about 11 years ago

  • Assignee set to Greg Farnum
  • Priority changed from Normal to High
  • Source changed from Development to Community (dev)
Actions #2

Updated by Greg Farnum about 11 years ago

  • Priority changed from High to Normal

I'll check my assumptions today (already downloaded the logs), but with multiple active MDSes this doesn't warrant a "High" right now.

Actions #3

Updated by Andras Elso about 11 years ago

I changed back to max_mds 1. Same result:

dumped mdsmap epoch 49925
epoch   49925
flags   0
created 2013-02-27 17:00:00.919407
modified        2013-04-02 21:35:43.060313
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    49925
last_failure_osd_epoch  4410
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in      0,1
up      {0=20876,1=20735}
failed
stopped
data_pools      [0,5,12]
metadata_pool   1
20876:  x.x.34.82:6800/15681 'ceph-mds1' mds.0.287 up:resolve seq 2
20735:  x.x.34.83:6800/16842 'ceph-mds2' mds.1.252 up:replay seq 1

Actions #4

Updated by Greg Farnum about 11 years ago

Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?

Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.

Actions #5

Updated by Andras Elso about 11 years ago

Greg Farnum wrote:

Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?

Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.

The logs are big: the first is ~3G, the second ~200M. The config was "debug mds = 20". There were ~200-300k files on cephfs.

I created new logs with "debug ms = 1" and "debug mds = 20"; see the 2_* files at the link.

Actions #6

Updated by Greg Farnum about 11 years ago

Sorry, I meant the mds journal, not the debug logs, when referring to the size.

Actions #7

Updated by Andras Elso about 11 years ago

Greg Farnum wrote:

Sorry, I meant the mds journal, not the debug logs, when referring to the size.

So the mds journal doesn't get "committed" to a final state?

What can I do? Do you need more asserts/debug lines somewhere?

Actions #8

Updated by Greg Farnum about 11 years ago

The MDS maintains a journal that it writes metadata into before committing the aggregated updates into the actual inodes. This journal is what it's "replaying" after a reboot.
The part I'm curious about is that according to your log the MDS which isn't finishing replay has a log of over a gigabyte, which is larger than it should normally be allowed to get (your other MDS has a log of 167 MB, which is a bit long but in the range of possibility). Can you upload or pastebin:
1) The output of ceph -s.
2) The output of your MDS config dump from one of your daemons (see http://ceph.com/docs/master/rados/configuration/ceph-conf/#viewing-a-configuration-at-runtime).
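
(For reference, the runtime config dump described in the linked docs is taken over the daemon's admin socket, roughly like this; the exact socket path depends on your installation:)

ceph --admin-daemon /var/run/ceph/ceph-mds.<name>.asok config show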

Actions #9

Updated by Andras Elso about 11 years ago

Here is the status: http://pastebin.com/x1XEvuWc
Here is the config dump: http://pastebin.com/YTFbY5jW

Actions #10

Updated by Greg Farnum about 11 years ago

Thanks. (For future onlookers, the summary of those links is that everything is perfectly normal and as it should be, except an MDS is stuck in replay.)

Just a quick check of the messages indicates that the OSDs are giving responses to everything the mds requests, too, so it's not a failure to retrieve disk objects.

Actions #11

Updated by Andras Elso about 11 years ago

If you tell me (here or on IRC) where to add new debug/assert lines, we can hunt down this bug.

Actions #12

Updated by Sage Weil about 11 years ago

Don't forget #3351... if the OSD returns a short read on an object before the end of the journal, the Journaler replay code gets confused and gets stuck. Every object should be exactly 4MB except the very last one, or else!
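
A rough sketch of that condition, in case it helps anyone reading along (all names here are hypothetical; this is not the actual replay code):

#include <cstdint>

// A journal object that is not the last one should come back at the full
// object size (4MB); a short read before the journal head is the #3351 symptom.
bool suspicious_short_read(std::uint64_t object_size,    // e.g. 4MB
                           std::uint64_t bytes_returned, // what the OSD actually returned
                           std::uint64_t object_end,     // absolute end offset of this object
                           std::uint64_t write_pos)      // journal head
{
  return bytes_returned < object_size && object_end < write_pos;
}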

Actions #13

Updated by Greg Farnum about 11 years ago

Yeah, that all looks good too. My off-hand guess is that something isn't getting cleaned up properly with the slave requests, which prevents a journal trim and somehow causes whatever the hang is (which follows a slave update replay).

Actions #14

Updated by Andras Elso about 11 years ago

Here is a log excerpt with "debug journaler = 20": http://pastebin.com/nrzJg87E

Actions #15

Updated by Zheng Yan about 11 years ago

See commit 7e04504d3ed119bb43a4eb99ca524b39dc3696bc. But that bug should just make replay slow.

Actions #16

Updated by Greg Farnum about 11 years ago

Sorry, but I'm a bit lost about why that might apply here. Are you just speculating or did something in the logs look like this issue?
(In particular it sure looks to me like replay has just stopped, but I can check for odd mds messages as well.)

Actions #17

Updated by Sage Weil about 11 years ago

I think of it every time I hear "stuck in replay", that's all. I haven't looked at the logs or anything.

Actions #18

Updated by Zheng Yan about 11 years ago

You said "My off-hand guess is that something isn't getting cleaned up properly with the slave requests, which prevents a journal trim". That bug can prevent the journal from trimming, but it can't explain why journal replay stops.

Actions #19

Updated by Andras Elso about 11 years ago

Just a guess: with journaler debug, there is this line:

mds.1.journaler(ro) wait_for_readable at 67811200485 onreadable 0x20fa5e0

I added a debug line at https://github.com/ceph/ceph/blob/master/src/mds/MDLog.cc#L490 and got this:

mds.1.log _replay_thread read_pos 67811200485, write_pos 68625972210

Does this mean anything?
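
(For scale: write_pos - read_pos = 814,771,725 bytes here, i.e. roughly 777 MiB of journal between the point where replay stalled and the journal head.)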

Actions #20

Updated by Greg Farnum about 11 years ago

Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.

Actions #21

Updated by Andras Elso about 11 years ago

Greg Farnum wrote:

Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.

Yes. See the link where the others are.

Actions #22

Updated by Greg Farnum about 11 years ago

Okay, there's not a lot there, so apparently it doesn't have as much data as it thinks it needs in order to read the next entry. But it's not kicking off any attempts to read more either, which is odd. I'm checking through the prefetch logging more carefully now, but I don't think there should be any issues here since all the actual read requests it issues were answered. Can you apply the patch below and run again (with journaler debugging enabled)?

diff --git a/src/osdc/Journaler.cc b/src/osdc/Journaler.cc
index 544ac8c..734c576 100644
--- a/src/osdc/Journaler.cc
+++ b/src/osdc/Journaler.cc
@@ -882,6 +882,10 @@ bool Journaler::_is_readable()
   if (read_buf.length() >= sizeof(s) &&
       read_buf.length() >= sizeof(s) + s) 
     return true;  // yep, next entry is ready.
+  else {
+    dout(0) << "_is_readable read_buf.length() == " << read_buf.length()
+            << ", but need " << s + sizeof(s) << " for next entry" << dendl;
+  }

   // darn it!

@@ -908,6 +912,8 @@ bool Journaler::_is_readable()
             << " for len " << s << " entry" << dendl;
     temp_fetch_len = need;
   }
+
+  dout(0) << "_is_readable: not readable, returning false" << dendl;
   return false;
 }
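
For anyone following the check being instrumented above: the read buffer holds length-prefixed entries (a prefix s followed by s payload bytes), and _is_readable() asks whether the whole next entry is already buffered. A standalone simplification, using std::string in place of the real bufferlist and assuming a 64-bit length prefix:

#include <cstdint>
#include <cstring>
#include <string>

// Is the next length-prefixed entry fully present in the buffer?
bool next_entry_buffered(const std::string& read_buf) {
  std::uint64_t s = 0;
  if (read_buf.size() < sizeof(s))
    return false;                              // not even the length prefix yet
  std::memcpy(&s, read_buf.data(), sizeof(s));
  return read_buf.size() >= sizeof(s) + s;     // prefix plus full payload buffered
}
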
Actions #23

Updated by Greg Farnum about 11 years ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High
Actions #24

Updated by Greg Farnum about 11 years ago

Okay, so the next entry is >40MB and we have 38MB in our read buffer. I'm not certain, but I think our use of "temp_fetch_len" is just broken. Let's try increasing the "journaler prefetch periods" config member to 20; if that does fix it, my supposition is correct and I'll go in more carefully.
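
In ceph.conf terms that workaround would look something like this (the [mds] section placement is my assumption; 20 is the value suggested above):

[mds]
    journaler prefetch periods = 20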

Actions #25

Updated by Greg Farnum about 11 years ago

  • Subject changed from mds replay forever to Journaler: _is_readable() and _prefetch() don't communicate correctly
  • Description updated (diff)
  • Category set to 1
  • Status changed from In Progress to 12
Actions #26

Updated by Andras Elso about 11 years ago

Can I continue testing cephfs, or will you make the fix for this bug quickly so I can verify it on my system?

Actions #27

Updated by Greg Farnum about 11 years ago

  • Status changed from 12 to Fix Under Review

There were a couple of related bugs that prevented this from working right. I don't guarantee it's bug-free now, but the issue in question here is dealt with. I tested it on a vstart cluster by setting journaler prefetch periods = 1 and mds log segment size = 1024, then restarting the MDS while under load. It worked. Hurray.
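
For reference, those test settings would translate to something like the following in the vstart cluster's ceph.conf (exact section placement is my assumption):

[mds]
    journaler prefetch periods = 1
    mds log segment size = 1024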

Pushed the patch to wip-journaler-4618.

Actions #28

Updated by Greg Farnum about 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #29

Updated by Andras Elso about 11 years ago

Confirmed; I tested it on my system, and the journal-check can load the journal.

But there is a line in the commit:

ldout(cct, 0) << "_prefetch" << dendl;

Does this line need to be at debug level 0?

Actions #30

Updated by Greg Farnum about 11 years ago

Heh, no; that was supposed to be a 10. Re-pushed; thanks!

Actions #31

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Greg Farnum to Sam Lang

Said he could look at this for me today.

Actions #32

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to Resolved

Merged into next in commit 8eb5465c10840d047a894d1a4f079ff8b8d608b5. This would apply to bobtail as well if we decide we want to backport it, but my inclination is not to bother; it's not a likely case and has workarounds.
