Bug #4618
Journaler: _is_readable() and _prefetch() don't communicate correctly (Closed)
Description
The Journaler has mechanisms to try to read extra data if an event is large enough to exceed the current prefetch, but they don't work properly. _is_readable() sets the temp_fetch_len member to the amount of additional data it needs, but _prefetch() interprets that value as a length measured from read_pos. Seeing this bug requires having some pretty large Events in the log, and there's an easy workaround (increase the prefetch periods config setting), but it's hard to diagnose and a bit embarrassing. ;)
I haven't got any patches to fix this issue, yet.
Original bug:
Our MDS replays the journal forever and never goes active.
There are 5 mons and 5 OSDs.
max_mds is 2.
'ceph-mds2' mds.0.272 up:resolve seq 12
'ceph-mds1' mds.1.241 up:replay seq 25
The logs are here: http://elbandi.net/ceph/
Version everywhere: 0.56.4
Updated by Ian Colle about 11 years ago
- Assignee set to Greg Farnum
- Priority changed from Normal to High
- Source changed from Development to Community (dev)
Updated by Greg Farnum about 11 years ago
- Priority changed from High to Normal
I'll check my assumptions today (already downloaded the logs), but with multiple active MDSes this doesn't warrant a "High" right now.
Updated by Andras Elso about 11 years ago
I changed max_mds back to 1. Same result:
dumped mdsmap epoch 49925
epoch 49925
flags 0
created 2013-02-27 17:00:00.919407
modified 2013-04-02 21:35:43.060313
tableserver 0
root 0
session_timeout 60
session_autoclose 300
last_failure 49925
last_failure_osd_epoch 4410
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in 0,1
up {0=20876,1=20735}
failed
stopped
data_pools [0,5,12]
metadata_pool 1
20876: x.x.34.82:6800/15681 'ceph-mds1' mds.0.287 up:resolve seq 2
20735: x.x.34.83:6800/16842 'ceph-mds2' mds.1.252 up:replay seq 1
Updated by Greg Farnum about 11 years ago
Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?
Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?
Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.
The logs are big: the first is ~3G, the second is ~200M. The config was "debug mds = 20". There were ~200-300k files on CephFS.
I created new logs with "debug ms = 1" and "debug mds = 20"; see the 2_* files at the link.
Updated by Greg Farnum about 11 years ago
Sorry, I meant the MDS journal, not the debug logs, when referring to the size.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Sorry, I meant the MDS journal, not the debug logs, when referring to the size.
So the MDS journal wasn't committed to a final state?
What can I do? Do you need more asserts/debug lines somewhere?
Updated by Greg Farnum about 11 years ago
The MDS maintains a journal that it writes metadata into before committing the aggregated updates into the actual inodes. This journal is what it's "replaying" after a reboot.
The part I'm curious about is that according to your log the MDS which isn't finishing replay has a log of over a gigabyte, which is larger than it should normally be allowed to get (your other MDS has a log of 167 MB, which is a bit long but in the range of possibility). Can you upload or pastebin:
1) The output of ceph -s.
2) The output of your MDS config dump from one of your daemons (see http://ceph.com/docs/master/rados/configuration/ceph-conf/#viewing-a-configuration-at-runtime).
Updated by Andras Elso about 11 years ago
Here is the status: http://pastebin.com/x1XEvuWc
Here is the config dump: http://pastebin.com/YTFbY5jW
Updated by Greg Farnum about 11 years ago
Thanks. (For future onlookers, the summary of those links is that everything is perfectly normal and as it should be, except an MDS is stuck in replay.)
Just a quick check of the messages indicates that the OSDs are giving responses to everything the mds requests, too, so it's not a failure to retrieve disk objects.
Updated by Andras Elso about 11 years ago
If you tell me (here or on IRC) where to add new debug/assert lines, we can hunt down this bug.
Updated by Sage Weil about 11 years ago
Don't forget #3351... If the OSD returns a short read on an object before the end of the journal, the Journaler replay code gets confused and gets stuck. Every object should be exactly 4MB except the very last one, or else!
Updated by Greg Farnum about 11 years ago
Yeah, that all looks good too. My off-hand guess is that something isn't getting cleaned up properly with the slave requests, which leads to preventing a journal trim and somehow causes whatever the hang is (which follows a slave update replay).
Updated by Andras Elso about 11 years ago
here is a logcut with "debug journaler = 20": http://pastebin.com/nrzJg87E
Updated by Zheng Yan about 11 years ago
see commit 7e04504d3ed119bb43a4eb99ca524b39dc3696bc. But the bug should just make replay slow.
Updated by Greg Farnum about 11 years ago
Sorry, but I'm a bit lost about why that might apply here. Are you just speculating or did something in the logs look like this issue?
(In particular it sure looks to me like replay has just stopped, but I can check for odd mds messages as well.)
Updated by Sage Weil about 11 years ago
I think of it every time I hear "stuck in replay", that's all. I haven't looked at the logs or anything.
Updated by Zheng Yan about 11 years ago
You said "my off-hand guess is that something isn't getting cleaned up properly with the slave requests, which leads to preventing a journal trim". That bug can prevent the journal from trimming, but it can't explain why journal replay stops.
Updated by Andras Elso about 11 years ago
Just a guess: with journaler debug, there is a line:
mds.1.journaler(ro) wait_for_readable at 67811200485 onreadable 0x20fa5e0
I added a debug line at https://github.com/ceph/ceph/blob/master/src/mds/MDLog.cc#L490 , and I got this:
mds.1.log _replay_thread read_pos 67811200485, write_pos 68625972210
Does this mean anything?
Updated by Greg Farnum about 11 years ago
Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.
Yes. See the link where the others are.
Updated by Greg Farnum about 11 years ago
Okay, there's not a lot there, so apparently it doesn't have as much data as it thinks it needs in order to read the next entry. But it's not kicking off any attempts to read more either, which is odd. I'm checking through the prefetch logging more carefully now, but I don't think there should be any issues here, since all the actual read requests it issued were answered. Can you apply the patch below and run again (with journaler debugging enabled)?
diff --git a/src/osdc/Journaler.cc b/src/osdc/Journaler.cc
index 544ac8c..734c576 100644
--- a/src/osdc/Journaler.cc
+++ b/src/osdc/Journaler.cc
@@ -882,6 +882,10 @@ bool Journaler::_is_readable()
   if (read_buf.length() >= sizeof(s) &&
       read_buf.length() >= sizeof(s) + s)
     return true;  // yep, next entry is ready.
+  else {
+    dout(0) << "_is_readable read_buf.length() == " << read_buf.length()
+            << ", but need " << s + sizeof(s) << " for next entry" << dendl;
+  }
 
   // darn it!
@@ -908,6 +912,8 @@ bool Journaler::_is_readable()
            << " for len " << s << " entry" << dendl;
     temp_fetch_len = need;
   }
+
+  dout(0) << "_is_readable: not readable, returning false" << dendl;
   return false;
 }
Updated by Greg Farnum about 11 years ago
- Status changed from New to In Progress
- Priority changed from Normal to High
Updated by Greg Farnum about 11 years ago
Okay, so the next entry is >40MB and we have 38MB in our read buffer. I'm not certain, but I think our use of "temp_fetch_len" is just broken. Let's try increasing the "journaler prefetch periods" config option to 20; if that does fix it, my supposition is correct and I'll go in more carefully.
Updated by Greg Farnum about 11 years ago
- Subject changed from mds replay forever to Journaler: _is_readable() and _prefetch() don't communicate correctly
- Description updated (diff)
- Category set to 1
- Status changed from In Progress to 12
Updated by Andras Elso about 11 years ago
Can I continue testing CephFS, or will you make the fix for this bug quickly so I can verify it on my system?
Updated by Greg Farnum about 11 years ago
- Status changed from 12 to Fix Under Review
There were a couple related bugs which prevented this from working right. I don't guarantee it's bug-free now, but the issue in question here is dealt with. I tested it on a vstart cluster by setting journaler prefetch periods = 1 and mds log segment size = 1024, then restarting the MDS while under load. It worked. Hurray.
Pushed the patch to wip-journaler-4618.
Updated by Greg Farnum about 11 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Updated by Andras Elso about 11 years ago
Confirmed; I tested on my system, and the journal check can load the journal.
But there is a line in the commit:
ldout(cct, 0) << "_prefetch" << dendl;
Does this line need debug level 0?
Updated by Greg Farnum about 11 years ago
Heh, no; that was supposed to be a 10. Re-pushed; thanks!
Updated by Greg Farnum about 11 years ago
- Assignee changed from Greg Farnum to Sam Lang
He said he could look at this for me today.
Updated by Greg Farnum about 11 years ago
- Status changed from Fix Under Review to Resolved
Merged into next in commit:8eb5465c10840d047a894d1a4f079ff8b8d608b5. This would apply to bobtail as well if we decide we want to backport it, but my inclination is to not bother — it's not a likely case and has workarounds.