Bug #4618
Journaler: _is_readable() and _prefetch() don't communicate correctly (Closed)
Description
The Journaler has mechanisms to try to read extra data if an event is large enough to exceed the current prefetch, but they don't work properly. _is_readable() sets the temp_fetch_len member to the amount of additional data it needs, but _prefetch() interprets that value as a length measured from read_pos. Seeing this bug requires having some pretty large Events in the log, and there's an easy workaround (increase the prefetch periods config setting), but it's hard to diagnose and a bit embarrassing. ;)
I haven't got any patches to fix this issue, yet.
Original bug:
Our MDS replays the journal forever and never goes active.
There are 5 mons and 5 OSDs.
max_mds is 2.
'ceph-mds2' mds.0.272 up:resolve seq 12
'ceph-mds1' mds.1.241 up:replay seq 25
The logs are here: http://elbandi.net/ceph/
Version everywhere: 0.56.4
Updated by Ian Colle about 11 years ago
- Assignee set to Greg Farnum
- Priority changed from Normal to High
- Source changed from Development to Community (dev)
Updated by Greg Farnum about 11 years ago
- Priority changed from High to Normal
I'll check my assumptions today (already downloaded the logs), but with multiple active MDSes this doesn't warrant a "High" right now.
Updated by Andras Elso about 11 years ago
I changed max_mds back to 1. Same result:
dumped mdsmap epoch 49925
epoch 49925
flags 0
created 2013-02-27 17:00:00.919407
modified 2013-04-02 21:35:43.060313
tableserver 0
root 0
session_timeout 60
session_autoclose 300
last_failure 49925
last_failure_osd_epoch 4410
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in 0,1
up {0=20876,1=20735}
failed
stopped
data_pools [0,5,12]
metadata_pool 1
20876: x.x.34.82:6800/15681 'ceph-mds1' mds.0.287 up:resolve seq 2
20735: x.x.34.83:6800/16842 'ceph-mds2' mds.1.252 up:replay seq 1
Updated by Greg Farnum about 11 years ago
Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?
Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Strange, it looks like you have an MDS log of about 1236MB, which is...large. What config options are you setting?
Can you re-run this after adding "debug ms = 1" to your config? It looks to me like your replay just stops, which usually indicates that it's trying to get an object from an OSD that is failing to come back.
The logs are big: the first is ~3G, the second is ~200M. The config was "debug mds = 20". There were ~200-300k files on CephFS.
I created new logs with "debug ms = 1" and "debug mds = 20"; see the 2_* files at the link.
Updated by Greg Farnum about 11 years ago
Sorry, I meant the MDS journal, not the debug logs, when referring to the size.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Sorry, I meant the MDS journal, not the debug logs, when referring to the size.
So the MDS journal wasn't committed to a final state?
What can I do? Do you need more asserts/debug lines somewhere?
Updated by Greg Farnum about 11 years ago
The MDS maintains a journal that it writes metadata into before committing the aggregated updates into the actual inodes. This journal is what it's "replaying" after a reboot.
The part I'm curious about is that according to your log the MDS which isn't finishing replay has a log of over a gigabyte, which is larger than it should normally be allowed to get (your other MDS has a log of 167 MB, which is a bit long but in the range of possibility). Can you upload or pastebin:
1) The output of ceph -s.
2) The output of your MDS config dump from one of your daemons (see http://ceph.com/docs/master/rados/configuration/ceph-conf/#viewing-a-configuration-at-runtime).
Updated by Andras Elso about 11 years ago
Here is the status: http://pastebin.com/x1XEvuWc
Here is the config dump: http://pastebin.com/YTFbY5jW
Updated by Greg Farnum about 11 years ago
Thanks. (For future onlookers, the summary of those links is that everything is perfectly normal and as it should be, except an MDS is stuck in replay.)
Just a quick check of the messages indicates that the OSDs are giving responses to everything the mds requests, too, so it's not a failure to retrieve disk objects.
Updated by Andras Elso about 11 years ago
If you tell me (here or on IRC) where to add new debug/assert lines, we can hunt down this bug.
Updated by Sage Weil about 11 years ago
Don't forget #3351... If the OSD returns a short read on an object before the end of the journal, the Journaler replay code gets confused and gets stuck. Every object should be exactly 4MB except the very last one, or else!
Updated by Greg Farnum about 11 years ago
Yeah, that all looks good too. My off-hand guess is that something isn't getting cleaned up properly with the slave requests, which leads to preventing a journal trim and somehow causes whatever the hang is (which follows a slave update replay).
Updated by Andras Elso about 11 years ago
here is a logcut with "debug journaler = 20": http://pastebin.com/nrzJg87E
Updated by Zheng Yan about 11 years ago
see commit 7e04504d3ed119bb43a4eb99ca524b39dc3696bc. But the bug should just make replay slow.
Updated by Greg Farnum about 11 years ago
Sorry, but I'm a bit lost about why that might apply here. Are you just speculating or did something in the logs look like this issue?
(In particular it sure looks to me like replay has just stopped, but I can check for odd mds messages as well.)
Updated by Sage Weil about 11 years ago
I think of it every time I hear "stuck in replay", that's all. I haven't looked at the logs or anything.
Updated by Zheng Yan about 11 years ago
You said "my off-hand guess is that something isn't getting cleaned up properly with the slave requests, which leads to preventing a journal trim". That bug can prevent the journal from trimming, but it can't explain why journal replay stops.
Updated by Andras Elso about 11 years ago
Just a guess: with journaler debug, there is a line:
mds.1.journaler(ro) wait_for_readable at 67811200485 onreadable 0x20fa5e0
I added a debug line at https://github.com/ceph/ceph/blob/master/src/mds/MDLog.cc#L490 , and I got this:
mds.1.log _replay_thread read_pos 67811200485, write_pos 68625972210
Does this mean anything?
Updated by Greg Farnum about 11 years ago
Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.
Updated by Andras Elso about 11 years ago
Greg Farnum wrote:
Are those logs posted somewhere? That indicates it's waiting to be allowed to read the stuff past where it stopped, but it's not immediately clear what it would be waiting on.
Yes. See the link where the others are.
Updated by Greg Farnum about 11 years ago
Okay, there's not a lot there, so apparently it doesn't have as much data as it thinks it needs in order to read the next entry. But it's not kicking off any attempts to read more either, which is odd. I'm checking through the prefetch logging more carefully now, but I don't think there should be any issues here, since all the actual read requests it issued were answered. Can you apply the patch below and run again (with journaler debugging enabled)?
diff --git a/src/osdc/Journaler.cc b/src/osdc/Journaler.cc
index 544ac8c..734c576 100644
--- a/src/osdc/Journaler.cc
+++ b/src/osdc/Journaler.cc
@@ -882,6 +882,10 @@ bool Journaler::_is_readable()
   if (read_buf.length() >= sizeof(s) &&
       read_buf.length() >= sizeof(s) + s)
     return true;  // yep, next entry is ready.
+  else {
+    dout(0) << "_is_readable read_buf.length() == " << read_buf.length()
+            << ", but need " << s + sizeof(s) << " for next entry" << dendl;
+  }
 
   // darn it!
@@ -908,6 +912,8 @@ bool Journaler::_is_readable()
            << " for len " << s << " entry" << dendl;
     temp_fetch_len = need;
   }
+
+  dout(0) << "_is_readable: not readable, returning false" << dendl;
   return false;
 }
Updated by Greg Farnum about 11 years ago
- Status changed from New to In Progress
- Priority changed from Normal to High
Updated by Greg Farnum about 11 years ago
Okay, so the next entry is >40MB and we have 38MB in our read buffer. I'm not certain, but I think our use of "temp_fetch_len" is just broken. Let's try increasing the "journaler prefetch periods" config option to 20; if that does fix it, my supposition is correct and I'll go in more carefully.
Updated by Greg Farnum about 11 years ago
- Subject changed from mds replay forever to Journaler: _is_readable() and _prefetch() don't communicate correctly
- Description updated (diff)
- Category set to 1
- Status changed from In Progress to 12
Updated by Andras Elso about 11 years ago
Can I continue testing CephFS, or will you make the fix for this bug quickly so I can verify it on my system?
Updated by Greg Farnum about 11 years ago
- Status changed from 12 to Fix Under Review
There were a couple related bugs which prevented this from working right. I don't guarantee it's bug-free now, but the issue in question here is dealt with. I tested it on a vstart cluster by setting journaler prefetch periods = 1 and mds log segment size = 1024, then restarting the MDS while under load. It worked. Hurray.
Pushed the patch to wip-journaler-4618.
Updated by Greg Farnum about 11 years ago
- Project changed from Ceph to CephFS
- Category deleted (1)
Updated by Andras Elso about 11 years ago
Confirmed; I tested on my system, and the journal check can load the journal.
But there is a line in the commit:
ldout(cct, 0) << "_prefetch" << dendl;
Does this line need debug level 0?
Updated by Greg Farnum about 11 years ago
Heh, no; that was supposed to be a 10. Re-pushed; thanks!
Updated by Greg Farnum about 11 years ago
- Assignee changed from Greg Farnum to Sam Lang
He said he could look at this for me today.
Updated by Greg Farnum about 11 years ago
- Status changed from Fix Under Review to Resolved
Merged into next in commit:8eb5465c10840d047a894d1a4f079ff8b8d608b5. This would apply to bobtail as well if we decide we want to backport it, but my inclination is to not bother — it's not a likely case and has workarounds.