Bug #9427

osdc/ 405: FAILED assert(last_written.write_pos >= last_written.expire_pos)

Added by Sage Weil about 9 years ago. Updated about 9 years ago.

Target version:
% Done:


3 - minor
Affected Versions:
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):


2014-09-10 20:44:59.843880 7fc438331700 -1 osdc/ In function 'void Journaler::_write_head(Context*)' thread 7fc438331700 time 2014-09-10 20:44:59.792758
osdc/ 405: FAILED assert(last_written.write_pos >= last_written.expire_pos)

 ceph version 0.85-669-gc050afd (c050afdbe01e4204de988ef2dba28dd4ba1c34e2)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x90e9bf]
 2: (Journaler::_write_head(Context*)+0x82f) [0x7b8d2f]
 3: (Journaler::_flush(C_OnFinisher*)+0x118) [0x7bb508]
 4: (Journaler::flush(Context*)+0x3d) [0x7bb9ad]
 5: (MDLog::_submit_thread()+0x723) [0x79cfa3]
 6: (MDLog::SubmitThread::entry()+0xd) [0x5a09cd]
 7: (()+0x7e9a) [0x7fc442f27e9a]
 8: (clone()+0x6d) [0x7fc441adc3fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

on burnupi21


#1 Updated by Sage Weil about 9 years ago

2014-09-10 21:22:51.698712 7f412cd8a700 10 mds.0.journaler(rw) append_entry len 3025 to 135243619~3045
2014-09-10 21:22:51.698728 7f412cd8a700 20 mds.0.journaler(rw) flush not delaying flush
2014-09-10 21:22:51.698730 7f412cd8a700 10 mds.0.journaler(rw) _do_flush flushing 135243619~3045
2014-09-10 21:22:51.698823 7f412cd8a700 10 mds.0.journaler(rw) _do_flush (prezeroing/prezero)/write/flush/safe pointers now at (159383552/159383552)/135246664/135246664/135243619
2014-09-10 21:22:51.698826 7f412cd8a700 20 mds.0.journaler(rw) _issue_prezero target 159383552 <= prezeroing_pos 159383552
2014-09-10 21:22:51.698834 7f412cd8a700 10 mds.0.journaler(rw) write_head loghead(trim 4194304, expire 1875432704405, write 135243619, stream_format 1)

expire_pos didn't get reset during jouranl upgrade?

#2 Updated by Sage Weil about 9 years ago

ESubtreeMap has an expire_pos field, and we set it in ESubtreeMap::replay() if it is > the current expire pos. I think the journal rewrite needs to adjust this value, or we need to guard it with a journal version or something. Or drop it entirely...

#3 Updated by Sage Weil about 9 years ago

  • Project changed from Ceph to CephFS

wip-mds has hacky workaround

#4 Updated by John Spray about 9 years ago

The history:

commit 6337842c464ecb9055683270a1faaac313b93257
Author: Greg Farnum <>
Date:   Tue Nov 30 14:00:32 2010 -0800

    mds: add expire_pos to the ESubtreeMap.

    This will allow more efficient trimming during standby_replay.

    Signed-off-by: Greg Farnum <>

commit d496b65144739d4daf716d81d4ac517845e701df
Author: Greg Farnum <>
Date:   Tue Feb 15 09:00:06 2011 -0800

    Journaler: add some checks for expire_pos.

    I don't think these are necessary checks, but the expire_pos >= trim_pos
    invariant got broken somehow by johnl, and these checks won't hurt!

    Signed-off-by: Greg Farnum <>

I think that now that the journal encoding format is somewhat changeable, having absolute references to the serialized positions inside the journal entries themselves
is bad news -- we could zero them out during the reformat, or we could drop this field to make our lives simpler in future.

Looking into this, I'm a bit concerned that EMetaBlob::truncate_finish might have a similar problem.

#5 Updated by John Spray about 9 years ago

So rewriting the truncate_finish part isn't too hard if we want to do that:

(and can do a similar but simpler thing to zero out the expire_pos bit).

Greg: I don't understand what the original benefit is to feeding the expire_pos value here to stand by MDSs, could we get the same benefit by using a log segment sequence number here instead of a raw log position?

#6 Updated by Greg Farnum about 9 years ago

It doesn't need to be an absolute offset that gets fed to the standby-replay MDS, as long as it can use the information to expire old log segments and trim its cache. I think we added the expire_pos value just because they might not catch all the way up and get a chance to re-read the journal header.
(Alternative solution: make the journaled a little more complicated so that the standby-replay mds actually reads the header more often than just after running out of journal to read.)

#7 Updated by Sage Weil about 9 years ago

  • Assignee set to John Spray

#8 Updated by John Spray about 9 years ago

  • Status changed from New to Fix Under Review

#9 Updated by John Spray about 9 years ago

For posterity, the manual test procedure for the changes to rewrite that fix this issue:

The manual test procedure for this, with a vstart cluster is:
 * Create a fresh filesystem with firefly
 * Mount the FS and generate a decent number of operations (I copy my /etc onto it) so that we get at least a few log segments in play
 * Open a file, write some data to it, then open it again for truncate (I use open('foo', 'w') from python console)
 * Check MDS logs, it should have some stuff about truncates.
 * Unmount and stop the MDS.
 * Upgrade to master (in my case I'm just checking it out and recompiling my tree)
 * Restart the MDS.

Check that (a) it comes up happily, and (b) there are messages in the MDS log about "rewrite_truncate_finish applying segment seq mapping" and "_reformat journal zeroing expire pos" 

Currently the next thing that happens after that is that the MDS crashes with #9444

#10 Updated by Greg Farnum about 9 years ago

  • Status changed from Fix Under Review to Resolved

Merged to master in commit:e06f4251ac36503d33f203567ada1b096119ab80.
Immediately cherry-picked to giant in commit:c3c6468cad8fa2c976d6c325d93a34e9096da86b

Also available in: Atom PDF