Feature #91

closed

mds: up:shadow mode

Added by Sage Weil almost 14 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 50%

Description

replay client while in standby, so we can take over immediately on failure.

Actions #1

Updated by Sage Weil over 13 years ago

  • Assignee set to Greg Farnum
  • Priority changed from Low to Normal
  • Target version set to v0.24

Update the journaler interface to allow the MDS to 'tail' the journal... periodically check to see if it's been extended and read events as they are written. (Someday we can use the new watch/notify to make this efficient!)
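
A minimal sketch of the "tail" idea, in Python rather than the actual Journaler code: the reader remembers its read position, re-probes the write position on each poll, and returns only the events written since last time. The `Journal` and `TailingReader` classes and their fields are hypothetical stand-ins, not the real interface.

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    """Hypothetical stand-in for the journal: an append-only list of events."""
    events: list = field(default_factory=list)

    @property
    def write_pos(self) -> int:
        # Position just past the last written event.
        return len(self.events)

    def append(self, event: str) -> None:
        self.events.append(event)


@dataclass
class TailingReader:
    """Hypothetical reader that remembers how far it has replayed."""
    journal: Journal
    read_pos: int = 0

    def poll(self) -> list:
        """Return events written since the last poll and advance read_pos."""
        end = self.journal.write_pos  # re-probe: has the journal been extended?
        new_events = self.journal.events[self.read_pos:end]
        self.read_pos = end
        return new_events


journal = Journal()
reader = TailingReader(journal)

journal.append("mkdir /a")
journal.append("create /a/f")
first = reader.poll()    # picks up both events

second = reader.poll()   # nothing new has been written

journal.append("unlink /a/f")
third = reader.poll()    # only the newly written event
```

A real standby would call something like `poll()` on a timer (or, eventually, on a watch/notify callback) instead of inline as done here.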

The shadow mds will also need to 'trim' the expired part of the journal by periodically checking the journaler header (and expire_pos) and trimming old LogSegments and associated metadata out of its cache. This will probably be tricky, but whoever does it will hopefully come out with a thorough grasp of how the replay works and can write some of it down!
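
A rough Python sketch of the trimming step, under simplifying assumptions: each segment covers a byte range of the journal and records which cached items it touched, and a segment is considered expired once it ends at or before `expire_pos`. The `Segment` class and `trim_expired` function are hypothetical; the real LogSegment bookkeeping is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a LogSegment covering [start, end) in the journal."""
    start: int
    end: int
    cached_items: set = field(default_factory=set)


def trim_expired(segments: list, cache: set, expire_pos: int) -> list:
    """Drop segments ending at or before expire_pos and evict their cache items.

    Items still referenced by a live segment stay in the cache.
    Returns the list of live segments; the cache is modified in place.
    """
    live = [s for s in segments if s.end > expire_pos]
    still_referenced = set()
    for s in live:
        still_referenced |= s.cached_items
    for s in segments:
        if s.end <= expire_pos:
            # Evict only what no live segment still needs.
            cache.difference_update(s.cached_items - still_referenced)
    return live


segments = [
    Segment(0, 100, {"inode:1", "inode:2"}),
    Segment(100, 200, {"inode:2", "inode:3"}),
    Segment(200, 300, {"inode:4"}),
]
cache = {"inode:1", "inode:2", "inode:3", "inode:4"}

# Pretend the journaler header now reports expire_pos = 100.
segments = trim_expired(segments, cache, expire_pos=100)
```

Here the first segment is dropped and only `inode:1` is evicted, because `inode:2` is still referenced by a live segment.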

Actions #2

Updated by Sage Weil over 13 years ago

  • Estimated time set to 16:00 h
  • Source set to 5
Actions #3

Updated by Greg Farnum over 13 years ago

  • Status changed from New to In Progress

I've been getting some proper time in on this on and off over the last few days. Pushed the Journaler changes to the branch standby_replay. Will be starting on making the MDS tail and update next!

Actions #4

Updated by Greg Farnum over 13 years ago

  • % Done changed from 0 to 20

Updated Journaler to make new interface options asynchronous.
Presently working on how to disambiguate between a one-shot and continuous replay (probably a new state) on the MDS and Monitor. Then implement basic continuous replay without worrying much about evicting stuff from the cache. Finally figure out how to effectively evict stuff from the cache without breaking interaction with the other MDSes.
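
One way to picture the one-shot versus continuous distinction, as a Python sketch only: a single replay loop whose exit condition depends on the mode. The `ReplayMode` enum, the `replay` function, and its callback parameters are all hypothetical and say nothing about how the MDS and Monitor actually represent the state.

```python
from enum import Enum


class ReplayMode(Enum):
    ONE_SHOT = "one_shot"       # replay to the end of the journal, then move on
    CONTINUOUS = "continuous"   # keep tailing until asked to take over


def replay(mode, poll, apply, takeover_requested):
    """Apply journal events from poll(); return how many were applied.

    ONE_SHOT stops at the first empty poll. CONTINUOUS keeps polling
    until takeover_requested() returns True. A real loop would sleep
    between polls instead of spinning.
    """
    applied = 0
    while True:
        events = poll()
        for event in events:
            apply(event)
            applied += 1
        if mode is ReplayMode.ONE_SHOT and not events:
            return applied
        if mode is ReplayMode.CONTINUOUS and takeover_requested():
            return applied


def make_poll(batches):
    """Build a poll() that yields the given batches, then empty lists."""
    it = iter(batches)
    return lambda: next(it, [])


batches = [["a", "b"], [], ["c"]]

one_shot_seen = []
one_shot_count = replay(ReplayMode.ONE_SHOT, make_poll(batches),
                        one_shot_seen.append, lambda: False)

continuous_seen = []
continuous_count = replay(ReplayMode.CONTINUOUS, make_poll(batches),
                          continuous_seen.append,
                          lambda: len(continuous_seen) == 3)
```

The one-shot run stops at the first empty poll and never sees `"c"`, while the continuous run survives the empty poll and picks it up.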

Actions #5

Updated by Sage Weil over 13 years ago

  • Target version changed from v0.24 to v0.25
Actions #6

Updated by Sage Weil over 13 years ago

  • Position deleted (365)
  • Position set to 1
Actions #7

Updated by Sage Weil over 13 years ago

  • Position deleted (1)
  • Position set to 3
Actions #8

Updated by Greg Farnum over 13 years ago

  • % Done changed from 20 to 50

I have yet to implement trimming, but the basic restarting-replay bits are now in place along with hooks to make it start. Testing is revealing a fair number of issues with the Journaler and MDLog, though -- they don't much like repeating this process!

Actions #9

Updated by Greg Farnum over 13 years ago

Okay, this seems to be working now. Had to adjust how the Journaler treated read_pos and to fix a few of my new re-read functions as they weren't setting all variables properly, and now it loops happily.
Now I'm implementing the state change OUT of standby_replay, so these machines can take over. (Won't take long.)
Trimming will be the last thing to do, but it's starting to look simpler.

Actions #10

Updated by Greg Farnum over 13 years ago

  • Status changed from In Progress to Resolved

Well, this seems to be working as best I can tell.

There are some odd issues with virtual memory usage growing by leaps and bounds, but heap analyzer tools (tcmalloc heapdump, massif) indicate that it's not actually using that much memory, so... fragmentation?
Yehuda and Sage can't come up with anything so we decided to table it unless we hear about real problems. Merged into unstable!

Actions #11

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
  • Target version deleted (v0.25)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
