Bug #1435

closed

mds: loss of layout policies upon mds restart

Added by Alexandre Oliva over 12 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%


Description

Cluster running ceph 0.33 + patch to add support for “ceph mds add_data_pool”.

I set up layout policies for various directories (renamed for privacy) on an otherwise empty cluster:

cephfs a set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 3
cephfs b set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 3
cephfs l/c set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 3
cephfs l/d set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 4
cephfs l/e set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 5
cephfs l/f set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 6
cephfs g set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 7
cephfs h set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 8
cephfs h/i set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 9
cephfs l/j set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 10
cephfs l/k set_layout -u 4194304 -s 4194304 -c 1 -o -1 -p 10

Cluster has 3 mons, 3 mdses, 3 osds, but one osd is out and only one mds is enabled.

After setting up the crushmap and the policies as above, I ran show_layout for each of the directories, and they were correct.

I started rsyncing, and each file got placed in the correct pool. However, when it started uploading to one of the directories above, it started using the “data” pool number 0. At that point, show_layout showed most (all? not sure) directories had lost their policies.

Reinstating the policies and cleaning up the misplaced files, they started being uploaded to the right place.

At some point I had to restart one of the servers, which hosted the active mds and one of the active osds. So I turned on another mds, and when it took over, the policies were (partially?) gone. Switching back to the originally active mds, after restart, also caused (partial?) loss of policies. It seems like at every mds change, the layout policies of directories are lost on all clients.


Files

LOCAL-verbose-track-layout.patch (13.6 KB) LOCAL-verbose-track-layout.patch patch that logs layout statuses and changes Alexandre Oliva, 02/16/2013 12:17 PM
cdir-fetch-recover-dir-layout.patch (925 Bytes) cdir-fetch-recover-dir-layout.patch patch that seems to fix the bug Alexandre Oliva, 02/16/2013 06:31 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Category set to 1
  • Target version set to v0.35
Actions #2

Updated by Sage Weil over 12 years ago

Okay, I see at least one problem: the IFILE lock state isn't sharing the default_file_layout with other nodes. CInode::encode_replica() is, at least, but what's probably going on is that the recovery/rejoin code relies on the lock state helpers.

Also, Greg, looking at this now, I'm wondering why we even need the default_layout pointer business at all. Why didn't we just store the layout policy in inode_t::layout? And set a flag in the inode indicating that it's defined...

Actions #3

Updated by Sage Weil over 12 years ago

Seriously, if we just put it in the layout field, this

  i = pfile ? pi:oi;
  if (is_file()) {
    e.layout = i->layout;
  } else {
    if (ppolicy && get_projected_dir_layout())
      e.layout = *get_projected_dir_layout();
    else if (default_layout)
      e.layout = default_layout->layout;
    else
      memset(&e.layout, 0, sizeof(e.layout));
  }

turns into
e.layout = i->layout;

and no need for a flag... we just zero it out.

Let's talk about this tomorrow! I'm probably forgetting something..

Actions #4

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.35 to v0.36
Actions #5

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.36 to v0.37
Actions #6

Updated by Alexandre Oliva over 12 years ago

Wasn't this fixed in 0.35? I haven't lost directory layout information any more.

Actions #7

Updated by Sage Weil over 12 years ago

  • Status changed from New to Resolved

Great news, thanks. We haven't tested it.

Actions #8

Updated by Alexandre Oliva over 12 years ago

I lied :-(

I had been running with a single mds for a while, and even though it restarted a number of times, it didn't seem to lose layout info.

Just yesterday, for unrelated reasons, I brought up another mds and let it take over, and this morning I noticed newly-created files were being placed in the wrong pool. layout settings for a number of directories had been lost :-(

Actions #9

Updated by Sage Weil over 12 years ago

  • Status changed from Resolved to In Progress
  • Target version changed from v0.37 to v0.38

Can you do a bit of legwork and help us get a process to reproduce this? Once we have that it's easy to fix.

Probably something like:

- set dir policy
- restart mds
- mount from new client (to avoid caching issues)
- examine dir policy

?

Actions #10

Updated by Greg Farnum over 12 years ago

I would assume this is just the IFILE lock state thing you talked about earlier?

There were a few other bugs that got fixed up IIRC, but we never touched that one.

Actions #11

Updated by Sage Weil over 12 years ago

Greg Farnum wrote:

I would assume this is just the IFILE lock state thing you talked about earlier?

There were a few other bugs that got fixed up IIRC, but we never touched that one.

I thought that got fixed.

I had a cleanup branch that simplified the layout storage, but never tested or merged it. We need to be able to reproduce this.

Actions #12

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.38 to v0.39
Actions #13

Updated by Sage Weil over 12 years ago

  • Position set to 1
  • Position changed from 1 to 985
Actions #14

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #15

Updated by Sage Weil over 12 years ago

  • Subject changed from loss of layout policies upon mds restart to mds: loss of layout policies upon mds restart
  • Status changed from In Progress to Need More Info

Alexandre-

Same situation here. If you can produce a full mds log that includes the creation of the initial layouts, the restart, and their subsequent disappearance, it shouldn't be too hard to nail this down.

Actions #16

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (1077)
  • Position set to 200
Actions #17

Updated by Alexandre Oliva over 12 years ago

I've been looking at the MDS implementation, and I have a theory now.

It was probably not the MDS restarts that were causing the problem or, if they were, that's fixed by now.

What is going on now is loss of dir layout information while writing a lot of data to other directories. I.e., if siblings a and b both have dirlayouts set, and I'm writing lots of files inside a, I never lose a's dirlayout info, but after a while b loses its dirlayout info. So, when an rsync that encompasses both a and b gets to b, it will start using the default layout.

My hunch is that directory b is being expired from MDCache, and not having its dirlayout info restored from the directory inode info.

Now, gotta figure out where this was supposed to happen, and then do it (or discard the theory ;-)

Actions #18

Updated by Sage Weil over 11 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)
Actions #19

Updated by Greg Farnum over 11 years ago

  • Priority changed from Normal to High

I'm guessing we want to move this up the queue; will discuss in bug scrub tomorrow!

Actions #20

Updated by Ian Colle over 11 years ago

  • Assignee set to Greg Farnum
Actions #21

Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to 12

wip-mds-layout2
needs to be rebased, reviewed, and tested!

Actions #22

Updated by Greg Farnum about 11 years ago

Wait, never mind. Too excited and didn't look closely enough at the projected node struct! :)

Actions #23

Updated by Greg Farnum about 11 years ago

I've been totally unable to come up with a scenario for how this could happen via code inspection, so I think I'm just going to have to spin up a bunch of teuthology jobs exercising this code and see what falls out.

As part of that I updated the patch which moves the directory layout code into the inode_t::layout field for #1499.

Actions #24

Updated by Alexandre Oliva about 11 years ago

I've been looking some more into this. It seems that my hunch was correct: the loss seems to occur at cluster recovery time.

I've patched the mds code a bit to display layout information at various points, and I found out that, during rejoin, CDir::fetched records the subdirs of the root dir in the cache without a default_layout in the CInode::layout (although the layout info is correct in CInode::inode), and nothing else brings the default_layout of the subdirs in.

When I query it, the result reflects what's (missing) in default_layout, not what's in CInode::inode; when I set it, a projected inode gets the new setting, and it is shortly moved to default_layout, remaining in effect until the end of the session. If the change remains in the journal when the mds restarts, it will remain effective after the restart, but if it has expired from the journal, nothing will restore default_layout.

I suspect this will be fixed by initializing default_layout from the inode layout in CDir::fetched, but I wonder if there are other properties of the subdir that would have to be explicitly fetched from elsewhere and, if so, whether the (partially) restored subdir should be placed in a different state that would get it fully fetched when needed. Thoughts?

Here are some messages I got using the attached patch. Directory 9 was active in the session before, so it got recovered correctly upon mds restart:

2013-02-16 15:38:28.610543 7f0d940a5700 0 mds.0.cache creating system inode with ino:1
2013-02-16 15:38:29.968326 7f0d90890700 0 CInode ctor dir 9 layout pool 0 inode pool 0
2013-02-16 15:38:29.968340 7f0d90890700 0 update inode dir 9 layout pool 6 inode pool 6

other directories weren't as lucky:

2013-02-16 15:39:03.895479 7f0d940a5700 1 mds.0.957 rejoin_joint_start
2013-02-16 15:39:04.001437 7f0d940a5700 0 CInode ctor dir 0 layout pool 0 inode pool 0
2013-02-16 15:39:04.001462 7f0d940a5700 0 CDir::fetched dir 0 layout pool 0 inode pool 7
2013-02-16 15:39:04.001818 7f0d940a5700 0 CInode ctor dir 11 layout pool 0 inode pool 0
2013-02-16 15:39:04.001833 7f0d940a5700 0 CDir::fetched dir 11 layout pool 0 inode pool 8
2013-02-16 15:39:04.002133 7f0d940a5700 0 CInode ctor dir 10 layout pool 0 inode pool 0
2013-02-16 15:39:04.002156 7f0d940a5700 0 CDir::fetched dir 10 layout pool 0 inode pool 3

later on, when I queried the layouts, I got:

2013-02-16 15:58:00.170531 7f0d940a5700 0 add primary dentry dir 10 layout pool 0 inode pool 3
2013-02-16 15:58:00.212051 7f0d940a5700 0 encode inode_stat dir 10 layout pool 0 inode pool 3

then I (re)set them to the correct values:

2013-02-16 15:58:48.953160 7f0d940a5700 0 get not projected dir 10 layout pool 0 inode pool 3
2013-02-16 15:58:48.953254 7f0d940a5700 0 get not projected dir 10 layout pool 0 inode pool 3
2013-02-16 15:58:48.953265 7f0d940a5700 0 project_inode dir 10 layout pool 0 inode pool 3
2013-02-16 15:58:48.954409 7f0d940a5700 0 add primary dentry dir 10 layout pool 3 inode pool 3
2013-02-16 15:58:48.954690 7f0d940a5700 0 get projected dir 10 layout pool 3 inode pool 3
2013-02-16 15:58:48.954703 7f0d940a5700 0 encode inode_stat dir 10 layout pool 0 inode pool 3

and the changes became “permanent” (until they were no longer present in the journal):

2013-02-16 15:58:49.886759 7f0d940a5700 0 will pop projected_inode dir 10 layout pool 0 inode pool 3
2013-02-16 15:58:49.886800 7f0d940a5700 0 popped projected_inode dir 10 layout pool 3 inode pool 3

Actions #25

Updated by Alexandre Oliva about 11 years ago

The attached patch seems to fix the problem. It's clear I'm not entirely happy with the idea of testing the inode layout for an all-zeros representation ;-) but that was a minimal and safe change. I recall Sage had a larger patch that did away with default_layout altogether, using the inode layout instead, and that it wasn't enough to fix the problem back when I tried it; maybe something else changed and it would suffice now. The patch didn't apply cleanly anymore, and I'd rather not put it in because of the on-disk representation changes anyway. So I'm sticking with this for now. I suppose you may want to use it for bobtail and leave on-disk changes for the next major release.

Actions #26

Updated by Alexandre Oliva about 11 years ago

Uhh, sorry, I forgot an artifact of the first patch made it into the bug-fix patch. s/default_layout_()/default_layout/g fixes it.

Actions #27

Updated by Greg Farnum about 11 years ago

Oh, n/m, hurray for our sysadmins fixing hasty upgrades. :)

Actions #28

Updated by Greg Farnum about 11 years ago

  • Status changed from 12 to Won't Fix

This patch to use the inode layout instead of default_layout shouldn't be helping — directory layouts aren't written into that inode layout at any point, and I don't see any way that a directory's inode layout would even be filled in.

However, thanks to your making me poke around in all this code from a different angle, I've finally figured out why we've been losing them. (Unless of course I'm just being over-eager again.) We're very careful throughout the CInode code to preserve layouts, and indeed we do so successfully. The problem is that the final flush to disk of an inode doesn't make use of those CInode helpers (AUGH SCREAM MUST FIX SPRINT NOW! ;) but instead uses a reimplementation in CDir::_encode_dentry — and the version of the on-disk encoding doesn't preserve layouts at all. Having it do so is currently a bit tricky, but luckily two of my tasks for the next sprint are to
1) rework the layout code to just store directory layouts in the inode::layout field (which will handle this problem, since the inode_t struct encoder is used by _encode_dentry)
2) rework the CDir::_commit functions to use a more modern encoding infrastructure.

So all aspects of this should get pretty well cleaned up very shortly. Hurray!

I'll have to discuss with Sage what we do for the old code though; I suspect we'll backport the layout store change to Bobtail but not try and fix up anything else.

Actions #29

Updated by Greg Farnum about 11 years ago

  • Status changed from Won't Fix to Fix Under Review

Okay, after discussing with Sage we've decided that placing the directory layouts into the inode_t::layout field is the best option, and I've got a patch under review that does that.
Unfortunately the layout data is lost so we can't fix it on older clusters, but this will prevent issues on newer ones. Assuming no problems and nothing else beats it in, it will be in the bobtail branch as commit:304f4502fa092de8140dcfd2cc8f8ef7853f501b.

Actions #30

Updated by Greg Farnum about 11 years ago

  • Status changed from Fix Under Review to Resolved

Change pushed to bobtail (commit:36ed407e0f939a9bca57c3ffc0ee5608d50ab7ed) and next (commit:6bd8781dda524f04bb56bcdac5aa5b13ba61b07d) branches. See #1499 for some details on testing.
