Feature #8563


mds: permit storing all metadata in metadata pool

Added by Alexandre Oliva almost 10 years ago. Updated about 6 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%

Source:
other
Tags:
hard
Backport:
Reviewed:
Affected Versions:
Component(FS):
MDS
Labels (FS):
Pull request ID:

Description

(this had originally been filed as issue #8230, which was hijacked into an unrelated issue)

I'm speaking specifically about the ceph.parent attribute, which the mds accesses during recovery and maintains in order to backlink data files to their parent directories for purposes of hardlink handling.

This means the replication count for metadata is not honored for this attribute, making the consistency of the entire filesystem more fragile (assuming the data pool has a lower replication count, that is).

It also slows down recovery, assuming the metadata pool is stored on faster disks.

Although attaching a metadata attribute to a data object is a sensible design decision, since it avoids creating additional objects in the metadata pool, I wish creating such additional objects and attaching the ceph.parent attribute to them were at least a filesystem option.
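
To make this concrete, here is a minimal python-rados sketch of reading the backtrace straight off a file's first data object; the pool name, the object-name pattern (inode number in hex plus ".00000000") and the RADOS-level xattr name ('parent') are assumptions for illustration, not a definitive recipe.

    #!/usr/bin/env python
    # Minimal sketch: read the backtrace xattr directly off a file's first
    # data object, to show that this piece of metadata lives in the data
    # pool rather than the metadata pool.  Pool and object names are
    # assumptions for illustration.
    import rados

    POOL = 'cephfs_data'           # assumed data pool name
    OBJ = '10000000001.00000000'   # assumed: inode number in hex + '.00000000'

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            # At the RADOS level the backtrace is stored as the 'parent'
            # xattr on the inode's first object; the value is a
            # ceph-encoded inode_backtrace_t (decodable offline, e.g. with
            # ceph-dencoder).
            raw = ioctx.get_xattr(OBJ, 'parent')
            print('%d bytes of encoded backtrace' % len(raw))
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()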

Actions #1

Updated by Zheng Yan almost 10 years ago

I like this idea too, but currently have no time to implement it.

Actions #2

Updated by Greg Farnum almost 10 years ago

My concern with this is that part of the point is to provide a way to recover data into the hierarchy even if we've lost the associated metadata. Moving these into a different pool breaks that guarantee.
Additionally, while I'd like to reduce the extra seeks involved on file creates (probably by having the client do it), I think most of your performance issues come from having a large existing cluster without any backpointers, rather than from inherent issues. (Once you've dirtied all your inodes that part will cease being an issue.)

Which doesn't mean we won't make it an option, but those are my concerns with investing the effort on it.

Actions #3

Updated by Alexandre Oliva almost 10 years ago

I like the notion of being able to recover at least part of the tree structure from the data pools alone. Maybe the option should be tri-state, so that the parent metadata could be stored in the data pools (to enable this sort of disaster recovery), in the metadata pool (to enable faster mds recovery), or in both. I think I can find my way to implementing such an option, if you guys agree it's something desirable.
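
Purely as an illustration of the tri-state idea (none of these names exist in Ceph today; this just spells out the three states in code):

    # Hypothetical sketch only: a tri-state setting for where the parent
    # backtrace would be written.  Neither the enum nor the option exists
    # in Ceph.
    from enum import Enum

    class BacktraceLocation(Enum):
        DATA_POOL = 'data'          # current behavior: disaster recovery from the data pool
        METADATA_POOL = 'metadata'  # faster mds recovery, honors metadata replication
        BOTH = 'both'               # pay an extra write, keep both properties

    def backtrace_targets(setting, data_pool, metadata_pool):
        """Pools a backtrace update would be sent to under each setting."""
        if setting is BacktraceLocation.DATA_POOL:
            return [data_pool]
        if setting is BacktraceLocation.METADATA_POOL:
            return [metadata_pool]
        return [data_pool, metadata_pool]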

As for your suspicion, I observe delayed parent xattr requests in 3 different scenarios: (i) getxattr during mds recovery, (ii) setxattr a while after creating new files, and (iii) setxattr a while after creating a hardlink farm of an oldish tree. Your suspicion only covers case (iii), and only if a parent xattr version bump is needed.

A while ago, I went over all inodes in the cluster and made sure all of them had ceph.parent set. You might even remember a patch I posted to that end, enabling a client to issue a setxattr op to direct the mds to force a ceph.parent update for an inode. I used that patch to ensure all reachable inodes in my cluster had the attribute set. There were about half a dozen inodes that were not reachable, and hadn't been for quite a while, that didn't have the xattr and that I couldn't find, not even in the stray directories, but I didn't pursue that any further. I hoped they'd eventually go away, but I've been sort of waiting for cephfsck since then ;-)
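
For reference, the kind of sweep I did can be approximated with python-rados along these lines; it's a rough sketch that assumes the data pool name, assumes the backtrace only needs to be present on a file's first object (the one named <ino-in-hex>.00000000), and treats a failed get_xattr as "missing":

    #!/usr/bin/env python
    # Rough sketch: sweep a data pool for first objects that lack the
    # 'parent' backtrace xattr.  The pool name is an assumption.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('cephfs_data')    # assumed data pool name

    missing = []
    for obj in ioctx.list_objects():
        if not obj.key.endswith('.00000000'):
            continue                  # only a file's first object carries the backtrace
        try:
            ioctx.get_xattr(obj.key, 'parent')
        except rados.Error:           # a missing xattr surfaces as an error (ENODATA)
            missing.append(obj.key)

    print('%d first objects without a parent backtrace' % len(missing))
    ioctx.close()
    cluster.shutdown()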

Actions #4

Updated by Greg Farnum almost 10 years ago

Can you talk about cases (i) and (ii) in a little more detail? And how you're observing the delayed xattr requests?

Actions #5

Updated by Alexandre Oliva almost 10 years ago

I observe the slow requests with ceph -w, or by watching osd log files.

(i) The mds getxattr parent requests during recovery occur during rejoin; the number of such requests seems to be correlated with the number of caps/locks held by clients that survived the mds restart. Say, if I walk a large directory tree within a ceph-fuse mount point, stat()ing each file, leave the fs mostly alone for a while, and then restart the mds, it will issue tons of getxattr requests.

(ii) The mds setxattr parent requests after creation of inodes are part of the normal creation of the ceph.parent attribute: as the mds prepares to flush an old mds log segment, it will issue setxattr parent requests for all newly-created (or otherwise requiring ceph.parent update) inodes, and will wait for the operations to complete before trimming the log. When I rsync a tree with thousands of files into the cephfs, or I explode a large tarball, a while later (when the inodes are to be flushed from the journal) I observe this sort of behavior.

Is this the sort of detail you were looking for?

Actions #6

Updated by Greg Farnum almost 10 years ago

Yeah. I'm actually not sure why it would be sending out rados xattr requests on restart unless you'd lost clients, but the rest of it makes sense. It sounds like if we had the clients responsible for setting initial parent traces on file create that would do a lot of good (combining the data and parent write into a single IO). Thanks!
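
To spell out what "a single IO" would save, here is a sketch of the two separate RADOS operations that today's flow implies for a freshly created file (object and pool names are made up, and the xattr value stands in for a real encoded backtrace); the suggestion above is essentially to fold both into one compound operation issued by the client at create time:

    #!/usr/bin/env python
    # Sketch of the two RADOS operations implied today: the data write, and
    # the later 'parent' backtrace xattr update on the same object.  Names
    # and the xattr payload are placeholders for illustration.
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('cephfs_data')    # assumed data pool name

    oid = '10000000002.00000000'                 # assumed first object of a new file

    ioctx.write_full(oid, b'file contents')      # IO 1: the data write
    ioctx.set_xattr(oid, 'parent',
                    b'<encoded inode_backtrace_t>')  # IO 2: the backtrace update

    ioctx.close()
    cluster.shutdown()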

Actions #7

Updated by Greg Farnum almost 8 years ago

  • Category set to Performance/Resource Usage
  • Component(FS) Client, MDS added

Actions #8

Updated by Patrick Donnelly about 6 years ago

  • Tracker changed from Bug to Feature
  • Subject changed from cephfs stores metadata in data pools to mds: permit storing all metadata in metadata pool
  • Tags set to hard
  • Component(FS) deleted (Client)