Bug #18730

closed

mds: backtrace issues getxattr for every file with cap on rejoin

Added by John Spray about 7 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
luminous
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In Server::handle_client_reconnect, inode numbers that had client caps but were not in cache are passed into MDCache::rejoin_recovered_caps, which puts them into MDCache::cap_imports. Later, during rejoin, MDCache::process_imported_caps iterates over cap_imports and every item generates a call to MDCache::open_ino (i.e. a getxattr to the data pool to read the backtrace).
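
For illustration, here is a minimal stand-in sketch of that flow (not the actual MDS code: inodeno_t is reduced to a plain integer, the cache to a set, and open_ino() to a per-inode backtrace read; only the names cap_imports, rejoin_recovered_caps, process_imported_caps and open_ino correspond to the real MDCache members):

    // Toy model of the rejoin flow described above; the real types and
    // logic live in Server.cc / MDCache.cc and are far more involved.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>

    using inodeno_t = uint64_t;

    // Filled by rejoin_recovered_caps(): inodes that clients reported caps
    // for during reconnect but that are not in the MDS cache. The value
    // stands in for the reconnected cap state.
    std::map<inodeno_t, int> cap_imports;

    std::set<inodeno_t> in_cache;  // stand-in for the MDS inode cache

    // Stand-in for MDCache::open_ino(): an uncached inode costs one
    // getxattr to the data pool to read its backtrace.
    void open_ino(inodeno_t ino) {
      if (in_cache.count(ino))
        return;                    // already cached: nothing to do
      std::cout << "getxattr backtrace for inode " << ino << "\n";
      in_cache.insert(ino);
    }

    // Stand-in for MDCache::process_imported_caps(): every entry in
    // cap_imports triggers open_ino(), i.e. one backtrace lookup per
    // capability-holding inode that was missing from cache.
    void process_imported_caps() {
      for (const auto& it : cap_imports)
        open_ino(it.first);
    }

    int main() {
      for (inodeno_t ino = 1; ino <= 10; ++ino)
        cap_imports[ino] = 1;      // pretend 10 capped inodes were reconnected
      process_imported_caps();     // prints one backtrace lookup per inode
    }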

This is massively inefficient because, in almost any real workload, many of the files being resolved are in the same directory as one another; whichever one is resolved first fetches the whole dirfrag, rendering the backtrace lookups for all the other files in that fragment redundant.

In practice, this is causing a user to experience 15-minute-long rejoin phases on a system with ~5M files holding capabilities:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html ("[ceph-users] MDS flapping: how to increase MDS timeouts?")

One simple solution would be to throttle the number of calls to open_ino from process_imported_caps to some configurable threshold (e.g. 1000 by default), so that by the time the MDS came to resolve the next batch of inodes, the dirfrags fetched for the first batch would already be in cache and those open_ino calls would be no-ops (we would need to check that open_ino really handles the no-op case efficiently).
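
A hedged sketch of that throttle idea, reusing the same toy stand-ins as the sketch above (the batch-size constant, the dir_of() layout and the batch bookkeeping below are all hypothetical; in the real MDS the batching would have to wait on the asynchronous open_ino completions):

    // Toy model of the proposed throttle, not a patch: resolve cap_imports
    // in bounded batches and skip inodes that earlier batches already pulled
    // into cache via their dirfrag fetches.
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <set>
    #include <vector>

    using inodeno_t = uint64_t;

    constexpr std::size_t OPEN_INO_BATCH = 1000;    // proposed default threshold
    constexpr inodeno_t   NUM_INOS = 100000;        // capped inodes in the toy
    constexpr inodeno_t   NUM_DIRS = 1000;          // toy directories

    std::map<inodeno_t, int> cap_imports;           // capped inodes not in cache
    std::set<inodeno_t> in_cache;                   // stand-in for the MDS cache
    std::size_t getxattrs = 0;                      // backtrace reads issued

    // Toy layout: siblings of an inode are scattered across the iteration
    // order of cap_imports, which is the case the throttle helps with.
    inodeno_t dir_of(inodeno_t ino) { return ino % NUM_DIRS; }

    // Issuing open_ino() for an uncached inode costs one backtrace getxattr.
    void issue_open_ino(inodeno_t ino, std::vector<inodeno_t>& batch) {
      ++getxattrs;
      batch.push_back(ino);
    }

    // When a batch completes, each looked-up inode's dirfrag has been
    // fetched, which brings all of its siblings into cache as well.
    void complete_batch(std::vector<inodeno_t>& batch) {
      for (inodeno_t ino : batch)
        for (inodeno_t sib = dir_of(ino); sib < NUM_INOS; sib += NUM_DIRS)
          in_cache.insert(sib);
      batch.clear();
    }

    void process_imported_caps_throttled() {
      std::vector<inodeno_t> batch;
      for (const auto& it : cap_imports) {
        if (in_cache.count(it.first))
          continue;                        // no-op: cached by an earlier batch
        issue_open_ino(it.first, batch);
        if (batch.size() == OPEN_INO_BATCH)
          complete_batch(batch);           // wait here before issuing more
      }
      complete_batch(batch);               // flush the final partial batch
    }

    int main() {
      for (inodeno_t ino = 0; ino < NUM_INOS; ++ino)
        cap_imports[ino] = 1;
      process_imported_caps_throttled();
      // Unthrottled, all 100000 inodes would cost a getxattr each; throttled,
      // the lookups collapse to roughly one per directory.
      std::cout << getxattrs << " backtrace getxattrs for "
                << NUM_INOS << " capped inodes\n";
    }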

It might even make sense to apply that throttling to open_ino in general: if a workload hit lots of hard links at the same time, it could hit a similar case and generate far more getxattrs than needed.

#1

Updated by Zheng Yan about 7 years ago

I think we should design a new mechanism to track in-use inodes (the current method isn't scalable because it journals all in-use inodes in each log segment)
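
To put rough numbers on that concern (all figures below are assumptions for illustration, not measurements; only the "journals all in-use inodes in each log segment" part is from the comment above, and the delta-based alternative is just one possible shape for a new mechanism):

    // Back-of-the-envelope toy comparing re-journaling the full in-use inode
    // set in every log segment with recording only the opens/closes that
    // happened during the segment. All inputs are assumed, not measured.
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t in_use_inodes = 5'000'000;  // ~5M capped files, as above
      const uint64_t segments      = 128;        // assumed journal segments
      const uint64_t churn_per_seg = 10'000;     // assumed opens+closes/segment

      // Current method: every segment journals every in-use inode again.
      const uint64_t full_set_records = in_use_inodes * segments;

      // Delta-based tracking: only the churn of each segment is recorded
      // (plus periodic compaction, ignored here).
      const uint64_t delta_records = churn_per_seg * segments;

      std::cout << "full-set journaling: " << full_set_records << " records\n"
                << "delta tracking:      " << delta_records    << " records\n";
    }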

#2

Updated by Xiaoxi Chen about 7 years ago

Zheng Yan wrote:

I think we should design a new mechanism to track in-use inodes (the current method isn't scalable because it journals all in-use inodes in each log segment)

Sorry, Zheng, one question: why do we need to fetch the backtrace from the default data pool first, and then retry on the real pool where the file resides?

Not sure if I understand correctly: it seems that when creating a file, the backtrace will reside in both default_pool and target_pool, but later, if we mv the file to another path, the update only goes to target_pool?

And in which case will the backtrace exist only in the metadata pool? I am trying to understand https://github.com/ceph/ceph/blob/master/src/mds/MDCache.cc#L8341-L8351

#3

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from MDS issues backtrace getxattr for every file with cap on rejoin to mds: backtrace issues getxattr for every file with cap on rejoin
  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version changed from v12.0.0 to v13.0.0
  • Source set to Development
  • Backport set to luminous
#4

Updated by Zheng Yan about 6 years ago

  • Status changed from New to Closed

Should be resolved by the open file table: https://github.com/ceph/ceph/pull/20132
