Feature #58488
mds: avoid encoding srnode for each ancestor in an EMetaBlob log event
% Done: 100%
Description
This happens via MDCache::predirty_journal_parents(), where MDCache::journal_dirty_inode() is called for each ancestor in the path hierarchy during dentry modification operations:
  for (const auto& in : lsi) {
    journal_dirty_inode(mut.get(), blob, in);
  }
In EMetaBlob::add_primary_dentry():
  bufferlist snapbl;
  const sr_t *sr = in->get_projected_srnode();
  if (sr)
    sr->encode(snapbl);

  lump.nfull++;
  lump.add_dfull(dn->get_name(), dn->get_alternate_name(), dn->first, dn->last,
                 dn->get_projected_version(), pi, in->dirfragtree,
                 in->get_projected_xattrs(), in->symlink, in->oldest_snap, snapbl,
                 state, in->get_old_inodes());
It seems that the srnode is encoded and persisted multiple times in the (EMetaBlob) log event. Persisting it once per log event could reduce the size of the log event. This may need changes to the replay and committing code to teach it to look for the "singleton" srnode in the log event.
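One possible shape for this is a per-event table of encoded srnodes keyed by inode number, which each dentry lump references instead of carrying its own copy. The following is a minimal standalone sketch of that idea only; the type, member and helper names are hypothetical (a std::string stands in for the encoded sr_t) and do not reflect the actual EMetaBlob implementation:

  // Hypothetical sketch: one encoded srnode per inode for the whole event,
  // instead of one copy per fullbit added by add_primary_dentry().
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  struct FakeMetaBlob {
    // Shared "singleton" srnode table for the event (hypothetical member).
    std::map<uint64_t, std::string> srnode_table;

    // Called once per ancestor; only stores the encoded srnode the first time.
    void add_srnode_once(uint64_t ino, const std::string &encoded_srnode) {
      srnode_table.emplace(ino, encoded_srnode);  // no-op if already present
    }

    // Replay side: look up the shared copy instead of a per-dentry one.
    const std::string *find_srnode(uint64_t ino) const {
      auto it = srnode_table.find(ino);
      return it == srnode_table.end() ? nullptr : &it->second;
    }
  };

  int main() {
    FakeMetaBlob blob;
    // Journaling the same realm from several ancestors stores it only once.
    blob.add_srnode_once(0x1, "encoded-sr_t-for-realm1");
    blob.add_srnode_once(0x1, "encoded-sr_t-for-realm1");
    std::cout << "entries: " << blob.srnode_table.size() << "\n";  // prints 1
  }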
Subtasks
History
#1 Updated by Xiubo Li 8 months ago
More detail about this:
For example, with /AAAA/BBBB/CCCC/, we create snapshots under / and /AAAA/BBBB/. Later, when we create a file under /AAAA/BBBB/CCCC/ and submit the MDLog event for this creation, it will predirty all the parents:
  predirty_journal_parents frag->inode on [dir 0x10000000002 /AAAA/BBBB/CCCC/ [2,head]
  predirty_journal_parents frag->inode on [dir 0x10000000001 /AAAA/BBBB/ [2,head]       --> snaprealm2
  predirty_journal_parents frag->inode on [dir 0x10000000000 /AAAA/ [2,head]
  predirty_journal_parents frag->inode on [dir 0x1 / [2,head]                           --> snaprealm1
And predirty_journal_parents() will encode the snapshots of both snaprealm1 and snaprealm2 into the MDLog entries:
  predirty_journal_parents() --> journal_dirty_inode() --> metablob->add_primary_dentry() --> sr->encode(snapbl)

  const sr_t *sr = in->get_projected_srnode();
  if (sr)
    sr->encode(snapbl);

  lump.nfull++;
  lump.add_dfull(dn->get_name(), dn->get_alternate_name(), dn->first, dn->last,
                 dn->get_projected_version(), pi, in->dirfragtree,
                 in->get_projected_xattrs(), in->symlink, in->oldest_snap, snapbl,
                 state, in->get_old_inodes());
If there are enough snapshots, the size of this MDLog entry could be very large, around 4 MB, which will fill a single MDLog segment:
  2023-01-12T14:38:17.897+0000 7f640ca32700  5 mds.1.log trim already expired segment 589291537/8928452893621, 3 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339159/8932332493501, 34 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339193/8932336745429, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339225/8932343168103, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339257/8932345000282, 29 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339286/8932349277599, 53 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339339/8932353793091, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339371/8932359183592, 35 events
And this could prevent the MDLog from being trimmed in time, causing the segment count to grow sharply:
  2023-01-12T14:40:47.900+0000 7f640ca32700 10 mds.1.log trim 21790 / 256 segments, 1782870 / -1 events, 1 (83) expiring, 21531 (1770973) expired
  ...
  2023-01-12T12:29:42.713+0000 7f640ca32700 10 mds.1.log trim 17938 / 256 segments, 1589067 / -1 events, 1 (83) expiring, 17680 (1577245) expired
#2 Updated by Venky Shankar 8 months ago
- Priority changed from Normal to Low
#3 Updated by Venky Shankar 8 months ago
Greg mentioned that this should be worked on only once it's actually proven that it is causing slowness in the MDS - I agree. What we can start with is adding a perf counter that increments when the log event (esp. the subtreemap) size exceeds a threshold. That way, on a problematic cluster, we can examine these counters to infer whether the slowness is due to (re)logging large log events.
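A minimal standalone sketch of that counter idea is below; the names, the threshold value, and the hook point are assumptions for illustration, not the MDS PerfCounters wiring that an actual patch would use:

  // Hypothetical sketch: count log events whose encoded size exceeds a threshold.
  #include <atomic>
  #include <cstdint>
  #include <iostream>

  constexpr uint64_t kLargeEventThreshold = 1 << 20;  // assumed 1 MiB threshold
  std::atomic<uint64_t> large_event_count{0};

  // Would be called with the encoded size of each submitted log event
  // (e.g. an EMetaBlob-carrying event or the subtreemap).
  void note_event_size(uint64_t encoded_bytes) {
    if (encoded_bytes > kLargeEventThreshold)
      large_event_count.fetch_add(1, std::memory_order_relaxed);
  }

  int main() {
    note_event_size(4 * 1024 * 1024);  // a ~4 MB event trips the counter
    note_event_size(64 * 1024);        // a small event does not
    std::cout << "large events: " << large_event_count.load() << "\n";  // prints 1
  }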
#4 Updated by Patrick Donnelly 8 days ago
- Target version changed from v18.0.0 to v19.0.0