Feature #58488
mds: avoid encoding srnode for each ancestor in an EMetaBlob log event
% Done: 100%
Description
This happens via MDCache::predirty_journal_parents(), where MDCache::journal_dirty_inode() is called for each ancestor in the path hierarchy during dentry modification operations:
  for (const auto& in : lsi) {
    journal_dirty_inode(mut.get(), blob, in);
  }
In EMetaBlob::add_primary_dentry():
  bufferlist snapbl;
  const sr_t *sr = in->get_projected_srnode();
  if (sr)
    sr->encode(snapbl);

  lump.nfull++;
  lump.add_dfull(dn->get_name(), dn->get_alternate_name(), dn->first, dn->last,
                 dn->get_projected_version(), pi, in->dirfragtree,
                 in->get_projected_xattrs(), in->symlink, in->oldest_snap, snapbl,
                 state, in->get_old_inodes());
It seems that the srnode is encoded and persisted multiple times in the (EMetaBlob) log event. Persisting it once per log event could reduce the size of the log event. This may need changes to the replay and committing code to teach it to look for the "singleton" srnode in the log event.
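One possible shape for this is a per-event table of encoded srnodes keyed by inode number, which each dentry lump references instead of carrying its own copy. The following is a minimal standalone sketch of that idea only; the type, member and helper names are hypothetical (a std::string stands in for the encoded sr_t) and do not reflect the actual EMetaBlob implementation:

  // Hypothetical sketch: one encoded srnode per inode for the whole event,
  // instead of one copy per fullbit added by add_primary_dentry().
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <string>

  struct FakeMetaBlob {
    // Shared "singleton" srnode table for the event (hypothetical member).
    std::map<uint64_t, std::string> srnode_table;

    // Called once per ancestor; only stores the encoded srnode the first time.
    void add_srnode_once(uint64_t ino, const std::string &encoded_srnode) {
      srnode_table.emplace(ino, encoded_srnode);  // no-op if already present
    }

    // Replay side: look up the shared copy instead of a per-dentry one.
    const std::string *find_srnode(uint64_t ino) const {
      auto it = srnode_table.find(ino);
      return it == srnode_table.end() ? nullptr : &it->second;
    }
  };

  int main() {
    FakeMetaBlob blob;
    // Journaling the same realm from several ancestors stores it only once.
    blob.add_srnode_once(0x1, "encoded-sr_t-for-realm1");
    blob.add_srnode_once(0x1, "encoded-sr_t-for-realm1");
    std::cout << "entries: " << blob.srnode_table.size() << "\n";  // prints 1
  }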
Subtasks
History
#1 Updated by Xiubo Li 8 months ago
More detail about this:
For example, with /AAAA/BBBB/CCCC/, we create snapshots under / and /AAAA/BBBB/. Later, when we create a file under /AAAA/BBBB/CCCC/ and submit the MDLog event for this creation, it will predirty all the parents:
  predirty_journal_parents frag->inode on [dir 0x10000000002 /AAAA/BBBB/CCCC/ [2,head]
  predirty_journal_parents frag->inode on [dir 0x10000000001 /AAAA/BBBB/ [2,head]       --> snaprealm2
  predirty_journal_parents frag->inode on [dir 0x10000000000 /AAAA/ [2,head]
  predirty_journal_parents frag->inode on [dir 0x1 / [2,head]                           --> snaprealm1
And predirty_journal_parents() will encode the snapshots of both snaprealm1 and snaprealm2 into the MDLog entries:
  predirty_journal_parents() --> journal_dirty_inode() --> metablob->add_primary_dentry() --> sr->encode(snapbl)

  const sr_t *sr = in->get_projected_srnode();
  if (sr)
    sr->encode(snapbl);

  lump.nfull++;
  lump.add_dfull(dn->get_name(), dn->get_alternate_name(), dn->first, dn->last,
                 dn->get_projected_version(), pi, in->dirfragtree,
                 in->get_projected_xattrs(), in->symlink, in->oldest_snap, snapbl,
                 state, in->get_old_inodes());
If there are enough snapshots, the size of this MDLog entry could be very large, around 4 MB, which will fill a single MDLog segment:
  2023-01-12T14:38:17.897+0000 7f640ca32700  5 mds.1.log trim already expired segment 589291537/8928452893621, 3 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339159/8932332493501, 34 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339193/8932336745429, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339225/8932343168103, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339257/8932345000282, 29 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339286/8932349277599, 53 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339339/8932353793091, 32 events
  2023-01-12T14:38:17.898+0000 7f640ca32700  5 mds.1.log trim already expired segment 589339371/8932359183592, 35 events
And this could prevent the MDLog from being trimmed in time, causing the segment count to grow sharply:
  2023-01-12T14:40:47.900+0000 7f640ca32700 10 mds.1.log trim 21790 / 256 segments, 1782870 / -1 events, 1 (83) expiring, 21531 (1770973) expired
  ...
  2023-01-12T12:29:42.713+0000 7f640ca32700 10 mds.1.log trim 17938 / 256 segments, 1589067 / -1 events, 1 (83) expiring, 17680 (1577245) expired
#2 Updated by Venky Shankar 8 months ago
- Priority changed from Normal to Low
#3 Updated by Venky Shankar 8 months ago
Greg mentioned that this should be worked on only once it's actually proven that it is causing slowness in the MDS - I agree. What we can start with is adding a perf counter that increments when the log event (esp. the subtreemap) size exceeds a threshold. That way, on a problematic cluster, we can examine these counters to infer whether the slowness is due to (re)logging large log events.
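A minimal standalone sketch of that counter idea is below; the names, the threshold value, and the hook point are assumptions for illustration, not the MDS PerfCounters wiring that an actual patch would use:

  // Hypothetical sketch: count log events whose encoded size exceeds a threshold.
  #include <atomic>
  #include <cstdint>
  #include <iostream>

  constexpr uint64_t kLargeEventThreshold = 1 << 20;  // assumed 1 MiB threshold
  std::atomic<uint64_t> large_event_count{0};

  // Would be called with the encoded size of each submitted log event
  // (e.g. an EMetaBlob-carrying event or the subtreemap).
  void note_event_size(uint64_t encoded_bytes) {
    if (encoded_bytes > kLargeEventThreshold)
      large_event_count.fetch_add(1, std::memory_order_relaxed);
  }

  int main() {
    note_event_size(4 * 1024 * 1024);  // a ~4 MB event trips the counter
    note_event_size(64 * 1024);        // a small event does not
    std::cout << "large events: " << large_event_count.load() << "\n";  // prints 1
  }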
#4 Updated by Patrick Donnelly 8 days ago
- Target version changed from v18.0.0 to v19.0.0