Bug #64298

open

CephFS metadata pool has large OMAP objects corresponding to strays

Added by Alexander Patrakov 3 months ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Category:
Administration/Usability
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello developers,

A customer has a cluster which currently has 4 large OMAP objects (one old and three new) in its metadata pool. I am aware of https://tracker.ceph.com/issues/45333, and the comment https://tracker.ceph.com/issues/45333#note-6 describes a procedure for triggering directory fragmentation: reconstruct the directory path and list that directory to get it fragmented. However, in our case this procedure is inapplicable.
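
For reference, that procedure boils down to decoding the backtrace stored in the object's "parent" xattr and then listing the reconstructed path from a client mount. A minimal sketch (the mountpoint /mnt/mainfs and the use of jq for path assembly are my own assumptions):

# OBJ=100290d9cb3.00000000
# FSPATH=$(rados getxattr --pool=mainfs.meta $OBJ parent \
      | ceph-dencoder type inode_backtrace_t import - decode dump_json \
      | jq -r '[.ancestors[].dname] | reverse | join("/")')
# ls "/mnt/mainfs/$FSPATH" > /dev/null    # listing the directory lets the MDS fragment it

Here is the backtrace of one of the new large objects: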

# rados getxattr --pool=mainfs.meta 100290d9cb3.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
    "ino": 1100200385715,
    "ancestors": [
        {
            "dirino": 1543,
            "dname": "100290d9cb3",
            "version": 318702055
        },
        {
            "dirino": 256,
            "dname": "stray7",
            "version": 1405762425
        }
    ],
    "pool": 2,
    "old_pools": []
}

See - it is a stray. In fact, all three new large OMAP objects correspond to stray directories, which for this reason cannot be listed. Instructions should be provided on how to deal with this situation.

Regarding possible snapshots: the oldest snapshot of a directory that "officially" should have snapshots is dated January 28, 2024. There might be older snapshots of other directories, I have not searched for them and I don't know if they exist.

Regarding the contents of one of the stray objects, I did this to get some statistics:

# ceph tell mds.0 dump tree "~mdsdir/stray7" > stray7.json
# ls -l stray7.json 
-rw-r--r-- 1 root root 710084873 Feb  2 08:25 stray7.json
# wc -l stray7.json 
23391176 stray7.json
# grep stray_prior_path stray7.json | wc -l
135172
# grep stray_prior_path stray7.json | grep -v '"stray_prior_path": ""' | wc -l
358

I can confirm that the entries with non-empty stray_prior_path are "clustered" in two different directories. I have checked one entry manually - it does not exist as either a file or a directory, but its parent does and contains a lot of existing subdirectories named in a similar way.


Files

snaps.json (2.61 KB) Alexander Patrakov, 03/21/2024 06:55 PM
Actions #1

Updated by Alexander Patrakov 3 months ago

Would it be a good idea to perform a CephFS scrub as described here? https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub

If so - would it be OK to do it while there are still production jobs using CephFS in their usual way? Do I need to take any precautions?
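
For reference, the recursive stray scrub from that documentation page would be invoked roughly as follows; using "mainfs" as the file system name is my assumption based on the metadata pool name above:

# ceph tell mds.mainfs:0 scrub start ~mdsdir recursive
# ceph tell mds.mainfs:0 scrub status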

Actions #2

Updated by Alexander Patrakov 3 months ago

Update: according to the logs, the metadata pool was expanded from 64 to 256 PGs a few days ago. One of the four large OMAP health warnings (the old one) had therefore been stale since the resize, and it was cleared by deep-scrubbing its PG. So what remains is three large OMAP objects, all of them on stray directories.
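
For reference, re-checking such a stale warning by hand looks roughly like this (both the object name and the PG id below are placeholders):

# ceph osd map mainfs.meta <object>        # find the PG that holds the object
# ceph pg deep-scrub <pgid>                # re-scrub it so the OMAP key count is re-evaluated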

Actions #3

Updated by Venky Shankar 3 months ago

  • Assignee set to Kotresh Hiremath Ravishankar
  • Target version set to v19.0.0

Kotresh, please RCA.

Actions #4

Updated by Alexander Patrakov 3 months ago

Update: now there are 14 such objects. All strays.

The total number of strays in this cluster is:

# ceph tell mds.0 perf dump | grep stray
        "num_strays": 1410467,
        "num_strays_delayed": 28,
        "num_strays_enqueuing": 0,
        "strays_created": 524821989,
        "strays_enqueued": 523963811,
        "strays_reintegrated": 184441,
        "strays_migrated": 0,

Actions #5

Updated by Kotresh Hiremath Ravishankar 3 months ago

Alexander Patrakov wrote:

Would it be a good idea to perform a CephFS scrub as described here? https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub

If so - would it be OK to do it while there are still production jobs using CephFS in their usual way? Do I need to take any precautions?

I am still investigating what might have caused this. Could you share the following information?

1. How many snapshots are present? The output of 'ceph tell mds.<rank> dump snaps' would be helpful.
2. Is this a multi-active MDS setup (max_mds > 1)? Please share the 'ceph fs dump' output. (Both commands are shown below.)
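
For a single active MDS at rank 0, these are:

# ceph tell mds.0 dump snaps
# ceph fs dump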

You could try scrubbing mdsdir, but that might affect production I/O if there are a lot of stray entries, which is true in this case. So I would suggest doing it while production jobs aren't running.
Also note that there was an MDS crash while scrubbing mdsdir, which has since been fixed. You have to make sure the fix (https://github.com/ceph/ceph/pull/50815)
for https://tracker.ceph.com/issues/51824 is present before attempting it.
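
A quick way to see which release every daemon, including the MDS, is currently running (whether that release actually contains the fix still has to be checked against the PR) is:

# ceph versions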

Thanks,
Kotresh H R

Actions #6

Updated by Alexander Patrakov about 1 month ago

Hi Kotresh,

I missed the notification, sorry for that.

Regarding your requests: ceph tell mds.0 dump snaps - see the attached file.

Multi MDS is not active and was never active.

Some of those large OMAP objects have disappeared by themselves, two remain, and there are nine new large OMAP objects that do not correspond to strays. We tried to apply the procedure from https://tracker.ceph.com/issues/45333#note-6 to the non-strays; however, the problem is that the directories referenced in the output of `rados getxattr ...` no longer exist.

We cannot scrub the mdsdir yet because the schedule to update to a Ceph version that fixes #51824 has not yet been approved.

The owner of the data in this cluster provided some insightful comments.

For the original issue with strays:

We had a few workloads that were quite heavy on the file server in the last few months. They have finished, and I have deleted a lot of files (~200 million) that were needed only temporarily by those workloads. Nothing should have been writing into the folders I was deleting. The files I deleted should be truly gone once the snapshots from the end of January / early February expire.

For the new large OMAP objects that are not strays:

All these folders had their contents moved or deleted.

Regards,
Alexander Patrakov
