Bug #40002


mds: not trim log under heavy load

Added by Yunzhi Cheng almost 5 years ago. Updated over 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Xiubo Li
Category:
-
Target version:
-
% Done:

100%

Source:
Community (user)
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
v17.0.0
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
44215
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.1

We have 3 MDS daemons under heavy load (creating 8k files per second).

We find the MDS log grows very fast, almost 8k events/s.

At first the MDS log trims normally, but after 10 minutes the MDS log stops trimming and the metadata pool grows very fast, eventually filling the metadata pool.


Subtasks 1 (0 open, 1 closed)

Bug #53459: mds: start a new MDLog segment if new coming event possibly exceeds the expected segment size (Won't Fix, Xiubo Li)


Related issues 2 (1 open, 1 closed)

Related to CephFS - Bug #52280: Mds crash and fails with assert on prepare_new_inode (Resolved, Xiubo Li)

Related to CephFS - Bug #53542: Ceph Metadata Pool disk throughput usage increasing (Fix Under Review, Xiubo Li)

Actions #1

Updated by Zheng Yan almost 5 years ago

multiple-active mds?

Actions #2

Updated by Yunzhi Cheng almost 5 years ago

Zheng Yan wrote:

multiple-active mds?

yes

Actions #3

Updated by Patrick Donnelly almost 5 years ago

  • Subject changed from mds not trim log under heavy load to mds: not trim log under heavy load
  • Target version changed from v14.2.1 to v15.0.0
  • Start date deleted (05/22/2019)
  • Source set to Community (user)
  • Backport set to nautilus,mimic
Actions #4

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)
Actions #5

Updated by Xiubo Li over 2 years ago

  • Assignee set to Xiubo Li
Actions #6

Updated by Xiubo Li over 2 years ago

  • Related to Bug #52280: Mds crash and fails with assert on prepare_new_inode added
Actions #7

Updated by Xiubo Li over 2 years ago

  • Status changed from New to In Progress
Actions #8

Updated by Xiubo Li over 2 years ago

Mutex implementations (e.g. std::mutex in C++) do not guarantee fairness: they do not guarantee that the lock will be acquired by threads in the order in which they called lock().

In most cases this works well, but in the overload case the client-request-handling thread and _submit_thread can keep acquiring the submit_mutex for a long time, which can leave MDLog::trim() stuck. That means the MDS daemons keep writing journal logs into the metadata pool but cannot trim the expired segments in time.

We have hit this before in https://tracker.ceph.com/issues/51722, which was for the `mds_lock`.

To fix it we will switch the `submit_mutex` to a fair mutex to make sure all the `submit_mutex` waiters get a chance to run in time.
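
Just to illustrate the idea, here is a minimal sketch of a fair (FIFO, ticket-style) mutex. It is not the implementation used in Ceph (see the pull request linked in this tracker for the real change); it only shows why FIFO ordering prevents the trim path from being starved.

    // Minimal sketch of a fair (FIFO / ticket-style) mutex: waiters are served
    // in arrival order, so a periodic caller such as the trim path cannot lose
    // the race indefinitely to threads that re-acquire the lock in a tight loop.
    // Illustration only; not the Ceph implementation.
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class fair_mutex_sketch {
    public:
      void lock() {
        std::unique_lock<std::mutex> l(m_);
        const uint64_t my_ticket = next_ticket_++;               // take a place in line
        cv_.wait(l, [&] { return my_ticket == now_serving_; });  // wait for our turn
      }
      void unlock() {
        std::lock_guard<std::mutex> l(m_);
        ++now_serving_;        // hand the lock to the next waiter in FIFO order
        cv_.notify_all();
      }

    private:
      std::mutex m_;
      std::condition_variable cv_;
      uint64_t next_ticket_ = 0;   // next ticket to hand out
      uint64_t now_serving_ = 0;   // ticket currently allowed to proceed
    };

With FIFO ordering, once MDLog::trim() queues for the submit_mutex it is guaranteed to get it after the waiters ahead of it finish, instead of possibly losing the race to newly arriving lockers forever.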

Actions #9

Updated by Xiubo Li over 2 years ago

  • Backport changed from nautilus,mimic to pacific
Actions #10

Updated by Xiubo Li over 2 years ago

  • Affected Versions v17.0.0 added
  • Affected Versions deleted (v14.2.1)
Actions #11

Updated by Xiubo Li over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 44215
Actions #12

Updated by Xiubo Li over 2 years ago

  • Related to Bug #53542: Ceph Metadata Pool disk throughput usage increasing added
Actions #13

Updated by Xiubo Li over 2 years ago

There is one case that could lead the journal logs to fill up the metadata pool, such as the case in tracker #53597:

For example, if ClientA keeps a directory open and another client, ClientB, deletes it, the MDS will try to delete it but gets stuck and keeps the related dentry in the stray directory, submitting one Update("unlink_local") journal event to a log segment, segmentA.

But ClientA still holds some capabilities on the CInode, so segmentA cannot be expired and trimmed until ClientA releases those capabilities. That means none of the segments after segmentA can be expired and trimmed either, so the MDLog segments keep increasing and the metadata pool usage keeps growing.

Actions #14

Updated by Xiubo Li over 2 years ago

Xiubo Li wrote:

There is one case that could lead the journal logs to fill up the metadata pool, such as the case in tracker #53597:

For example, if ClientA keeps a directory open and another client, ClientB, deletes it, the MDS will try to delete it but gets stuck and keeps the related dentry in the stray directory, submitting one Update("unlink_local") journal event to a log segment, segmentA.

But ClientA still holds some capabilities on the CInode, so segmentA cannot be expired and trimmed until ClientA releases those capabilities. That means none of the segments after segmentA can be expired and trimmed either, so the MDLog segments keep increasing and the metadata pool usage keeps growing.

The above case is not correct.

I have reproduced it locally by:

1, set max_mds = 2

2, mount a fuse client

3, then create a lot of directories and distributed-pin them to mds.1 at the same time by using:

   $ mkdir lxb ; setfattr -n ceph.dir.pin -v 1 lxb

   $ for i in {0..2500}; do mkdir lxb/dir_$i; setfattr -n ceph.dir.pin -v 1 lxb/dir_$i; done

4, then run:

    $ ceph fs status

    a - 2 clients
    =
    RANK   STATE    MDS     ACTIVITY     DNS    INOS   DIRS CAPS
      0     active    c   Reqs:    0 /s   579    582    111 261
      1    stopping   b                  2012   2015   2014 3464
          POOL        TYPE     USED  AVAIL
     cephfs.a.meta  metadata  90.9M  98.9G
     cephfs.a.data    data       0   98.9G
     STANDBY MDS
              a

It shows the usage of the metadata pool is 90.9MB, which should be normal.

5, then set max_mds = 1 and then run:

    $ ceph fs status

    a - 2 clients
    =
    RANK   STATE    MDS     ACTIVITY     DNS    INOS   DIRS CAPS
     0     active    c   Reqs:    0 /s  2012   2015   2014 3464
     1    stopping   b                  1613   1616   1615 4
         POOL        TYPE     USED  AVAIL
    cephfs.a.meta  metadata  8488M  96.2G
    cephfs.a.data    data       0   96.2G
    STANDBY MDS
         a

The usage of the metadata pool grows sharply and at last it reaches 28GB:

    RANK  STATE   MDS     ACTIVITY     DNS    INOS   DIRS   CAPS
     0    active   a   Reqs:    0 /s  2513   2516   2515      2
         POOL        TYPE     USED  AVAIL
    cephfs.a.meta  metadata  28.0G  89.5G
    cephfs.a.data    data       0   89.5G
    STANDBY MDS
         b

6, From cephfs-journal-tool:

  $ cephfs-journal-tool --rank=a:0 header get
    {
        "magic": "ceph fs volume v011",
        "write_pos": 9554043702,
        "expire_pos": 4194304,
        "trimmed_pos": 4194304,
        "stream_format": 1,
        "layout": {
            "stripe_unit": 4194304,
            "stripe_count": 1,
            "object_size": 4194304,
            "pool_id": 2,
            "pool_ns": "" 
        }
    }

The write_pos is in the 2278th RADOS object, while the trimmed_pos stays in the second object.
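
As a side note, the object indexes can be derived directly from the header fields above (my own arithmetic, not part of the tool output), assuming 0-based object indexing:

    // Quick sanity check of the header values above: the journal is striped
    // over 4 MiB objects, so the object holding a journal offset is simply
    // offset / object_size (0-based index).
    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t object_size = 4194304;     // "object_size" from the header
      const uint64_t write_pos   = 9554043702;  // "write_pos" from the header
      const uint64_t trimmed_pos = 4194304;     // "trimmed_pos" from the header

      // prints 2277, i.e. the 2278th journal object
      std::cout << "write_pos object index:   " << write_pos / object_size << "\n";
      // prints 1, i.e. trimming stopped at the second object
      std::cout << "trimmed_pos object index: " << trimmed_pos / object_size << "\n";
      return 0;
    }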

7, From:

    $ rados -p cephfs.a.meta ls | grep '200\.00' | wc -l

     2397

If I do the above steps again and again, the metadata pool usage keeps growing as shown above.

Checked the mds.0 logs:

...
563970 2021-12-15T17:48:41.429+0800 146cba420700 10 mds.0.log submit_entry also starting new segment: last = 3354/42835772, event seq = 3672           
563971 2021-12-15T17:48:41.429+0800 146cba420700  7 mds.0.log _prepare_new_segment seq 3673
...
572886 2021-12-15T17:48:43.060+0800 146cba420700 10 mds.0.log submit_entry also starting new segment: last = 3673/18446744073709551615, event seq = 4696                                                                                                                                               
572887 2021-12-15T17:48:43.060+0800 146cba420700  7 mds.0.log _prepare_new_segment seq 4697
...
583368 2021-12-15T17:48:45.105+0800 146cba420700 10 mds.0.log submit_entry also starting new segment: last = 4697/18446744073709551615, event seq = 5720                                                                                                                                               
583369 2021-12-15T17:48:45.105+0800 146cba420700  7 mds.0.log _prepare_new_segment seq 5721
...
609918 2021-12-15T17:49:03.406+0800 146cba420700 10 mds.0.log submit_entry also starting new segment: last = 5721/1573670016, event seq = 6638         
609919 2021-12-15T17:49:03.406+0800 146cba420700  7 mds.0.log _prepare_new_segment seq 6639
...

From the above we can see that the reason for starting the two new segments with seqs `4697` and `5721` is that `ls->num_events >= 1024`, where `1024` is `mds_log_events_per_segment`.

But from the difference between the 4th segment's `ls->offset` and the 1st segment's `ls->offset`, the range is `(1573670016 - 42835772) / 4MB = 364.9` RADOS objects. That means that for the above two new segments with seqs `4697` and `5721`, their average size could be `364.9 x 4MB / 2 = 1.42GB / 2 = 0.71GB`. Just in this case the total journal logs could occupy up to `0.71GB x 128 = 91.2GB` of metadata pool space, where `128` is `mds_log_max_segments`.
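
The rollover rule behind those "also starting new segment" lines is event-count based, not size based. A simplified sketch of that check (not the real MDLog code; the `LogSegment` and `mds_log_events_per_segment` names follow the logs above, and the helper function is hypothetical):

    // Simplified illustration: a new segment is started purely by event count,
    // so nothing bounds how many *bytes* a segment ends up holding when the
    // individual events are large (e.g. a ~1MB ESubtreeMap).
    #include <cstdint>

    struct LogSegment {
      uint64_t seq = 0;
      uint64_t num_events = 0;  // events journaled into this segment so far
      uint64_t offset = 0;      // journal offset where this segment starts
    };

    constexpr uint64_t mds_log_events_per_segment = 1024;  // default mentioned above

    // Hypothetical helper mirroring the "submit_entry also starting new segment"
    // decision in the logs: roll over after 1024 events, regardless of their size.
    bool should_start_new_segment(const LogSegment &ls) {
      return ls.num_events >= mds_log_events_per_segment;
    }

This is also what subtask #53459 is about: starting a new segment when the incoming event would push the segment past the expected size, instead of only counting events.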

What has caused this?

Currently in mds/MDLog the `submit_mutex` is an unfair mutex. Mutex implementations (e.g. std::mutex in C++) do not guarantee fairness: they do not guarantee that the lock will be acquired by threads in the order in which they called lock().

In most cases this works well, but in the heavy-load case the client-request-handling thread and _submit_thread can keep acquiring the submit_mutex for a long time, which can leave MDLog::trim() stuck. That means the MDS daemons keep writing journal logs into the metadata pool but cannot trim the expired segments in time. Or _submit_entry can keep acquiring the submit_mutex, and then the _submit_thread thread gets stuck for a while.

The second case can cause the current LogSegment's num_events to keep growing while ls->offset cannot be updated by _submit_thread until it later manages to acquire the submit_mutex. From the logs we can see that some events, such as the ESubtreeMap, can reach up to 1MB in size:

657940 2021-12-15T17:49:39.389+0800 146cb8a13700  5 mds.0.log _submit_thread 2480027743~41 : EImportFinish 0x200000000a3 success
657941 2021-12-15T17:49:39.400+0800 146cb8a13700  5 mds.0.log _submit_thread 2480027804~919866 : ESubtreeMap 1046 subtrees , 1044 ambiguous [metablob 0x1, 1047 dirs]
657942 2021-12-15T17:49:39.401+0800 146cb8a13700  5 mds.0.log _submit_thread 2480947690~41 : EImportFinish 0x2000000006c success
657943 2021-12-15T17:49:39.410+0800 146cb8a13700  5 mds.0.log _submit_thread 2480947751~918987 : ESubtreeMap 1045 subtrees , 1043 ambiguous [metablob 0x1, 1046 dirs]

So in the worst case an event's size could be larger than 1MB, for example 20MB. That means the LogSegment size could reach up to `20MB x 1024 = 20GB`, and the journal logs could then occupy `20GB x 128 = 2560GB` of disk space in the metadata pool.

So after switching the submit_mutex to a fair mutex [1] and with the improvement PR [2], it will be much better.

[1] : https://github.com/ceph/ceph/pull/44215
[2] : https://github.com/ceph/ceph/pull/44180

Actions #15

Updated by Xiubo Li over 2 years ago

More logs:

 8614 2021-12-16T10:02:42.596+0800 14c148b75700  5 mds.0.log _submit_thread 2829812324~1725281 : ESubtreeMap 1653 subtrees , 1651 ambiguous [metablob 0x1, 1973 dirs]
 8615 2021-12-16T10:02:42.597+0800 14c148b75700  5 mds.0.log _submit_thread 2831537625~520 : EImportStart 0x2000000000f from mds.1 [metablob 0x2000000000f, 1 dirs]
 8616 2021-12-16T10:02:42.624+0800 14c148b75700  5 mds.0.log _submit_thread 2831538165~1724401 : ESubtreeMap 1652 subtrees , 1650 ambiguous [metablob 0x1, 1972 dirs]

The `ESubtreeMap` event is inserted into each new LogSegment when starting a new one. The above logs were reproduced by creating 2000 directories using:

# ceph fs set a max_mds 2
# mkdir lxb; setfattr -n ceph.dir.pin -v 1 lxb
# for i in {0..2000}; do mkdir lxb/dir2_$i; setfattr -n ceph.dir.pin -v 1 lxb/dir2_$i; done
# ceph fs set a max_mds 1

If I create more directories, the size of the `ESubtreeMap` event will grow to more than 4MB, which means each LogSegment will contain only the `ESubtreeMap` event and nothing else. So this is buggy.
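
A rough estimate from the log lines above (my own arithmetic, not from the MDS code): ~1.7MB of ESubtreeMap payload for ~1653 subtrees is roughly 1KB per pinned subtree, so a few thousand pinned directories are already enough to push a single event past the 4MiB journal object size:

    // Back-of-the-envelope numbers taken from the "_submit_thread ... ESubtreeMap"
    // log line above; only an estimate, not derived from the MDS code.
    #include <iostream>

    int main() {
      const double esubtreemap_bytes = 1725281;  // event size from the log line
      const double subtrees = 1653;              // subtree count from the same line
      const double object_size = 4194304;        // 4 MiB journal object size

      const double bytes_per_subtree = esubtreemap_bytes / subtrees;  // ~1044 bytes
      std::cout << "approx bytes per subtree: " << bytes_per_subtree << "\n";
      // ~4017 subtrees would already exceed a single 4 MiB object
      std::cout << "subtrees to exceed one object: "
                << object_size / bytes_per_subtree << "\n";
      return 0;
    }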
