Bug #47582

closed

MDS failover takes 10-15 hours: Ceph MDS stays in "up:replay" state for hours

Added by Heilig IOS over 3 years ago. Updated over 3 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: Performance/Resource Usage
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a 9-node Ceph cluster running Ceph 15.2.5. The cluster has 175 OSDs (HDD) plus 3 NVMe devices used as a cache tier for the "cephfs_data" pool. CephFS pool info:

POOL                    ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
cephfs_data              1  350 TiB  179.53M  350 TiB  66.93     87 TiB
cephfs_metadata          3  3.1 TiB   17.69M  3.1 TiB   1.77     87 TiB

We use multiple active MDS instances: 3 active and 3 standby. Each MDS server has 128 GB of RAM, and "mds_cache_memory_limit" is set to 64 GB.
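For reference, a layout like this is typically set up along these lines; this is only a sketch, and the filesystem name "cephfs" and the exact byte value are assumptions rather than values taken from this cluster:

ceph fs set cephfs max_mds 3                              # number of active MDS ranks
ceph config set mds mds_cache_memory_limit 68719476736    # 64 GiB cache limit for all MDS daemons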

Failover to a standby MDS instance takes 10-15 hours, and CephFS is unreachable to clients for that entire time. The replacement MDS just stays in the "up:replay" state for the whole failover.
It looks like the MDS daemon is checking all of the folders:

2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EOpen.replay 
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay 3 dirlumps by unknown.0
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay dir 0x300000041c5
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x300000041c5 /repository/files/14/ [2,head] auth v=2070324 cv=0/0 state=1610612737|complete f(v0 m2020-09-10T13:05:29.297254-0700 515=0+515) n(v46584 rc2020-09-21T20:38:49.071043-0700 b3937793650802 1056114=601470+454644) hs=515+0,ss=0+0 dirty=75 | child=1 subtree=0 dirty=1 0x55d4c9359b80]
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay for [2,head] had [dentry #0x1/repository/files/14/14119 [2,head] auth (dversion lock) v=2049516 ino=0x30000812e2f state=1073741824 | inodepin=1 0x55db2463a1c0]
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 0x30000812e2f [...2,head] /repository/files/14/14119/ auth fragtree_t(*^3) v2049516 f(v0 m2020-09-18T10:17:53.379121-0700 13498=0+13498) n(v6535 rc2020-09-19T05:52:25.035403-0700 b272027384385 112669=81992+30677) (iversion lock) | dirfrag=8 0x55db24643000]
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay dir 0x30000812e2f.000*
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x30000812e2f.000* /repository/files/14/14119/ [2,head] auth v=77082 cv=0/0 state=1073741824 f(v0 m2020-09-18T10:17:53.371122-0700 1636=0+1636) n(v6535 rc2020-09-19T05:51:18.063949-0700 b33321023818 13707=9986+3721) hs=885+0,ss=0+0 | child=1 0x55db845bf080]
2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/repository/files/14/14119/39823 [2,head] auth NULL (dversion lock) v=0 ino=(nil) state=1073741888|bottomlru 0x55d82061a900]

We have millions of folders with millions of small files. Once the folder/subfolder scan is done, CephFS becomes active again.
We tried standby-replay; it helps, but it doesn't eliminate the root cause.
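While the MDS is in replay, its state can be followed from the cluster side. A minimal sketch (the daemon name is a placeholder):

ceph fs status                    # shows each rank and its current state, e.g. up:replay
ceph health detail                # shows any MDS-related health warnings
ceph daemon mds.<name> status     # on the MDS host: the state of that specific daemon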

Actions #1

Updated by Zheng Yan over 3 years ago

Were there any "mds behind on trim" warnings?

Actions #2

Updated by Heilig IOS over 3 years ago

I have this issue right now. No, there is no "mds behind on trim" warning.

Actions #3

Updated by Zheng Yan over 3 years ago

What is the value of the "mds_log_max_segments" config?

Actions #4

Updated by Heilig IOS over 3 years ago

Current value: mds_log_max_segments = 100000
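The running value can be checked roughly like this (the daemon name is a placeholder):

ceph config get mds mds_log_max_segments                  # value in the mon config database
ceph daemon mds.<name> config get mds_log_max_segments    # value the running daemon uses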

Actions #5

Updated by Zheng Yan over 3 years ago

Heilig IOS wrote:

Current value: mds_log_max_segments = 100000

That's the root cause. The value should be smaller than 1000.

Actions #6

Updated by Heilig IOS over 3 years ago

I decreased it with these commands:

ceph tell mds.* injectargs '--mds_log_max_segments=250'
ceph tell mds.* injectargs '--mds_log_max_expiring=200'
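injectargs only changes the running daemons. A persistent equivalent, assuming the centralized config database is used, would be roughly:

ceph config set mds mds_log_max_segments 250    # survives MDS restarts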

I have "mds behind on trim" warning now:

[WRN] MDS_TRIM: 3 MDSs behind on trimming
    mds1(mds.0): Behind on trimming (60410/250) max_segments: 250, num_segments: 60410
    mds2(mds.1): Behind on trimming (10195/250) max_segments: 250, num_segments: 10195

The affected MDS daemon is checking subfolders:

2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) Checking dentry 0x5558f0567200
2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) keeping [dir 0x3000081ac32 /repository/content/14/14000/48906/ [2,head] auth v=29 cv=0/0 dir_auth=1 state=1073741824 f(v0 m2020-09-10T13:34:44.304865-0700 1=0+1) n(v2 rc2020-09-14T16:04:37.570052-0700 b955844 3=2+1) hs=0+0,ss=0+0 | subtree=1 0x5558f2817080]
2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) Checking dentry 0x5558f05ed8c0
2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) keeping [dir 0x3000081ad5b /repository/content/14/14000/49677/ [2,head] auth v=38 cv=0/0 dir_auth=1 state=1073741824 f(v0 m2020-09-10T13:40:08.148212-0700 2=0+2) n(v2 rc2020-09-14T16:04:38.362053-0700 b6009 4=2+2) hs=0+0,ss=0+0 | subtree=1 0x5558f2863600]
2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) Checking dentry 0x5558f0612f40
2020-09-22T07:01:40.985-0700 7fad887c4700 10 mds.1.cache trim_non_auth_subtree(0x5558f0246100) keeping [dir 0x3000081adc8 /repository/content/14/14000/49953/ [2,head] auth v=21 cv=0/0 dir_auth=1 state=1073741824 f(v0 m2020-09-10T13:41:29.820031-0700 1=0+1) n(v2 rc2020-09-14T16:04:38.666054-0700 b160315 2=1+1) hs=0+0,ss=0+0 | subtree=1 0x5558f287d600]

And it is still in the up:replay state after 5 minutes. I don't see any difference so far.

Actions #7

Updated by Heilig IOS over 3 years ago

Still no change. Lowering "mds_log_max_segments" didn't help; the MDS failover has been running for 30 minutes already. What else can I check?

Actions #8

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Rejected
  • Target version deleted (v15.2.5)

(This discussion should move to ceph-users.)

Actions #9

Updated by Patrick Donnelly over 3 years ago

Heilig IOS wrote:

Still no change. Lowering "mds_log_max_segments" didn't help; the MDS failover has been running for 30 minutes already. What else can I check?

Decreasing this config only affects up:active operation. It won't help with a failover when the MDS journal is already very large. Once the failover completes, the MDS will begin trimming its journal, so future recoveries are faster.
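After the failover, trimming progress can be followed on the active MDS; a rough sketch (the daemon name is a placeholder and requires access to the admin socket on that host):

ceph daemon mds.<name> perf dump mds_log    # journal counters, including the current segment count
ceph health detail | grep -i trim           # num_segments should fall toward max_segments over time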

Actions #10

Updated by Heilig IOS over 3 years ago

Thank you for the provided information.

I will test the MDS failover in a day. A quick question about the "mds_log_max_segments" parameter: changing the value from 100000 to 250 increased the number of operations but cut the R/W bandwidth many times over.
For example, it was "250 MiB/s rd, 300 MiB/s wr" with "mds_log_max_segments = 100000", but it is just "5 MiB/s rd, 10 MiB/s wr" now ("mds_log_max_segments = 250"). Is there any setting that can improve the situation without touching "mds_log_max_segments"?
