Bug #40001

mds cache oversize after restart

Added by Yunzhi Cheng almost 5 years ago. Updated almost 4 years ago.

Status: Rejected
Priority: High
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.1

We have 3 MDS daemons under heavy load (creating 8k files per second).

All 3 MDS daemons run under a 30G memory limit. When I restart one of them, the standby MDS takes over and goes through rejoin, but its memory grows very large, finally reaching almost 70G.

I have 1000 directories, and every directory has 70000 files.
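
For context, the 30G limit here presumably refers to mds_cache_memory_limit. A minimal sketch of how that limit is typically set and how live cache usage can be inspected; the daemon name mds.rndcl94 is only an example taken from the ceph -s output later in this ticket:

# set the MDS cache memory target cluster-wide (value in bytes, 30 GiB here)
ceph config set mds mds_cache_memory_limit 32212254720

# inspect live cache usage and mempool consumption on one daemon
ceph daemon mds.rndcl94 cache status
ceph daemon mds.rndcl94 dump_mempools

Note that mds_cache_memory_limit is a target for cache memory, not a hard cap on process RSS, so the daemon can overshoot it, particularly during rejoin.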

Actions #1

Updated by Yunzhi Cheng almost 5 years ago

I set debug_mds to 20/20, and almost all of the log lines look like this:

2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) _fetched pos 217 marker 'I' dname 'abcd.ab.csv.gz [2,head]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup (head, 'abcd.ab.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*)   miss -> (a.T.quote.csv.gz,head)
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup_exact_snap (head, 'abcd.ab.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) add_primary_dentry [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/abcd.ab.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f9b state=1073741824 0x1431b72c0]
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) _fetched  got [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/abcd.ab.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f9b state=1073741824 0x1431b72c0] [inode 0x300001f5f9b [2,head] /aaa/bbb/ccc/ddd/eee/fff/2016/12/07/abcd.ab.csv.gz auth v42918 s=17078 n(v0 rc2019-05-22 13:22:53.975211 b17078 1=1+0) (iversion lock) 0x1431ba700]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) _fetched pos 216 marker 'I' dname 'bbbb.bb.csv.gz [2,head]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup (head, 'bbbb.bb.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*)   miss -> (b.T.quote.csv.gz,head)
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup_exact_snap (head, 'abcd.KQ.others.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) add_primary_dentry [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/bbbb.bb.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f8f state=1073741824 0x1431b74a0]
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) _fetched  got [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/bbbb.bb.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f8f state=1073741824 0x1431b74a0] [inode 0x300001f5f8f [2,head] /aaa/bbb/ccc/ddd/eee/fff/2016/12/07/bbbb.bb.csv.gz auth v42102 s=1878 n(v0 rc2019-05-22 13:22:53.219209 b1878 1=1+0) (iversion lock) 0x1431bae00]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) _fetched pos 215 marker 'I' dname 'xxxx.xx.corr.csv.gz [2,head]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup (head, 'xxxx.xx.corr.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*)   miss -> (c.KS.corr.csv.gz,head)
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup_exact_snap (head, 'abcd.KS.corr.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) add_primary_dentry [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/xxxx.xx.corr.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f95 state=1073741824 0x1431b7680]
2019-05-22 23:32:13.588 7f4c9f624700 12 mds.0.cache.dir(0x100012566e9.010*) _fetched  got [dentry #0x1/aaa/bbb/ccc/ddd/eee/fff/2016/12/07/xxxx.xx.corr.csv.gz [2,head] auth (dversion lock) pv=0 v=178232 ino=0x300001f5f95 state=1073741824 0x1431b7680] [inode 0x300001f5f95 [2,head] /aaa/bbb/ccc/ddd/eee/fff/2016/12/07/xxxx.xx.corr.csv.gz auth v25438 s=200 n(v0 rc2019-05-22 13:22:53.227209 b200 1=1+0) (iversion lock) 0x1431bb500]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) _fetched pos 214 marker 'I' dname 'abcd.KS.csv.gz [2,head]
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup (head, 'abcd.KS.csv.gz')
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*)   miss -> (d.T.auct.csv.gz,head)
2019-05-22 23:32:13.588 7f4c9f624700 20 mds.0.cache.dir(0x100012566e9.010*) lookup_exact_snap (head, 'xxxx.xx.csv.gz')
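
For reference, a minimal sketch of how the debug level mentioned above can be raised and reverted; the daemon name is only an example:

# raise MDS debug logging for all MDS daemons via the central config
ceph config set mds debug_mds 20/20

# or only on one daemon through its admin socket
ceph daemon mds.rndcl94 config set debug_mds 20/20

# revert afterwards, since 20/20 is extremely verbose
ceph config set mds debug_mds 1/5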
Actions #2

Updated by Patrick Donnelly almost 5 years ago

  • Priority changed from Normal to High
  • Target version changed from v14.2.1 to v15.0.0
  • Start date deleted (05/22/2019)
  • Backport set to nautilus
  • ceph-qa-suite deleted (fs)
  • Labels (FS) multimds added

Are you using snapshots? Can you tell us more about how the cluster is being used, such as the number of clients and their versions?

Actions #3

Updated by Yunzhi Cheng almost 5 years ago

Patrick Donnelly wrote:

Are you using snapshots? Can you tell us more about how the cluster is being used, such as the number of clients and their versions?

I'm not using snapshots.

ceph -s:

  cluster:
    id:     f41c780b-a413-4db5-8bc3-2cd7e81bc275
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum rndcl94,rndcl106,rndcl154 (age 3d)
    mgr: rndcl154(active, since 8d), standbys: rndcl106, rndcl94
    mds: cephfs:3 {0=rndcl94=up:active,1=rndcl118=up:active,2=rndcl154=up:active} 1 up:standby
    osd: 24 osds: 24 up (since 7d), 24 in (since 7d)

  data:
    pools:   3 pools, 385 pgs
    objects: 123.24M objects, 1.8 TiB
    usage:   27 TiB used, 47 TiB / 75 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep

All the clients are kernel clients, and the kernel version is 4.14.35-041435.

The client systems run Ubuntu 14.04 and the server systems run Ubuntu 16.04.
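
A sketch of how the client count and client versions can be pulled from a running MDS, again assuming the example daemon name mds.rndcl94 from the ceph -s output above; the exact JSON fields vary by release:

# each session entry carries client_metadata such as hostname and kernel_version
ceph daemon mds.rndcl94 session ls

# count the sessions (requires jq)
ceph daemon mds.rndcl94 session ls | jq length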

Actions #4

Updated by Zheng Yan almost 5 years ago

Please check if these dirfrag fetches are from open_file_table.

Actions #5

Updated by Yunzhi Cheng almost 5 years ago

Zheng Yan wrote:

Please check if these dirfrag fetches are from open_file_table.

How can I figure out if they are from open_file_table?
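
One way this is often checked (an assumption, not an official procedure): the open file table is persisted as omap entries on objects named mds<rank>_openfiles.<N> in the metadata pool, so a very large table shows up as a large omap key count there. Pool and object names below are examples:

# list the open file table objects for all ranks
rados -p cephfs_metadata ls | grep openfiles

# count the entries recorded for rank 0
rados -p cephfs_metadata listomapkeys mds0_openfiles.0 | wc -l

If that count is close to the number of files that were open before the restart, the rejoin-time dirfrag fetches are very likely driven by open file table replay.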

Actions #6

Updated by Patrick Donnelly over 4 years ago

  • Target version deleted (v15.0.0)
Actions #7

Updated by Milind Changire almost 4 years ago

Yunzhi,
What is the value of the config option 'mds_cache_memory_limit' on the system?
Are you referring to this option when you say the MDS is under a 30G memory limit?
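
A minimal sketch of how that value can be checked, with an example daemon name:

# effective value on a running daemon
ceph daemon mds.rndcl94 config get mds_cache_memory_limit

# any override stored in the central config database
ceph config dump | grep mds_cache_memory_limit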

Actions #8

Updated by Milind Changire almost 4 years ago

  • Assignee set to Milind Changire
Actions #9

Updated by Patrick Donnelly almost 4 years ago

  • Status changed from New to Rejected

This ticket has become stale. Closing.
