Bug #4405

closed

MDCache::populate_mydir can loop forever

Added by Ivan Kudryavtsev about 11 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
dumpling
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had an unusual MDS failure. My server's NIC started to flap, and as a result my Ceph FS started recovering and checking itself as if a new "cold" MDS had been started. The configuration is 1 active and 2 standby MDS servers.

It has been recovering for about 2 days already (I enabled debug mds = 10 in the config). Two more questions:

1) Is it normal that recovery of 200 GB takes about 2 days? Traffic to the MDS is about 30 MB/s, but load is low, and it seems to me the speed should be higher.
2) How can I increase the speed of recovery? And why does it need to recover at all?

I run ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)

Actions #1

Updated by Ivan Kudryavtsev about 11 years ago

Regarding question 2:

I increased the MDS cache size to

mds cache size = 100000000

and it started in seconds.

I don't know how this is connected, but I think it should be mentioned in the docs.
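
For reference, a minimal sketch of how the setting above could be made persistent, assuming the usual ceph.conf layout; this is an illustration, not taken from the thread, so check the documentation for your release:

# in ceph.conf on the MDS host, then restart the mds daemon
[mds]
    mds cache size = 100000000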

Actions #2

Updated by Ivan Kudryavtsev about 11 years ago

I think it's important to document some kind of metric so everyone can estimate the memory utilization of a given cache size. For example, I have 12 GB of RAM, so how big an MDS cache can I have?

Moreover, if the cache is so important to the MDS, maybe it should be backed by persistent storage such as BerkeleyDB or something similar, so the system still works even when RAM is small.
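
A rough way to reason about this, purely as an illustration: the thread gives no per-entry cost, so the figure below is an assumption, but if each cached inode plus dentry costs on the order of 4 KB and 8 of the 12 GB are reserved for the cache, that works out to roughly 2 million entries:

# back-of-the-envelope only; the 4 KB per cached entry figure is an assumption
echo $(( 8 * 1024 * 1024 / 4 ))    # KB of RAM / KB per entry = 2097152 entries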

Actions #3

Updated by Ivan Kudryavtsev about 11 years ago

Actually, regarding the original ticket description: I think the MDS gets into some kind of loop during startup when the cache size is small and the number of files is high. I don't know how else to explain why startup took two days in one case and 5-10 seconds in the other.

Actions #4

Updated by Ivan Kudryavtsev about 11 years ago

I mounted the Ceph root and counted the number of files; it's less than the default cache size of 100000.

s:/mnt/ceph# find | wc
  64397   64397 5936240

Actions #5

Updated by Sage Weil about 11 years ago

If the stuck startup is reproducible now (by lowering the cache size and restarting), a log with debug ms = 1 and debug mds = 20 would tell us exactly what the problem is. It shouldn't get stuck like that.

Also, yes, allowing memory to be tuned more sanely would definitely be a good thing, and is on the list. :)
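
A sketch of how the debug levels Sage mentions could be set, assuming the standard configuration and injection mechanisms; the exact command form is an assumption for this release rather than something stated in the thread:

# persistently, in ceph.conf on the MDS host (restart required)
[mds]
    debug ms = 1
    debug mds = 20

# or injected into a running mds
ceph mds tell 0 injectargs '--debug_ms 1 --debug_mds 20'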

Actions #7

Updated by Ivan Kudryavtsev about 11 years ago

The log was captured while it was stuck last time. I stopped the MDS, increased the log level, and started it again.

Actions #8

Updated by Greg Farnum about 11 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to High
Actions #9

Updated by Greg Farnum about 11 years ago

  • Assignee changed from Sage Weil to Greg Farnum
Actions #10

Updated by Sage Weil about 11 years ago

Hi Ivan-

Looking at the log, it looks like all 3 times the MDS started up it came up within 5 seconds or so. Do you have, or can you capture, a log that covers the period where it wouldn't come up? (Or was that what you saw when you were generating this log?)

Thanks!

Actions #11

Updated by Ivan Kudryavtsev about 11 years ago

OK, I'm not sure what you mean by "start". But for the whole time the MDS was running with
debug ms = 1 and debug mds = 20, no client was able to connect to it, and that was much more than 5 seconds, as you can see from the log. For me, "start" is the time from launch until clients are able to connect. So from the first launch with debugging until the third launch without it, the MDS was unavailable. Then I set mds cache size = 100000000 and it started to handle clients. That's what I mean.

With the previous settings, during startup it writes messages like this:


2013-03-10 13:53:42.737964 7f0d204f8700 10 mds.0.journal EMetaBlob.replay added [inode 100002b5ceb [2,head] /hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/database/user-elements.php auth v12 s=240 n(v0 b240 1=1+0) (iversion lock) cr={10839=0-4194304@1} 0x42eeba0]
2013-03-10 13:53:42.737973 7f0d204f8700 10 mds.0.journal EMetaBlob.replay added [dentry #1/hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/database/clone.php [2,head] auth NULL (dversion lock) v=10 inode=0 0x43a1950]
2013-03-10 13:53:42.737988 7f0d204f8700 10 mds.0.journal EMetaBlob.replay added [inode 100002b5cec [2,head] /hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/database/clone.php auth v10 s=1742 n(v0 b1742 1=1+0) (iversion lock) cr={10839=0-4194304@1} 0x42ef400]
2013-03-10 13:53:42.738002 7f0d204f8700 10 mds.0.journal EMetaBlob.replay dir 100002b5cee
2013-03-10 13:53:42.738005 7f0d204f8700 10 mds.0.cache force_dir_fragment * on [inode 100002b5cee [...2,head] /hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/change-password/ auth v224 f(v0 m2013-02-21 18:25:06.743720 2=2+0) n(v0 rc2013-02-21 18:25:06.743720 b712 3=2+1) (iversion lock) 0x42e54e0]
2013-03-10 13:53:42.738020 7f0d204f8700 10 mds.0.cache force_dir_fragment no frags under *
2013-03-10 13:53:42.738030 7f0d204f8700 10 mds.0.journal EMetaBlob.replay added dir [dir 100002b5cee /hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/change-password/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x41cdac8]
2013-03-10 13:53:42.738039 7f0d204f8700 10 mds.0.journal EMetaBlob.replay updated dir [dir 100002b5cee /hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/change-password/ [2,head] auth v=9 cv=0/0 state=1073741824 f(v0 m2013-02-21 18:25:06.743720 2=2+0) n(v0 rc2013-02-21 18:25:06.743720 b712 2=2+0) hs=0+0,ss=0+0 0x41cdac8]
2013-03-10 13:53:42.738065 7f0d204f8700 10 mds.0.journal EMetaBlob.replay added [dentry #1/hosting/cloud-1/configuration/opt/psa/admin/smb/application/resources/languages/zh-TW/components/forms/change-password/change.php [2,head] auth NULL (dversion lock) v=6 inode=0 0x43a1b28]

to the log and doesn't handle clients.

Actions #12

Updated by Ivan Kudryavtsev about 11 years ago

What's interesting is that the whole time, the MDS server had incoming traffic of ~40 MB/s but no active clients. I first noticed it after my NIC flapped, and then saw it during every start until I increased the cache size.

Actions #13

Updated by Greg Farnum about 11 years ago

  • Subject changed from MDS failure caused Filesystem recovery to MDCache::populate_mydir can loop forever
  • Priority changed from High to Normal

Sorry this got dropped on the floor. I found the problems.

The MDS never finishes the "populate_mydir()" function call; it keeps waiting for things and then being recalled pretty continuously. I see that trim thinks it has ~400k dentries in cache while it is limited to 100k, so I think it's reading in too-large directories and then trimming before populate_mydir gets to do anything, and that kicks off another read, which kicks off another trim.
Meanwhile, since that never finishes, the clients never get answers to their requests, but the MDS keeps heartbeating the monitors, so it doesn't get killed.

What's your workload on this cluster? Have you used snapshots?
I ask because it looks like you've got 49439 "stray" inodes (inodes which have been deleted but might still be open somewhere). Of those, 49436 are directories...how does that even happen? Sage suggests our directory deletion might just be broken. It's certainly not optimal in other ways, but in skimming it I don't see any obvious breakage regarding stray directory purging.

(Lowering to "normal" as there's a workaround for now.)
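
For anyone hitting this before the fix, the workaround referred to above is the one Ivan already described: raising the cache limit so trimming stops racing populate_mydir. A sketch of applying it to a running daemon (the command form is an assumption for this release; the persistent ceph.conf form is shown in an earlier note):

ceph mds tell 0 injectargs '--mds_cache_size 100000000'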

Actions #14

Updated by Ivan Kudryavtsev about 11 years ago

1) I don't use filesystem snapshots at all.
2) I do have 3 big directories with 40000 files in total.
3) Some days ago I tried to copy our corporate /home and mailboxes to Ceph, exported over NFS, and got a lot of hangs. I wrote about it in #3935, and I think that is the only way I could have ended up with 49436 directories, because the current Ceph FS doesn't contain that many directories. After /home and the mailboxes were removed from Ceph, the filesystem was unmounted multiple times, with client reboots and so on.

If you give me a clue about what to look for, I will certainly do it.

Actions #15

Updated by Greg Farnum about 11 years ago

If you run "ceph mds 0 dumpcache <filename>" then the MDS will dump everything it has in cache to the filename you specify. If you can attach that here then we can see how many inodes it's got in cache now, and if there are still a bunch of stray inodes then that will contain clues about why.

Actions #16

Updated by Ivan Kudryavtsev about 11 years ago

root@node-04:~# ceph mds 0 dumpcache mds-0.dump
unrecognized command
root@tsk-vps-node-04:~# ceph -v
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
root@node-04:~# 

Actions #17

Updated by Ivan Kudryavtsev about 11 years ago

Ok, I did

ceph mds tell 0 dumpcache /tmp/dump.txt

http://91.226.13.93/dump.txt.gz

Actions #18

Updated by Greg Farnum about 11 years ago

This dump has 1063591 inodes in the cache, of which only 122104 are non-stray. That doesn't seem quite right.

I do see that on a spot-check all the stray inodes have their version locks held, but not apparently by a client; my guess is that we have a leak somewhere. Going to spend a little time looking into that now.
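
As a rough illustration of how counts like these could be pulled out of the dumpcache file: this assumes stray entries show up in the dump as inode lines whose path passes through the MDS's stray directory (e.g. ~mds0/stray), which may not hold exactly for every release:

zcat dump.txt.gz | grep -c '\[inode '                       # total inodes in cache
zcat dump.txt.gz | grep '\[inode ' | grep -c '~mds0/stray'  # of those, strays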

Actions #19

Updated by Greg Farnum almost 11 years ago

  • Assignee deleted (Greg Farnum)
Actions #20

Updated by Greg Farnum over 10 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Zheng Yan
  • Backport set to dumpling

He's got a patch in wip-4405 that he thinks would solve the loading problem; I think he might be right. The other issues will get their own tickets if we see them again (there have been a bunch of fixes since then!).

Actions #21

Updated by Sage Weil over 10 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #22

Updated by Greg Farnum over 10 years ago

  • Status changed from Pending Backport to Resolved

Cherry-picked to dumpling branch in commit:299ddd31b29e332dc5e76bc4f871e4769698665d
