Bug #9341

MDS: very slow rejoin

Added by Dmitry Smirnov over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had a fiasco trying to use CephFS as a network share: today a restart of the MDS (i.e. down time) took ~3 hours, most of which the MDS spent in the "rejoin" state.
Slowness of CephFS is not new, and to compensate for the long down time during an MDS restart I have two additional MDSes in "hot-standby" mode (so the cluster runs 1 up:active + 1 up:standby-replay + 1 up:standby).
The problem, however, is that as soon as the active MDS goes down (due to a server reboot etc.) the hot-standby MDS takes over and then spends hours in rejoin.
During rejoin CephFS is unavailable, mount points do not respond, and the OSDs are busy throwing "slow requests".
I only have ~700 GiB of data in CephFS (although the number of files has a greater impact on slowness than the total size of the data), and I fear that if I let the data grow to 10 times what I have now I might be facing ~30 hours of down time for every restart of the active MDS, which is just not acceptable...
My cluster is on 0.80.5 with 10 OSDs on 5 hosts connected by a dual gigabit network (in "balance-rr" bonding mode).
The OSDs are either hybrid SSHDs or rotational HDDs with journals on SSD. The "metadata" pool size is only 126M.
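
For context, the hot-standby arrangement above is driven by per-daemon options, roughly like the sketch below in ceph.conf (the daemon name "mds.b" is a placeholder, not one of my actual hosts; the third, plain standby needs no special options):

    [mds.b]
        mds standby replay   = true    # follow the active MDS's journal ("hot" standby)
        mds standby for rank = 0       # take over rank 0 when its MDS fails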

History

#1 Updated by Greg Farnum over 9 years ago

I wonder if this is the same as some of our own recent tickets (see #9264 and #8962). What size journal does the MDS have, and how many inodes are pinned in cache?
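
(Both can usually be pulled from the running daemon; a rough sketch, assuming an MDS named "mds.a" and the default admin socket path:)

    # Perf counters -- the "mds" section should include "inodes" and
    # "inodes_pinned", if those counters exist in this release:
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok perf dump

    # Journal-related settings currently in effect:
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config show | grep journal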

#2 Updated by Zheng Yan over 9 years ago

How many data files do you have? I guess the MDS is doing backpointer lookups; we can make the client trim its caps first, then reconnect to the MDS.

#3 Updated by Zheng Yan over 9 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

#4 Updated by Dmitry Smirnov over 9 years ago

To the best of my knowledge this has nothing to do with #9264 and #8962.
Here is MDS restart log:

2014-09-04 09:42:17.634587 7f7c7a98f700  1 mds.0.0 standby_replay_restart (as standby)
2014-09-04 09:42:17.636862 7f7c7ca94700  1 mds.0.0 replay_done (as standby)
2014-09-04 09:42:18.636939 7f7c7a98f700  1 mds.0.0 standby_replay_restart (as standby)
2014-09-04 09:42:18.965300 7f7c7ca94700  1 mds.0.0 replay_done (as standby)
2014-09-04 09:42:19.739000 7f7c7ca94700  1 mds.0.832 handle_mds_map i am now mds.0.832
2014-09-04 09:42:19.739001 7f7c7ca94700  1 mds.0.832 handle_mds_map state change up:standby-replay --> up:replay
2014-09-04 09:42:19.965370 7f7c7a98f700  1 mds.0.832 standby_replay_restart (as standby)
2014-09-04 09:42:19.968054 7f7c7ca94700  1 mds.0.832 replay_done (as standby)
2014-09-04 09:42:19.968057 7f7c7ca94700  1 mds.0.832 standby_replay_restart (final takeover pass)
2014-09-04 09:42:19.968059 7f7c7ca94700  1 mds.0.832  waiting for osdmap 66346 (which blacklists prior instance)
2014-09-04 09:42:20.189686 7f7c7ca94700  1 mds.0.832 replay_done
2014-09-04 09:42:20.189688 7f7c7ca94700  1 mds.0.832 making mds journal writeable
2014-09-04 09:42:21.007406 7f7c7ca94700  1 mds.0.832 handle_mds_map i am now mds.0.832
2014-09-04 09:42:21.007411 7f7c7ca94700  1 mds.0.832 handle_mds_map state change up:replay --> up:reconnect
2014-09-04 09:42:21.007414 7f7c7ca94700  1 mds.0.832 reconnect_start
2014-09-04 09:42:21.007416 7f7c7ca94700  1 mds.0.832 reopen_log
2014-09-04 09:42:21.026681 7f7c7ca94700  1 mds.0.server reconnect_clients -- 4 sessions
2014-09-04 09:42:21.033127 7f7c7847f700  0 -- 192.168.0.2:6800/28856 >> 192.168.0.7:0/566894525 pipe(0x7f7c89279180 sd=26 :6800 s=0 pgs=0 cs=0 l=0 c=0x7f7c864e82c0).accept peer addr is really 192.168.0.7:0/566894525 (socket is 192.168.0.7:43583/0)
2014-09-04 09:42:21.266324 7f7c7ca94700  0 log [DBG] : reconnect by client.9381018 192.168.0.204:0/6276 after 0.227250
2014-09-04 09:42:21.276778 7f7c7ca94700  0 log [DBG] : reconnect by client.10456039 192.168.0.7:0/566894525 after 0.250073
2014-09-04 09:42:22.951129 7f7c7ca94700  0 log [DBG] : reconnect by client.10275606 192.168.0.2:0/9234 after 1.924421
2014-09-04 09:42:24.523247 7f7c7ca94700  0 log [DBG] : reconnect by client.7755150 192.168.0.6:0/1819 after 3.496546
2014-09-04 09:42:24.582446 7f7c7ca94700  1 mds.0.832 reconnect_done
2014-09-04 09:42:25.707546 7f7c7ca94700  1 mds.0.832 handle_mds_map i am now mds.0.832
2014-09-04 09:42:25.707550 7f7c7ca94700  1 mds.0.832 handle_mds_map state change up:reconnect --> up:rejoin
2014-09-04 09:42:25.707552 7f7c7ca94700  1 mds.0.832 rejoin_start
2014-09-04 09:42:35.053804 7f7c7ca94700  1 mds.0.832 rejoin_joint_start
2014-09-04 09:42:59.353437 7f7c7978b700  0 -- 192.168.0.2:6800/28856 >> 192.168.0.204:0/6276 pipe(0x7f7c89258000 sd=25 :6800 s=2 pgs=11317 cs=1 l=0 c=0x7f7c84637340).fault with nothing to send, going to standby
2014-09-04 09:57:18.224970 7f7c77c77700  0 -- 192.168.0.2:6800/28856 >> 192.168.0.250:6813/12818 pipe(0x7f7c89239400 sd=28 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7d354d91e0).fault
2014-09-04 10:12:05.509808 7f7c77974700  0 -- 192.168.0.2:6800/28856 >> 192.168.0.2:6806/30203 pipe(0x7f7d31dae500 sd=30 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7d3e700420).fault
2014-09-04 12:30:47.116721 7f7c7ca94700  1 mds.0.832 rejoin_done
2014-09-04 12:30:47.898798 7f7c7ca94700  1 mds.0.832 handle_mds_map i am now mds.0.832
2014-09-04 12:30:47.898804 7f7c7ca94700  1 mds.0.832 handle_mds_map state change up:rejoin --> up:active
2014-09-04 12:30:47.898810 7f7c7ca94700  1 mds.0.832 recovery_done -- successful recovery!
2014-09-04 12:30:48.832845 7f7c7ca94700  1 mds.0.832 active_start
2014-09-04 12:30:48.896953 7f7c7ca94700  1 mds.0.832 cluster recovered.

How do I check the MDS journal size? And how do I find out how many inodes are pinned in cache?

This particular MDS is configured with:

    mds cache size             = 250000
    mds mem max                = 4194304     # KB

the other one has:

    mds cache size             = 1250000
    mds mem max                = 8388608     # KB

I started playing with "cache size" and "mem max" in the hope of accelerating restarts and/or operations, but I couldn't see any improvement from changing those settings.

There were ~633000 files at the time of the last MDS restart.

Now only 224655 files (7484 directories) are left after clean-up, and a full directory walk ("tree -a /mnt/ceph") takes 79 seconds.
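
In case it helps, this is roughly how I have been checking and adjusting the cache setting at runtime (a sketch; "mds.a", the socket path and the value are placeholders, and I am not certain injectargs applies this particular option without a restart):

    # current value, read straight from the admin socket
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok config show | grep mds_cache_size

    # attempt to change it on the fly
    ceph tell mds.0 injectargs '--mds_cache_size 500000'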

#5 Updated by Greg Farnum over 9 years ago

If you set the "mds dump cache after rejoin" config option to true it will dump out the cache to a file which we can parse (after finishing rejoin, obviously). If you turn up debug logging during the restart it will print the journal information, or it can probably be retrieved via one of the journal tools, but I don't remember what's in firefly and what's not.

This sort of debugging is much improved in the next release and we're continuing to work on it, fyi.
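
In ceph.conf terms that would be something like the sketch below (the debug level is just a suggestion, not a requirement):

    [mds]
        mds dump cache after rejoin = true    # write the cache to a file once rejoin completes
        debug mds                   = 10      # verbose enough to log journal/replay details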

#6 Updated by Sage Weil over 9 years ago

  • Status changed from New to Need More Info

#7 Updated by Zheng Yan over 9 years ago

Are you using the kernel client? If you are, please try the attached patch. I hope it will improve rejoin speed.

#8 Updated by Dmitry Smirnov over 9 years ago

Zheng Yan wrote:

Are you using the kernel client? If you are, please try the attached patch. I hope it will improve rejoin speed.

Most of the CephFS clients are fuse-driven, but a few are kernel clients.
I'd love to try your patch, but 0.80.5 does not have the "fs/ceph/mds_client.c" file. :(

#9 Updated by Zheng Yan over 9 years ago

That patch is for the kernel client. Here is the patch for ceph-fuse.

#10 Updated by Dmitry Smirnov over 9 years ago

I rebuilt and redeployed ceph with the fuse patch, reconfigured all kernel clients to use the fuse client, remounted CephFS on all clients and restarted the MDSes. Earlier I had removed about three quarters of the files, but rejoin still completed much faster than I expected -- within seconds instead of minutes:

2014-09-11 23:55:38.592971 7f7a23718700  1 mds.0.835 handle_mds_map state change up:reconnect --> up:rejoin
2014-09-11 23:55:38.592973 7f7a23718700  1 mds.0.835 rejoin_start
2014-09-11 23:55:38.593119 7f7a23718700  1 mds.0.835 rejoin_joint_start
2014-09-11 23:55:38.601407 7f7a23718700  1 mds.0.835 rejoin_done
2014-09-11 23:55:40.085693 7f7a23718700  1 mds.0.835 handle_mds_map i am now mds.0.835
2014-09-11 23:55:40.085701 7f7a23718700  1 mds.0.835 handle_mds_map state change up:rejoin --> up:active
2014-09-11 23:55:40.085707 7f7a23718700  1 mds.0.835 recovery_done -- successful recovery!
2014-09-11 23:55:40.086065 7f7a23718700  1 mds.0.835 active_start
2014-09-11 23:55:40.086364 7f7a23718700  1 mds.0.835 cluster recovered.

Well done and thank you very much for this dramatic improvement, Zheng.
Please backport.

#11 Updated by John Spray over 9 years ago

The userspace change and test for this are merged into master. Is the kernel side all done too?

#12 Updated by Dmitry Smirnov over 9 years ago

John Spray wrote:

The userspace change and test for this are merged into master. Is the kernel side all done too?

Sorry, I had no time to try custom kernel(s), so I'm not sure. So far I've just eliminated the kernel clients and replaced them with fuse clients. I won't be able to try a custom kernel any time soon on a live environment...

Many thanks for this fix -- it was tremendously helpful, at least on the fuse side.

#13 Updated by Dmitry Smirnov over 9 years ago

  • Status changed from Need More Info to Fix Under Review
  • Severity changed from 2 - major to 1 - critical

Please include this fix in 0.87, which is affected just as badly as 0.80.x.
On 0.87 the MDS gets stuck in "rejoin" for hours again -- so painful... :(

#14 Updated by Greg Farnum over 9 years ago

  • Status changed from Fix Under Review to Pending Backport

Hmm, we didn't put this in Giant initially because we were trying not to perturb it. Master hasn't been run through the nightlies too often since then so I don't feel great putting it in now either, but we can probably backport it for the follow-up point release.
Unless the RADOS team tells us it's going to be a while before the final Giant tag and we can put it in the nightlies long enough for release, but I hope we don't have that long to wait. ;)

#15 Updated by Dmitry Smirnov over 9 years ago

Greg Farnum wrote:

Hmm, we didn't put this in Giant initially because we were trying not to perturb it. Master hasn't been run through the nightlies too often since then so I don't feel great putting it in now either, but we can probably backport it for the follow-up point release.
Unless the RADOS team tells us it's going to be a while before the final Giant tag and we can put it in the nightlies long enough for release, but I hope we don't have that long to wait. ;)

So basically what you're saying is that this issue was not forgotten, but the fix was deliberately withheld to avoid delaying the release of Giant? That's exactly why releasing on schedule is so harmful in software development! We may disagree on the severity of this bug, but if an issue like this, which causes significant downtime, is not important enough to delay a release, then I don't know what kind of problem would qualify... I would delay a so-called "stable" release to accommodate this fix without hesitation...

#16 Updated by Greg Farnum over 9 years ago

  • Status changed from Pending Backport to Resolved

This is backported to giant as of commit:97e423f52155e2902bf265bac0b1b9ed137f8aa0. The test for it also got backported in ceph-qa-suite.
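
For anyone tracking the backport, a quick way to see which tag picked up the commit is something like this (run inside a clone of ceph.git; a sketch, assuming the relevant release tags are present locally):

    # prints the first tag that contains the backport commit
    git describe --contains 97e423f52155e2902bf265bac0b1b9ed137f8aa0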

#17 Updated by Dmitry Smirnov over 9 years ago

Thanks.
