Bug #9341

closed

MDS: very slow rejoin

Added by Dmitry Smirnov over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Severity: 1 - critical

Description

I had a fiasco trying to use CephFS as a network share: today a restart of the MDS (i.e. downtime) took ~3 hours, most of which the MDS spent in the "rejoin" state.
Slowness of CephFS is not new, and to compensate for the long downtime during an MDS restart I have two MDSes in "hot-standby" mode (1 up:active + 1 up:standby-replay + 1 up:standby).
The problem, however, is that as soon as the active MDS goes down (due to a server reboot, etc.), the standby-replay MDS takes over and then stays in rejoin for hours.
During rejoin CephFS is unavailable, mount points do not respond, and the OSDs are busy reporting "slow requests".
I only have ~700 GiB of data in CephFS (although the number of files has a greater impact on slowness than the total size of the data), and I fear that if I let the data grow to 10 times what I have now, I might face ~30 hours of downtime for every restart of the active MDS, which is just not acceptable...
My cluster is on 0.80.5 with 10 OSDs on 5 hosts connected by a dual gigabit network (in "balance-rr" bonding mode).
The OSDs are either hybrid SSHDs or rotational HDDs with journals on SSD. The "metadata" pool size is only 126M.
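
The ~30 hour figure above is a plain linear extrapolation from the one restart observed so far. A minimal sketch of that arithmetic follows; the linear-scaling model and the function name `estimate_rejoin_hours` are assumptions made for illustration, not measured MDS behavior, and the real cost may track file count rather than data size.

```python
def estimate_rejoin_hours(observed_hours: float, growth_factor: float) -> float:
    """Extrapolate rejoin downtime, ASSUMING it scales linearly with data growth.

    observed_hours: downtime measured for the current data set (~3 h here).
    growth_factor:  how many times larger the data set is expected to become.
    """
    if observed_hours < 0 or growth_factor < 0:
        raise ValueError("inputs must be non-negative")
    return observed_hours * growth_factor


# One observed restart took ~3 hours; projected for 10x the data.
projected = estimate_rejoin_hours(3.0, 10)
print(f"Projected rejoin downtime: ~{projected:.0f} hours")
```

If rejoin time turns out to depend mostly on the number of files, the same function can be fed the expected growth in file count instead of data size.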

