Bug #47843
mds: stuck in resolve when restarting MDS and reducing max_mds
% Done: 0%
Source: Community (dev)
Tags:
Backport: pacific,octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
In a multi-MDS Ceph cluster, reduce max_mds and then, before the rank reduction has completed, immediately restart one or more MDS daemons. The restarted MDS can remain stuck in the "resolve" or "rejoin" state.
The reproduction steps are as follows (a command sketch follows the list):
1) The cluster has 6 active MDS ranks (0,1,2,3,4,5).
2) Set max_mds=3.
3) Restart mds.0, mds.1 and mds.2.
4) mds.2 remains in the "resolve" state and the log keeps printing "still waiting for resolves (5)", because mds.5 has already stopped cleanly by this point.
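A minimal command sketch of these steps, assuming the file system is named cephfs and the MDS daemons are managed by systemd on their respective hosts (the daemon names are the ones from the status output below; the exact restart mechanism is an assumption):

ceph fs status cephfs                 # step 1: six active ranks (0-5)
ceph fs set cephfs max_mds 3          # step 2: ranks 5, 4 and 3 will be stopped in turn
# step 3: before the shrink completes, restart the daemons holding ranks 0, 1 and 2
systemctl restart ceph-mds@ceph1-3    # rank 0
systemctl restart ceph-mds@ceph2-2    # rank 1
systemctl restart ceph-mds@ceph2      # rank 2
ceph fs status cephfs                 # step 4: the restarted ranks stay in resolve/rejoin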
Before reducing max_mds:
[root@ceph1 ceph]# ceph fs status
cephfs - 0 clients
======
+------+--------+---------+---------------+-------+-------+
| Rank | State  | MDS     | Activity      | dns   | inos  |
+------+--------+---------+---------------+-------+-------+
| 0    | active | ceph1-3 | Reqs: 0 /s    | 30.3k | 77    |
| 1    | active | ceph2-2 | Reqs: 0 /s    | 253   | 13    |
| 2    | active | ceph2   | Reqs: 0 /s    | 462   | 13    |
| 3    | active | ceph1-1 | Reqs: 0 /s    | 10    | 13    |
| 4    | active | ceph1   | Reqs: 0 /s    | 10    | 13    |
| 5    | active | ceph1-4 | Reqs: 0 /s    | 0     | 0     |
+------+--------+---------+---------------+-------+-------+
+------+----------+-------+-------+
| Pool | type     | used  | avail |
+------+----------+-------+-------+
| meta | metadata | 382M  | 565G  |
| data | data     | 2070M | 565G  |
+------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| ceph2-1     |
| ceph2-4     |
| ceph2-3     |
+-------------+
After reducing max_mds and restarting some MDS daemons:
[root@ceph1 ceph]# ceph fs status
cephfs - 0 clients
======
+------+---------+---------+---------------+-------+-------+
| Rank | State   | MDS     | Activity      | dns   | inos  |
+------+---------+---------+---------------+-------+-------+
| 0    | rejoin  | ceph2-3 |               | 30.2k | 14    |
| 1    | rejoin  | ceph2-4 |               | 244   | 4     |
| 2    | resolve | ceph2-1 |               | 454   | 5     |
| 3    | active  | ceph1-1 | Reqs: 0 /s    | 10    | 13    |
| 4    | active  | ceph1   | Reqs: 0 /s    | 10    | 13    |
+------+---------+---------+---------------+-------+-------+
+------+----------+-------+-------+
| Pool | type     | used  | avail |
+------+----------+-------+-------+
| meta | metadata | 382M  | 565G  |
| data | data     | 2070M | 565G  |
+------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| ceph1-4     |
+-------------+
Logs:
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 handle_mds_map state change up:replay --> up:resolve
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 resolve_start
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 reopen_log
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache rollback_uncommitted_fragments: 0 pending
2020-10-09 16:05:09.894 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.2407 resolve set is 2
2020-10-09 16:05:09.894 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache claim 0x102 []
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.5
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.2407 resolve set is 0,1,2
2020-10-09 16:05:50.855 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache claim 0x102 []
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4
2020-10-09 16:05:50.856 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.1
2020-10-09 16:05:50.856 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (0,3,4,5)
2020-10-09 16:05:50.860 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.0
2020-10-09 16:05:50.860 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,4,5)
2020-10-09 16:05:50.885 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.4
2020-10-09 16:05:50.885 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,5)
2020-10-09 16:05:50.913 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.3
2020-10-09 16:05:50.914 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (5)
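These messages show the hang: after mds.5 finishes stopping, mds.2 updates its recovery set from 0,1,3,4,5 to 0,1,3,4, yet maybe_resolve_finish keeps waiting for a resolve from rank 5, which will never arrive. The stuck state can be confirmed from the monitor side; a short sketch of the kind of commands used (file system name as in the example above):

ceph fs status cephfs    # rank 2 stays in resolve, ranks 0 and 1 in rejoin
ceph fs dump             # rank 5 has already been removed from the mdsmap
ceph health detail       # the filesystem is reported as degraded while ranks are not active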
History
#1 Updated by Patrick Donnelly almost 3 years ago
- Subject changed from multimds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state to mds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state
- Description updated (diff)
- Status changed from New to Fix Under Review
- Assignee set to wei qiaomiao
- Target version set to v16.0.0
- Source set to Community (dev)
- Backport set to octopus,nautilus
- Pull request ID set to 37701
#2 Updated by Patrick Donnelly almost 3 years ago
- Subject changed from mds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state to mds: stuck in resolve when restarting MDS and reducing max_mds
#3 Updated by Patrick Donnelly over 2 years ago
- Target version changed from v16.0.0 to v17.0.0
- Backport changed from octopus,nautilus to pacific,octopus,nautilus
#4 Updated by Sage Weil over 2 years ago
- Project changed from Ceph to CephFS