Bug #47843
Updated by Patrick Donnelly over 3 years ago
In multi MDS ceph cluster, first reduce max_mds,before this step is completed, restart one or more MDS immediately. The restarted MDS will remain in the "resolve" or "rejoin" state The produce steps are as follows: 1)There are 6 active MDS in ceph cluster(0,1,2,3,4,5) 2)set max_mds=3 3)restart mds.0,mds.1,mds.2 4)mds.2 remained in "reslove" state,Logs print:"still waiting for resolves (5)". Because mds.5 has stopped normally at this time. *Before:* <pre> [root@ceph1 ceph]# ceph fs status cephfs - 0 clients ====== +------+--------+---------+---------------+-------+-------+ | Rank | State | MDS | Activity | dns | inos | +------+--------+---------+---------------+-------+-------+ | 0 | active | ceph1-3 | Reqs: 0 /s | 30.3k | 77 | | 1 | active | ceph2-2 | Reqs: 0 /s | 253 | 13 | | 2 | active | ceph2 | Reqs: 0 /s | 462 | 13 | | 3 | active | ceph1-1 | Reqs: 0 /s | 10 | 13 | | 4 | active | ceph1 | Reqs: 0 /s | 10 | 13 | | 5 | active | ceph1-4 | Reqs: 0 /s | 0 | 0 | +------+--------+---------+---------------+-------+-------+ +------+----------+-------+-------+ | Pool | type | used | avail | +------+----------+-------+-------+ | meta | metadata | 382M | 565G | | data | data | 2070M | 565G | +------+----------+-------+-------+ +-------------+ | Standby MDS | +-------------+ | ceph2-1 | | ceph2-4 | | ceph2-3 | +-------------+ </pre> * Aefore reduce max_mds and restart some mds: <pre> mds:* [root@ceph1 ceph]# ceph fs status cephfs - 0 clients ====== +------+---------+---------+---------------+-------+-------+ | Rank | State | MDS | Activity | dns | inos | +------+---------+---------+---------------+-------+-------+ | 0 | rejoin | ceph2-3 | | 30.2k | 14 | | 1 | rejoin | ceph2-4 | | 244 | 4 | | 2 | resolve | ceph2-1 | | 454 | 5 | | 3 | active | ceph1-1 | Reqs: 0 /s | 10 | 13 | | 4 | active | ceph1 | Reqs: 0 /s | 10 | 13 | +------+---------+---------+---------------+-------+-------+ +------+----------+-------+-------+ | Pool | type | used | avail | 
+------+----------+-------+-------+
| meta | metadata |  382M |  565G |
| data |   data   | 2070M |  565G |
+------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph1-4   |
+-------------+
</pre>

*logs:*
<pre>
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 handle_mds_map state change up:replay --> up:resolve
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 resolve_start
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 reopen_log
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache rollback_uncommitted_fragments: 0 pending
2020-10-09 16:05:09.894 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.2407 resolve set is 2
2020-10-09 16:05:09.894 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache claim 0x102 []
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.5
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.2407 resolve set is 0,1,2
2020-10-09 16:05:50.855 7fcbf7d79700 7 mds.2.cache set_recovery_set 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700 1 mds.2.2407 recovery set is 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache claim 0x102 []
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4
2020-10-09 16:05:50.856 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.1
2020-10-09 16:05:50.856 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (0,3,4,5)
2020-10-09 16:05:50.860 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.0
2020-10-09 16:05:50.860 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,4,5)
2020-10-09 16:05:50.885 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.4
2020-10-09 16:05:50.885 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,5)
2020-10-09 16:05:50.913 7fcbf7d79700 7 mds.2.cache handle_resolve from mds.3
2020-10-09 16:05:50.914 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (5)
</pre>
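The log sequence suggests the hang mechanism: mds.2 builds its resolve gather set while 6 ranks exist (recovery set 0,1,3,4,5), then the mdsmap shrinks the recovery set to 0,1,3,4, but the gather still waits on rank 5, which has stopped and will never send a resolve. The following is a small, purely illustrative Python model of that behavior (class and method names loosely echo Ceph's MDCache `set_recovery_set`/`handle_resolve`/`maybe_resolve_finish`, but this is a sketch of the reported symptom, not Ceph code):

```python
# Illustrative model of the stuck resolve gather described in the logs.
# This is NOT Ceph source; it only reproduces the reported wait pattern.

class ResolveGather:
    def __init__(self, recovery_set):
        # Ranks we still need a resolve message from, captured when the
        # gather starts (here: the old 6-rank map, minus ourselves).
        self.need_resolve_from = set(recovery_set)

    def set_recovery_set(self, recovery_set):
        # In the reported scenario the gather set is not pruned when the
        # mdsmap shrinks, so a stopping rank (mds.5) stays in
        # need_resolve_from indefinitely.
        self.recovery_set = set(recovery_set)

    def handle_resolve(self, from_rank):
        # A peer's resolve message removes it from the wait set.
        self.need_resolve_from.discard(from_rank)

    def maybe_resolve_finish(self):
        if self.need_resolve_from:
            print("still waiting for resolves "
                  f"({','.join(map(str, sorted(self.need_resolve_from)))})")
            return False
        return True

# Replay the log sequence from mds.2's point of view:
g = ResolveGather({0, 1, 3, 4, 5})  # gather built while 6 ranks existed
g.set_recovery_set({0, 1, 3, 4})    # max_mds reduced; rank 5 is stopping
for rank in (1, 0, 4, 3):           # resolves arrive from surviving peers
    g.handle_resolve(rank)
done = g.maybe_resolve_finish()     # rank 5 never answers -> stuck
```

Running the replay prints `still waiting for resolves (5)`, matching the last log line: the gather can never complete because the stopped rank is still counted.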