Bug #47843

mds: stuck in resolve when restarting MDS and reducing max_mds

Added by wei qiaomiao over 3 years ago. Updated almost 3 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
pacific,octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In a multi-MDS Ceph cluster, reduce max_mds and then, before that step has completed, immediately restart one or more MDS daemons. The restarted MDS will remain stuck in the "resolve" or "rejoin" state.
The reproduction steps are as follows (a shell sketch of these steps is given after the list):
1) There are 6 active MDS ranks in the cluster (0,1,2,3,4,5).
2) Set max_mds=3.
3) Restart mds.0, mds.1 and mds.2.
4) mds.2 remains in the "resolve" state and its log prints "still waiting for resolves (5)", because mds.5 has already stopped cleanly by this time.
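
A minimal shell sketch of these steps, assuming the daemons are managed by systemd (the file system name and the MDS names for ranks 0-2 are taken from the status output below); the restarts must be issued before the monitors finish stopping ranks 3-5:

[root@ceph1 ceph]# ceph fs set cephfs max_mds 3
[root@ceph1 ceph]# systemctl restart ceph-mds@ceph1-3   # rank 0
[root@ceph1 ceph]# systemctl restart ceph-mds@ceph2-2   # rank 1
[root@ceph1 ceph]# systemctl restart ceph-mds@ceph2     # rank 2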
Before reducing max_mds and restarting:

[root@ceph1 ceph]# ceph fs status
cephfs - 0 clients
======
+------+--------+---------+---------------+-------+-------+
| Rank | State  |   MDS   |    Activity   |  dns  |  inos |
+------+--------+---------+---------------+-------+-------+
|  0   | active | ceph1-3 | Reqs:    0 /s | 30.3k |   77  |
|  1   | active | ceph2-2 | Reqs:    0 /s |  253  |   13  |
|  2   | active |  ceph2  | Reqs:    0 /s |  462  |   13  |
|  3   | active | ceph1-1 | Reqs:    0 /s |   10  |   13  |
|  4   | active |  ceph1  | Reqs:    0 /s |   10  |   13  |
|  5   | active | ceph1-4 | Reqs:    0 /s |    0  |    0  |
+------+--------+---------+---------------+-------+-------+
+------+----------+-------+-------+
| Pool |   type   |  used | avail |
+------+----------+-------+-------+
| meta | metadata |  382M |  565G |
| data |   data   | 2070M |  565G |
+------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph2-1   |
|   ceph2-4   |
|   ceph2-3   |
+-------------+

After reducing max_mds and restarting some MDS:

[root@ceph1 ceph]# ceph fs status
cephfs - 0 clients
======
+------+---------+---------+---------------+-------+-------+
| Rank |  State  |   MDS   |    Activity   |  dns  |  inos |
+------+---------+---------+---------------+-------+-------+
|  0   |  rejoin | ceph2-3 |               | 30.2k |   14  |
|  1   |  rejoin | ceph2-4 |               |  244  |    4  |
|  2   | resolve | ceph2-1 |               |  454  |    5  |
|  3   |  active | ceph1-1 | Reqs:    0 /s |   10  |   13  |
|  4   |  active |  ceph1  | Reqs:    0 /s |   10  |   13  |
+------+---------+---------+---------------+-------+-------+
+------+----------+-------+-------+
| Pool |   type   |  used | avail |
+------+----------+-------+-------+
| meta | metadata |  382M |  565G |
| data |   data   | 2070M |  565G |
+------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
|   ceph1-4   |
+-------------+

logs:

2020-10-09 16:05:09.894 7fcbf7d79700  1 mds.2.2407 handle_mds_map state change up:replay --> up:resolve
2020-10-09 16:05:09.894 7fcbf7d79700  1 mds.2.2407 resolve_start
2020-10-09 16:05:09.894 7fcbf7d79700  1 mds.2.2407 reopen_log
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache rollback_uncommitted_fragments: 0 pending
2020-10-09 16:05:09.894 7fcbf7d79700  7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700  1 mds.2.2407  recovery set is 0,1,3,4,5

2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.2407  resolve set is 2
2020-10-09 16:05:09.894 7fcbf7d79700  7 mds.2.cache set_recovery_set 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700  1 mds.2.2407  recovery set is 0,1,3,4,5
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache  claim 0x102 []
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4
2020-10-09 16:05:09.894 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.5

2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.2407  resolve set is 0,1,2
2020-10-09 16:05:50.855 7fcbf7d79700  7 mds.2.cache set_recovery_set 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700  1 mds.2.2407  recovery set is 0,1,3,4
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_slave_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache send_subtree_resolves
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache  claim 0x102 []
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.0
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.1
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.3
2020-10-09 16:05:50.855 7fcbf7d79700 10 mds.2.cache sending subtee resolve to mds.4

2020-10-09 16:05:50.856 7fcbf7d79700  7 mds.2.cache handle_resolve from mds.1
2020-10-09 16:05:50.856 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (0,3,4,5)
2020-10-09 16:05:50.860 7fcbf7d79700  7 mds.2.cache handle_resolve from mds.0
2020-10-09 16:05:50.860 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,4,5)
2020-10-09 16:05:50.885 7fcbf7d79700  7 mds.2.cache handle_resolve from mds.4
2020-10-09 16:05:50.885 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (3,5)
2020-10-09 16:05:50.913 7fcbf7d79700  7 mds.2.cache handle_resolve from mds.3
2020-10-09 16:05:50.914 7fcbf7d79700 10 mds.2.cache maybe_resolve_finish still waiting for resolves (5)
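
The log shows the pattern: when mds.2 enters up:resolve its recovery set is taken from the MDSMap as 0,1,3,4,5, and maybe_resolve_finish waits for a resolve message from every rank in that set. Rank 5 has already stopped cleanly as part of the max_mds reduction, so it never sends one; and although the recovery set later shrinks to 0,1,3,4 at 16:05:50, the resolve gather apparently still counts rank 5, so mds.2 waits indefinitely. The following is an illustration only (not the Ceph source), assuming a gather set that is not trimmed when a rank leaves the recovery set:

// Illustration only (not the Ceph source): models a resolve gather that is
// never trimmed when a rank leaves the recovery set.
#include <iostream>
#include <set>

int main() {
  // From the log: "recovery set is 0,1,3,4,5" when mds.2 enters up:resolve.
  std::set<int> resolve_gather = {0, 1, 3, 4, 5};

  // Resolves actually received, in log order: mds.1, mds.0, mds.4, mds.3.
  for (int from : {1, 0, 4, 3})
    resolve_gather.erase(from);

  // Rank 5 stopped cleanly during the max_mds reduction and never sends a
  // resolve; unless the gather set is pruned when the MDSMap changes, the
  // daemon keeps reporting the remaining rank forever.
  if (!resolve_gather.empty()) {
    std::cout << "still waiting for resolves (";
    for (int rank : resolve_gather)
      std::cout << rank;
    std::cout << ")" << std::endl;
  }
}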

History

#1 Updated by Patrick Donnelly over 3 years ago

  • Subject changed from multimds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state to mds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state
  • Description updated (diff)
  • Status changed from New to Fix Under Review
  • Assignee set to wei qiaomiao
  • Target version set to v16.0.0
  • Source set to Community (dev)
  • Backport set to octopus,nautilus
  • Pull request ID set to 37701

#2 Updated by Patrick Donnelly over 3 years ago

  • Subject changed from mds: restart an mds after reduce max_mds, the restartd mds will remained in "resolve" or "rejoin" state to mds: stuck in resolve when restarting MDS and reducing max_mds

#3 Updated by Patrick Donnelly about 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport changed from octopus,nautilus to pacific,octopus,nautilus

#4 Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to CephFS
