Bug #46976

After restarting an mds, its standby-replay mds remained in the "resolve" state

Added by wei qiaomiao over 3 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
wei qiaomiao
Category:
-
Target version:
v16.0.0
% Done:
0%

Source:
Community (dev)
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
36632
Crash signature (v1):
Crash signature (v2):

Description

In a multi-MDS Ceph cluster with standby-replay enabled, after reducing the filesystem's MDS count and restarting an active MDS, its standby-replay MDS did not become active and remained in the "resolve" state. The issue can be reproduced with the following steps:
1. ceph fs set cephfs max_mds 6
2. ceph fs set cephfs allow_standby_replay true
3. ceph fs set cephfs max_mds 5    // reduce the MDS count
4. wait for the MDS count reduction to complete
5. restart any active MDS
6. ceph fs status

[root@host-192-168-10-241 ~]# ceph fs status
+------+----------------+----------------------+------------+-------+-------+
| Rank |     State      |         MDS          |  Activity  |  dns  |  inos |
+------+----------------+----------------------+------------+-------+-------+
|  0   |    resolve     | host-192-168-5-105-4 |            | 40.6k | 40.6k |
|  1   |     rejoin     | host-192-168-5-101-9 |            | 19.5k | 19.5k |
|  2   |     active     | host-192-168-5-105-2 | Reqs: 0 /s | 55.6k | 55.6k |
|  3   |     active     | host-192-168-5-101-3 | Reqs: 0 /s | 32.2k | 32.2k |
|  4   |     active     | host-192-168-5-105-1 | Reqs: 0 /s | 16.4k | 16.4k |
| 4-s  | standby-replay | host-192-168-5-104-2 | Evts: 0 /s | 7527  | 7530  |
+------+----------------+----------------------+------------+-------+-------+

Log from the restarted rank 0 MDS (mds.host-192-168-5-105-4):
2020-08-13 08:50:48.901 7f223e02c700 10 mds.host-192-168-5-105-4 handle_mds_map: handling map as rank 0
2020-08-13 08:50:48.901 7f223e02c700 7 mds.0.tableserver(snaptable) handle_mds_recovery mds.1
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.4109 resolve set is 0,1
2020-08-13 08:50:48.901 7f223e02c700 7 mds.0.cache set_recovery_set 1,2,3,4
2020-08-13 08:50:48.901 7f223e02c700 1 mds.0.4109 recovery set is 1,2,3,4
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache send_slave_resolves
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache send_subtree_resolves
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache claim 0x1 [0x100010e40e4.000001*,0x100010e40e4.000010*,0x100010e40e4.000011*,0x100010e40e4.000000*,0x100010e40e4.000100*,0x100010e40e4.101101*,0x100010e40e4.101111*,0x100010e40e4.111111*,0x100010e40e4.000111*,0x100010e40e4.001000*,0x100010e40e4.001010*,0x100010e40e4.001011*,0x100010e40e4.001101*,0x100010e40e4.010000*,0x100010e40e4.010010*,0x100010e40e4.010101*,0x100010e40e4.010110*,0x100010e40e4.011001*,0x100010e40e4.011010*,0x100010e40e4.011011*,0x100010e40e4.011100*,0x100010e40e4.011110*,0x100010e40e4.100000*,0x100010e40e4.100001*,0x100010e40e4.100011*,0x100010e40e4.100110*,0x100010e40e4.100111*,0x100010e40e4.101001*,0x100010e40e4.101011*,0x100010e40e4.101100*,0x100010e40e4.101110*,0x100010e40e4.110000*,0x100010e40e4.110011*,0x100010e40e4.110110*,0x100010e40e4.110111*,0x100010e40e4.111000*,0x100010e40e4.111001*,0x100010e40e4.111110*]
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache claim 0x100 []
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache sending subtee resolve to mds.1
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache sending subtee resolve to mds.2
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache sending subtee resolve to mds.3
2020-08-13 08:50:48.901 7f223e02c700 10 mds.0.cache sending subtee resolve to mds.4
2020-08-13 08:50:48.902 7f223e02c700 7 mds.0.cache handle_resolve from mds.1
2020-08-13 08:50:48.902 7f223e02c700 10 mds.0.cache maybe_resolve_finish still waiting for resolves (2,3,4,5)
2020-08-13 08:50:48.902 7f223e02c700 7 mds.0.cache handle_resolve from mds.4
......
2020-08-13 08:50:48.904 7f223e02c700 10 mds.0.cache maybe_resolve_finish still waiting for resolves (5)
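
The "still waiting for resolves (5)" line is the telling part: after max_mds was reduced to 5, only ranks 0-4 exist, yet the recovering rank is still waiting for a resolve from rank 5, which will never arrive, apparently because its set of expected peers was built before the reduction took effect. Below is a minimal, self-contained C++ sketch of that gating pattern; ResolveGate and its members are illustrative names only, not the actual MDCache code.

#include <iostream>
#include <set>

struct ResolveGate {
  std::set<int> resolve_gather;   // ranks we still expect a resolve message from

  void handle_resolve(int from) {
    std::cout << "handle_resolve from mds." << from << "\n";
    resolve_gather.erase(from);
    maybe_resolve_finish();
  }

  void maybe_resolve_finish() {
    if (!resolve_gather.empty()) {
      std::cout << "still waiting for resolves (";
      bool first = true;
      for (int r : resolve_gather) {
        std::cout << (first ? "" : ",") << r;
        first = false;
      }
      std::cout << ")\n";
      return;
    }
    std::cout << "all resolves received, moving on\n";
  }
};

int main() {
  // Resolve set built from a stale view of the cluster: it still lists
  // rank 5, although that rank was removed when max_mds was lowered to 5.
  ResolveGate gate{{1, 2, 3, 4, 5}};
  int order[] = {1, 4, 2, 3};
  for (int r : order)
    gate.handle_resolve(r);
  // Rank 5 never replies, so the gate never opens and the MDS stays in
  // the "resolve" state, matching the log above.
  return 0;
}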


Related issues

Copied to CephFS - Backport #47089: octopus: After restarting an mds, its standby-replay mds remained in the "resolve" state Resolved
Copied to CephFS - Backport #47090: nautilus: After restarting an mds, its standby-replay mds remained in the "resolve" state Resolved

History

#1 Updated by Zheng Yan over 3 years ago

  • Assignee set to Zheng Yan

#2 Updated by Zheng Yan over 3 years ago

MDSRank::calc_recovery_set() should be called from MDSRank::resolve_start()
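
For context, a hedged C++ sketch of that direction follows. It is not the code from PR 36632; the member and helper names used here (mdsmap, mdcache, whoami, get_recovery_mds_set(), set_recovery_set()) are assumptions about the surrounding MDSRank code, chosen to match the "recovery set is 1,2,3,4" line in the log above.

// Sketch only: recompute the recovery set from the MDSMap that is current
// when the rank enters resolve, instead of reusing a set computed earlier,
// which can still contain a rank removed by lowering max_mds.
void MDSRank::calc_recovery_set()
{
  std::set<mds_rank_t> rs;
  mdsmap->get_recovery_mds_set(rs);   // assumed helper: ranks that need recovery
  rs.erase(whoami);                   // never wait on ourselves
  mdcache->set_recovery_set(rs);      // produces the "recovery set is ..." log line
}

void MDSRank::resolve_start()
{
  // Calling calc_recovery_set() here means the set reflects only the ranks
  // that still exist after the max_mds reduction, so resolve can complete.
  calc_recovery_set();
  // ... continue with the existing resolve_start() logic ...
}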

#3 Updated by Zheng Yan over 3 years ago

  • Status changed from New to Fix Under Review
  • Assignee deleted (Zheng Yan)
  • Pull request ID set to 36632

#4 Updated by Patrick Donnelly over 3 years ago

  • Assignee set to wei qiaomiao
  • Target version set to v16.0.0
  • Source set to Community (dev)
  • Backport set to octopus,nautilus
  • Component(FS) MDS added

#5 Updated by Patrick Donnelly over 3 years ago

  • Status changed from Fix Under Review to Pending Backport

#6 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47089: octopus: After restarting an mds, its standby-replay mds remained in the "resolve" state added

#7 Updated by Nathan Cutler over 3 years ago

  • Copied to Backport #47090: nautilus: After restarting an mds, its standby-replay mds remained in the "resolve" state added

#8 Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
