Bug #52874
Monitor might crash after upgrade from ceph to 16.2.6
Description
The following assertion might fire in the monitor:
void FSMap::sanity() const
{
  ...
  if (info.state != MDSMap::STATE_STANDBY_REPLAY) {
    ...
  } else {
    ceph_assert(fs->mds_map.allows_standby_replay());
  }
  ...
}
It fires when the allow-standby-replay flag is set to false but some MDS daemons are still running in standby-replay mode.
The thing is that prior to Pacific, setting the flag to false did not force MDS daemons out of that mode.
Hence one can put the cluster (and the relevant MDS map) into an inconsistent state, which triggers the monitor assertion on upgrade.
The upgrade manual also does not require disabling standby-replay MDS daemons manually PRIOR to the monitor upgrade. According to the manual, the monitor upgrade is performed at stage 2, while actions on MDS daemons are at stage 5:
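To make the inconsistent state concrete, here is a minimal sketch of the condition the assertion guards against. It is not the Ceph code: `MdsState`, `MdsInfo`, `Filesystem` and `find_standby_replay_violations` are hypothetical, simplified stand-ins for the real `FSMap`/`MDSMap` types, and the function only reports offending daemons instead of asserting.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified model; the real types live in FSMap/MDSMap.
enum class MdsState { Active, Standby, StandbyReplay };

struct MdsInfo {
  std::string name;
  MdsState state;
};

struct Filesystem {
  bool allow_standby_replay;     // models the allow-standby-replay flag
  std::vector<MdsInfo> daemons;  // MDS daemons assigned to this file system
};

// Returns the names of daemons that are in standby-replay mode although
// the file system's flag forbids it. An empty result means the map is
// consistent with respect to the check quoted above.
std::vector<std::string> find_standby_replay_violations(const Filesystem& fs) {
  std::vector<std::string> offenders;
  if (fs.allow_standby_replay) {
    return offenders;  // flag is set, standby-replay daemons are legal
  }
  for (const auto& d : fs.daemons) {
    if (d.state == MdsState::StandbyReplay) {
      offenders.push_back(d.name);
    }
  }
  return offenders;
}
```

A map with the flag cleared and one daemon still in `StandbyReplay` yields a non-empty list, which corresponds to the state that would trip `ceph_assert` in the upgraded monitor.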
2. Upgrade monitors by installing the new packages and restarting the monitor daemons. For example, on each monitor host:
...
5. Upgrade all CephFS MDS daemons. For each CephFS file system,
1. Disable standby_replay:
Updated by Patrick Donnelly over 2 years ago
- Status changed from New to Triaged
- Priority changed from Normal to Urgent
- Target version set to v17.0.0
- Source set to Community (user)
- Backport set to pacific
- Component(FS) MDSMonitor added
Updated by Patrick Donnelly over 2 years ago
You can work around this problem by setting the following in ceph.conf (for the mons):
[mon]
mon_mds_skip_sanity = true
Thanks for the helpful bug report, I will work on a fix.
Updated by Patrick Donnelly over 2 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 43508
- Labels (FS) crash added
Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 2 years ago
- Copied to Backport #52998: pacific: Monitor might crash after upgrade from ceph to 16.2.6 added
Updated by Loïc Dachary over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".