Bug #62682
closed
mon: no mdsmap broadcast after "fs set joinable" is set to true
Description
archive_path: /home/teuthworker/archive/mchangir-2023-08-09_06:54:05-fs:upgrade-wip-mchangir-testing-20230808.041738-testing-default-smithi/7364226
The `fs set joinable true` command, when executed by the mgr, reaches the mon, but the mon fails to broadcast the mdsmap update, so all MDSs remain in up:standby for this specific run.
NOTE: This is an upgrade scenario
Here's the log from the mon which is handling the fs set joinable true command from the mgr:
2023-08-09T15:13:11.410+0000 7f062d02d700 10 mon.smithi125@0(leader).log v674 logging 2023-08-09T15:13:11.411369+0000 mon.smithi125 (mon.0) 679 : audit [INF] from='mgr.34104 172.21.15.125:0/679280427' entity='mgr.smithi125.nzjnwo' cmd=[{"prefix": "fs set", "fs_name": "cephfs", "var": "joinable", "val": "true"}]: dispatch
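For anyone scanning mon logs for the same symptom, here is a minimal sketch of pulling the command payload out of an audit entry like the one above. The regular expression and the `parse_audit` helper are illustrative assumptions written against this single log line, not a Ceph-provided parser.

```python
import json
import re

# Audit portion of the mon log line quoted in this ticket.
AUDIT_LINE = (
    "2023-08-09T15:13:11.411369+0000 mon.smithi125 (mon.0) 679 : audit [INF] "
    "from='mgr.34104 172.21.15.125:0/679280427' entity='mgr.smithi125.nzjnwo' "
    'cmd=[{"prefix": "fs set", "fs_name": "cephfs", "var": "joinable", "val": "true"}]: dispatch'
)

# Hypothetical pattern: it only assumes the entity='...' cmd=[...]: <stage>
# layout visible in the line above.
AUDIT_RE = re.compile(
    r"entity='(?P<entity>[^']+)' cmd=(?P<cmd>\[.*\]): (?P<stage>\w+)$"
)


def parse_audit(line):
    """Return the entity, decoded command list and stage, or None if no match."""
    match = AUDIT_RE.search(line)
    if match is None:
        return None
    return {
        "entity": match.group("entity"),
        "cmd": json.loads(match.group("cmd")),  # the payload is plain JSON
        "stage": match.group("stage"),
    }


entry = parse_audit(AUDIT_LINE)
print(entry["entity"], entry["cmd"][0]["var"], entry["cmd"][0]["val"], entry["stage"])
```

This confirms only that the mon dispatched the command from the mgr. It says nothing about whether a new map was subsequently published.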
Updated by Milind Changire 8 months ago
- Severity changed from 3 - minor to 1 - critical
Updated by Venky Shankar 8 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Patrick Donnelly
- Target version set to v19.0.0
- Backport set to quincy,reef
The upgrade process uses `fail_fs` which fails the file system and upgrades the MDSs without reducing max_mds to 1. I debugged this a bit with Milind and it does seem like the MDS did not receive the updated map and failed to transition to a rank.
Updated by Venky Shankar 8 months ago
Updated by Patrick Donnelly 8 months ago
- Related to Bug #62863: Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail added
Updated by Venky Shankar 8 months ago
- Related to Bug #62848: qa: fail_fs upgrade scenario hanging added
Updated by Patrick Donnelly 8 months ago
- Related to deleted (Bug #62863: Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail)
Updated by Patrick Donnelly 8 months ago
- Priority changed from High to Normal
Milind Changire wrote:
[...]
The `fs set joinable true` command, when executed by the mgr, reaches the mon, but the mon fails to broadcast the mdsmap update, so all MDSs remain in up:standby for this specific run.
The MDSs do not receive an updated broadcast because they have not been assigned to a file system; i.e. they are up:standby.
The real question is why the mons do not assign any of the standbys to ranks.
NOTE: This is an upgrade scenario
Here's the log from the mon which is handling the fs set joinable true command from the mgr:
[...]
A few issues:
- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.
- The problem seems to be that FSMap::get_available_standby is failing because:
The recent addition of the minor log segment incompat bit caused that check to fail.
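To illustrate how one extra incompat bit can leave every standby unassigned, here is a toy model of a feature-set check. It is not Ceph's implementation: `Standby`, `is_compatible`, `pick_standby` and the feature names are hypothetical. The ticket does not say which side carried the new bit in the failing run, so the direction shown here is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Standby:
    name: str
    incompat: frozenset  # incompat features this daemon advertises


def is_compatible(required: frozenset, supported: frozenset) -> bool:
    """Every required incompat feature must be supported by the daemon."""
    return required <= supported


def pick_standby(required: frozenset, standbys: Iterable[Standby]) -> Optional[Standby]:
    """Return the first standby passing the check, or None if none qualifies."""
    for standby in standbys:
        if is_compatible(required, standby.incompat):
            return standby
    return None


# Hypothetical feature names, used only for illustration.
base = frozenset({"base"})
extended = base | {"minor_log_segments"}

standbys = [Standby("a", base), Standby("b", base)]

before = pick_standby(base, standbys)      # feature sets match: a standby is chosen
after = pick_standby(extended, standbys)   # one extra bit: nobody qualifies

print(before, after)
```

In this model, a `None` result means no rank is ever filled even though the file system is joinable. That matches the symptom of all daemons staying in up:standby.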
I'll work on a fix for the second issue.
Updated by Patrick Donnelly 8 months ago
- Status changed from Triaged to Fix Under Review
- Source set to Q/A
- Severity changed from 1 - critical to 3 - minor
- Pull request ID set to 53600
Updated by Patrick Donnelly 8 months ago
Patrick Donnelly wrote:
- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.
Updated by Venky Shankar 7 months ago
Patrick Donnelly wrote:
Patrick Donnelly wrote:
- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.
Yeah, we discussed this in stand-up: upgrades need to span at most N-2 releases. Just for the record, this is a separate issue and is not what induces the missing mdsmap update we are seeing in the failed test.
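The release arithmetic behind the "N-3" remark can be sketched as follows. The major version numbers for pacific, quincy and reef are the released ones. Mapping `main` to 19 is an assumption based on this ticket's target version (v19.0.0), and the helper names are illustrative.

```python
# Major version per release; "main" is mapped to 19 as an assumption taken
# from the ticket's target version v19.0.0.
RELEASES = {"pacific": 16, "quincy": 17, "reef": 18, "main": 19}

MAX_UPGRADE_DISTANCE = 2  # the "N-2 releases max" rule from the comment above


def release_distance(src: str, dst: str) -> int:
    """Number of major releases between src and dst."""
    return RELEASES[dst] - RELEASES[src]


def upgrade_supported(src: str, dst: str) -> bool:
    """True when dst is at most MAX_UPGRADE_DISTANCE releases ahead of src."""
    return 0 <= release_distance(src, dst) <= MAX_UPGRADE_DISTANCE


for src in ("pacific", "quincy", "reef"):
    print(src, "-> main:", release_distance(src, "main"), upgrade_supported(src, "main"))
```

Under these assumptions, pacific to main is three releases apart and falls outside the rule, while quincy and reef to main fall inside it.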
Updated by Milind Changire 7 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 7 months ago
- Copied to Backport #63081: quincy: mon: no mdsmap broadcast after "fs set joinable" is set to true added
Updated by Backport Bot 7 months ago
- Copied to Backport #63082: reef: mon: no mdsmap broadcast after "fs set joinable" is set to true added
Updated by Patrick Donnelly 7 months ago
- Related to deleted (Bug #62848: qa: fail_fs upgrade scenario hanging)
Updated by Patrick Donnelly 7 months ago
- Has duplicate Bug #62848: qa: fail_fs upgrade scenario hanging added
Updated by Venky Shankar 6 months ago
- Status changed from Pending Backport to Resolved