Bug #62682

mon: no mdsmap broadcast after "fs set joinable" is set to true

Added by Milind Changire 8 months ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
quincy,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDSMonitor
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

archive_path: /home/teuthworker/archive/mchangir-2023-08-09_06:54:05-fs:upgrade-wip-mchangir-testing-20230808.041738-testing-default-smithi/7364226

The `fs set joinable true` command, when executed by the mgr, reaches the mon, but the mon fails to broadcast the mdsmap update, leaving all MDSs in up:standby for this specific run.

NOTE: This is an upgrade scenario

Here's the log from the mon that is handling the `fs set joinable true` command from the mgr:

2023-08-09T15:13:11.410+0000 7f062d02d700 10 mon.smithi125@0(leader).log v674 logging 2023-08-09T15:13:11.411369+0000 mon.smithi125 (mon.0) 679 : audit [INF] from='mgr.34104 172.21.15.125:0/679280427' entity='mgr.smithi125.nzjnwo' cmd=[{"prefix": "fs set", "fs_name": "cephfs", "var": "joinable", "val": "true"}]: dispatch


Related issues (3: 0 open, 3 closed)

Has duplicate: CephFS - Bug #62848: qa: fail_fs upgrade scenario hanging (Duplicate, Patrick Donnelly)

Copied to: CephFS - Backport #63081: quincy: mon: no mdsmap broadcast after "fs set joinable" is set to true (Resolved, Patrick Donnelly)
Copied to: CephFS - Backport #63082: reef: mon: no mdsmap broadcast after "fs set joinable" is set to true (Resolved, Patrick Donnelly)
Actions #1

Updated by Milind Changire 8 months ago

  • Severity changed from 3 - minor to 1 - critical
Actions #2

Updated by Venky Shankar 8 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Target version set to v19.0.0
  • Backport set to quincy,reef

The upgrade process uses `fail_fs` which fails the file system and upgrades the MDSs without reducing max_mds to 1. I debugged this a bit with Milind and it does seem like the MDS did not receive the updated map and failed to transition to a rank.

Actions #4

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #62863: Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail added
Actions #5

Updated by Venky Shankar 8 months ago

  • Related to Bug #62848: qa: fail_fs upgrade scenario hanging added
Actions #6

Updated by Patrick Donnelly 8 months ago

  • Related to deleted (Bug #62863: Slowness or deadlock in ceph-fuse causes teuthology job to hang and fail)
Actions #7

Updated by Venky Shankar 8 months ago

  • Priority changed from Normal to High
Actions #8

Updated by Patrick Donnelly 8 months ago

  • Priority changed from High to Normal

Milind Changire wrote:

[...]

The `fs set joinable true` command, when executed by the mgr, reaches the mon, but the mon fails to broadcast the mdsmap update, leaving all MDSs in up:standby for this specific run.

The MDS do not receive an updated broadcast because they've not been assigned a new file system; i.e. they are up:standby.

The real question is why the mons do not assign any of the standbys to ranks.

NOTE: This is an upgrade scenario

Here's the log from the mon that is handling the `fs set joinable true` command from the mgr:
[...]

A few issues:

- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.
- The problem seems to be that `FSMap::get_available_standby` is failing because of this check:

https://github.com/ceph/ceph/blob/9fedc1e062027dbce66747e5d0dc11319615ab8a/src/mds/FSMap.cc#L769-L772

The recent addition of the minor log segment incompat bit caused that check to fail.
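To illustrate the failure mode described above, here is a minimal sketch (in Python, not the actual C++ `FSMap` code) of how a newly added incompat feature bit can cause standby selection to reject every MDS. The feature names, helper functions, and data layout are all illustrative assumptions, not the real Ceph API.

```python
# Hypothetical sketch of compat-gated standby selection (NOT the real
# FSMap::get_available_standby implementation).

def is_compatible(required_incompat: set, supported_incompat: set) -> bool:
    """A daemon is eligible only if it advertises every incompat
    feature the file system requires (subset check)."""
    return required_incompat <= supported_incompat

def get_available_standby(required_incompat, standbys):
    """Return the first standby whose advertised compat set satisfies
    the file system's required set, or None if no standby qualifies."""
    for name, supported in standbys:
        if is_compatible(required_incompat, supported):
            return name
    return None

# The fs map now requires a new (hypothetical) incompat bit that the
# standbys, as recorded in the map, do not advertise...
required = {"file layout v2", "minor log segments"}
standbys = [
    ("mds.a", {"file layout v2"}),
    ("mds.b", {"file layout v2"}),
]

# ...so no standby is ever selected for a rank, and every MDS stays
# in up:standby even after "fs set joinable true".
print(get_available_standby(required, standbys))  # None
```

The point of the sketch is that the bug is not in the broadcast path itself: when the selection step returns nothing, there is no rank assignment, and hence no map update worth broadcasting to the standbys.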

I'll work on a fix for the second issue.

Actions #9

Updated by Patrick Donnelly 8 months ago

  • Status changed from Triaged to Fix Under Review
  • Source set to Q/A
  • Severity changed from 1 - critical to 3 - minor
  • Pull request ID set to 53600
Actions #10

Updated by Patrick Donnelly 8 months ago

Patrick Donnelly wrote:

- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.

https://tracker.ceph.com/issues/62953

Actions #11

Updated by Venky Shankar 7 months ago

Patrick Donnelly wrote:

Patrick Donnelly wrote:

- This upgrade test is going from pacific to main. This is an N-3 to N upgrade.

https://tracker.ceph.com/issues/62953

Yeah, we discussed this in stand-up; upgrades need to be from N-2 releases at most. Just for the record, this is a separate issue and is not inducing the missing mdsmap update we are seeing in the failed test.

Actions #12

Updated by Milind Changire 7 months ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Backport Bot 7 months ago

  • Copied to Backport #63081: quincy: mon: no mdsmap broadcast after "fs set joinable" is set to true added
Actions #14

Updated by Backport Bot 7 months ago

  • Copied to Backport #63082: reef: mon: no mdsmap broadcast after "fs set joinable" is set to true added
Actions #15

Updated by Backport Bot 7 months ago

  • Tags set to backport_processed
Actions #16

Updated by Patrick Donnelly 7 months ago

  • Related to deleted (Bug #62848: qa: fail_fs upgrade scenario hanging)
Actions #17

Updated by Patrick Donnelly 7 months ago

  • Has duplicate Bug #62848: qa: fail_fs upgrade scenario hanging added
Actions #18

Updated by Venky Shankar 6 months ago

  • Status changed from Pending Backport to Resolved