Bug #65265

qa: health warning "no active mgr (MGR_DOWN)" occurs before and after test_nfs runs

Added by Rishabh Dave about 1 month ago. Updated 6 days ago.

Status: Fix Under Review
Priority: Urgent
Category: Correctness/Safety
Target version: v20.0.0
% Done: 0%
Source:
Tags:
Backport: quincy,reef,squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): mgr/nfs
Labels (FS):
Pull request ID: 56944
Crash signature (v1):
Crash signature (v2):

Description

Link to the job - https://pulpito.ceph.com/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/

The tests (qa/tasks/cephfs/test_nfs.py) ran successfully but the job failed due to unexpected health warnings:

2024-03-27T06:38:24.458 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

This health warning occurred 4 times in total: 2 times before test_nfs.py started running and 2 times after test_nfs.py finished running, and never while test_nfs.py was running.

Warning 1, line 11268 - 2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
Warning 2, line 23642 - 2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.832342+0000 mon.a (mon.0) 342 : cluster [INF] Health check cleared: MGR_DOWN (was: no active mgr)
Then the cluster becomes healthy, from line 23643:

2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.832393+0000 mon.a (mon.0) 343 : cluster [INF] Cluster is now healthy
2024-03-27T06:07:49.277 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:48 smithi184 bash[21504]: cluster 2024-03-27T06:07:47.836380+0000 mon.a (mon.0) 344 : cluster [DBG] mgrmap e20: x(active, star

Tests start running, line 42136 - 2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)
The cluster is healthy again as the tests near the end of test_nfs.py, from line 231178:
2024-03-27T06:32:18.776 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:32:18 smithi184 bash[21504]: cluster 2024-03-27T06:32:17.531023+0000 mon.a (mon.0) 3332 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)
2024-03-27T06:32:18.776 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:32:18 smithi184 bash[21504]: cluster 2024-03-27T06:32:17.531067+0000 mon.a (mon.0) 3333 : cluster [INF] Cluster is now healthy

Tests finish running, line 247158 - 2024-03-27T06:37:26.372 INFO:tasks.cephfs_test_runner:Ran 30 tests in 1796.992s

Warning 3, line 247231 - 2024-03-27T06:38:24.458 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)
Warning 4, line 247236 - 2024-03-27T06:38:24.673 INFO:teuthology.orchestra.run.smithi184.stdout:2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

From /a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/remote/smithi184/log/2d1fee3e-ebff-11ee-95d0-87774f69a715/ceph-mgr.x.log.gz -

2024-03-27T06:00:40.207+0000 7f426b494200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:00:55.715+0000 7f303e506200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:01:29.896+0000 7f8248a0c200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:07:39.269+0000 7fbe2299a200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:07:56.965+0000 7f09f5cd8200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:18:30.951+0000 7f61b33d3200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7
2024-03-27T06:18:40.251+0000 7f39b1c35200  0 ceph version 19.0.0-2478-g155268c4 (155268c4e432a12433aa833f174f9fe3b1016ae0) squid (dev), process ceph-mgr, pid 7

None of the PRs in the testing batch looks related to this. In fact, this doesn't look related to CephFS. Venky confirmed the same.
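
For context on why these transient warnings fail the job: the cluster log is scanned for [WRN]/[ERR] entries that are not matched by the job's ignore list. A rough, simplified sketch of that behaviour (illustrative only, not the actual teuthology code; the patterns and file name are made up):

import re

# Patterns a job might choose to tolerate; adding r"\(MGR_DOWN\)" here would
# have let this job pass. (Illustrative list, not a real job configuration.)
IGNORELIST = [
    r"\(FS_DEGRADED\)",
]

def scan_cluster_log(path):
    """Return cluster-log lines that would fail the job."""
    bad = []
    with open(path) as f:
        for line in f:
            if "[WRN]" not in line and "[ERR]" not in line:
                continue
            if any(re.search(pat, line) for pat in IGNORELIST):
                continue
            bad.append(line.rstrip())
    return bad

failures = scan_cluster_log("ceph.log")
if failures:
    print(f'failure_reason: "{failures[0]}" in cluster log')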


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #65021: qa/suites/fs/nfs: cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log (Duplicate, Dhairya Parmar)

Actions #1

Updated by Rishabh Dave about 1 month ago

  • Project changed from Ceph to CephFS
  • Labels (FS) qa-failure added
Actions #2

Updated by Laura Flores about 1 month ago

Looks like the MGR went down because of:

/a/rishabh-2024-03-27_05:27:11-fs-wip-rishabh-testing-20240326.131558-testing-default-smithi/7625569/remote/smithi184/log/2d1fee3e-ebff-11ee-95d0-87774f69a715/ceph-mgr.x.log.gz

2024-03-27T06:08:46.804+0000 7f0970a00700  0 [nfs ERROR nfs.export] Failed to apply export: path /testfile is not a dir
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/nfs/export.py", line 76, in validate_cephfs_path
    cephfs_path_is_dir(mgr, fs_name, path)
  File "/usr/share/ceph/mgr/nfs/utils.py", line 104, in cephfs_path_is_dir
    raise NotADirectoryError()
NotADirectoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/nfs/export.py", line 581, in _change_export
    return self._apply_export(cluster_id, export)
  File "/usr/share/ceph/mgr/nfs/export.py", line 830, in _apply_export
    new_export_dict
  File "/usr/share/ceph/mgr/nfs/export.py", line 689, in create_export_from_dict
    validate_cephfs_path(self.mgr, fs_name, path)
  File "/usr/share/ceph/mgr/nfs/export.py", line 78, in validate_cephfs_path
    raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)
nfs.exception.NFSException: path /testfile is not a dir

Actions #3

Updated by Venky Shankar 29 days ago

  • Category set to Correctness/Safety
  • Assignee set to Dhairya Parmar
  • Priority changed from Normal to Urgent
  • Target version set to v20.0.0
  • Backport set to quincy,reef,squid
  • Component(FS) mgr/nfs added
  • Labels (FS) deleted (qa-failure)

Thanks for taking a look, Laura.

Dhairya, please take this one. AFAICT, this exception should have been handled in mgr/nfs and an errno should have been returned to the caller.

Actions #4

Updated by Venky Shankar 27 days ago

  • Related to Bug #65021: qa/suites/fs/nfs: cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log added
Actions #5

Updated by Dhairya Parmar 24 days ago

How are we hitting this now? This code has existed for quite some time and it has always worked fine.

Actions #6

Updated by Dhairya Parmar 24 days ago

`validate_cephfs_path()` calls `cephfs_path_is_dir()` for every path. If the path is not a dir, it raises `NotADirectoryError()`, so `validate_cephfs_path()` should catch it in its first `except` block and re-raise it as an `NFSException`. I'm not sure how this isn't working:

# From mgr/nfs/export.py: converts the low-level errors into NFS exceptions.
def validate_cephfs_path(mgr: 'Module', fs_name: str, path: str) -> None:
    try:
        cephfs_path_is_dir(mgr, fs_name, path)
    except NotADirectoryError:
        raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)
    except cephfs.ObjectNotFound:
        raise NFSObjectNotFound(f"path {path} does not exist")
    except cephfs.Error as e:
        raise NFSException(e.args[1], -e.args[0])


# From mgr/nfs/utils.py: raises NotADirectoryError if the path is not a directory.
def cephfs_path_is_dir(mgr: 'Module', fs: str, path: str) -> None:
    @functools.lru_cache(maxsize=1)
    def _get_cephfs_client() -> CephfsClient:
        return CephfsClient(mgr)
    cephfs_client = _get_cephfs_client()

    with open_filesystem(cephfs_client, fs) as fs_handle:
        stx = fs_handle.statx(path.encode('utf-8'), cephfs.CEPH_STATX_MODE,
                              cephfs.AT_SYMLINK_NOFOLLOW)
        if not stat.S_ISDIR(stx.get('mode')):
            raise NotADirectoryError()
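
For completeness, the way such an `NFSException` normally reaches the user is by being translated into a negative errno plus an error string for the CLI. A minimal, self-contained sketch of that pattern (the wrapper name and return convention here are assumptions for illustration, not the exact mgr/nfs code):

import errno

class NFSException(Exception):
    def __init__(self, err_msg, errno_val):
        super().__init__(err_msg)
        self.errno = errno_val
        self.err_msg = err_msg

def cli_wrapper(path):
    """Illustrative handler: turn NFSException into a (retval, stdout, stderr) tuple."""
    try:
        # validate_cephfs_path(mgr, fs_name, path) would run here; we simulate
        # the failure path for a regular file.
        raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)
    except NFSException as e:
        return e.errno, "", e.err_msg

print(cli_wrapper("/testfile"))  # (-20, '', 'path /testfile is not a dir')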
Actions #7

Updated by Dhairya Parmar 23 days ago

Venky Shankar wrote in #note-3:

Thanks for taking a look, Laura.

Dhairya, please take this one. AFAICT, this exception should have been handled in mgr/nfs and an errno should have been returned to the caller.

It does exactly this; `raise NFSException(f"path {path} is not a dir", -errno.ENOTDIR)` is what should send the errno and error string to the CLI.

Actions #8

Updated by Venky Shankar 22 days ago

NotADirectoryError is probably not a valid (built-in) exception in some Python version. My question is: if this exception is getting handled, then why is it showing up in the mgr log?

Actions #9

Updated by Venky Shankar 22 days ago

Dhairya mentioned that the tracebacks seen in the mgr logs are logged by the object formatter and are not necessarily unhandled exceptions. This means that those tracebacks aren't really the underlying cause of the MGR_DOWN warning.
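
To illustrate the distinction (a generic Python sketch, not mgr code): `log.exception()` writes a full traceback even though the exception is caught, so a traceback in the log does not mean the exception escaped and took the daemon down.

import errno
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nfs.export")

def apply_export(path):
    try:
        raise NotADirectoryError(path)
    except NotADirectoryError:
        # The traceback is written to the log here, but the exception is
        # handled and an errno is returned; the process keeps running.
        log.exception("Failed to apply export: path %s is not a dir", path)
        return -errno.ENOTDIR

print(apply_export("/testfile"))  # -20, and a traceback appears in the log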

Actions #10

Updated by Dhairya Parmar 20 days ago

I was confident about the code; I've mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR to try something out, but the job failed [0], and this time it was

"2024-04-10T07:01:11.813042+0000 mon.a (mon.0) 610 : cluster [WRN] Health check failed: 1 stray daemon(s) not managed by cephadm (CEPHADM_STRAY_DAEMON)" in cluster log

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just the logging of exceptions that are raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there wasn't an active MGR; maybe something went wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/
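
As an illustration of the negative tests mentioned above, here is a rough sketch of the shape of such a test (the helper and the exact CLI invocation are approximations, not the real test_nfs.py code):

import subprocess

def ceph(*args, stdin=None):
    # Hypothetical helper standing in for the test's cluster-command wrapper.
    return subprocess.run(["ceph", *args], input=stdin,
                          capture_output=True, text=True)

def test_export_update_at_non_dir_path_sketch():
    # The real test first creates a regular file /testfile on the CephFS mount,
    # then tries to point the export at it and expects the command to fail
    # with "path /testfile is not a dir" (-ENOTDIR).
    export_spec = '{"pseudo": "/cephfs", "path": "/testfile"}'  # trimmed-down spec
    proc = ceph("nfs", "export", "apply", "test", "-i", "-", stdin=export_spec)
    assert proc.returncode != 0           # the update must be rejected ...
    assert "is not a dir" in proc.stderr  # ... and the traceback gets logged mgr-side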

Actions #11

Updated by Venky Shankar 18 days ago

Dhairya Parmar wrote in #note-10:

I was confident about the code; I've mentioned this in https://tracker.ceph.com/issues/65265#note-6. I then raised a PR to try something out, but the job failed [0], and this time it was
[...]

I then probed the entire mgr log and found 44 tracebacks (including the one mentioned by Laura). Those are just the logging of exceptions that are raised intentionally, since test_nfs consists of negative tests where invalid data is fed in to make sure several edge cases are handled as intended. I'd have been surprised if those 'raised exception' logs weren't present in the mgr log.

I don't think this is an issue. I'm trying to investigate why there wasn't an active MGR; maybe something went wrong during upgrades.
[0] https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.
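
For reference, a mgr-restart test does something roughly like the following; with only one ceph-mgr configured there is no standby, so MGR_DOWN shows up until the daemon comes back (simplified sketch, not the actual test code):

import json
import subprocess
import time

def restart_active_mgr_sketch():
    """Fail the active mgr and wait for an active mgr to reappear."""
    stat = json.loads(subprocess.check_output(["ceph", "mgr", "stat", "-f", "json"]))
    # With a single mgr there is no standby to take over, so the mon will
    # report "no active mgr (MGR_DOWN)" until the failed daemon respawns.
    subprocess.run(["ceph", "mgr", "fail", stat["active_name"]], check=True)
    for _ in range(60):
        stat = json.loads(subprocess.check_output(["ceph", "mgr", "stat", "-f", "json"]))
        if stat.get("available"):
            return
        time.sleep(1)
    raise RuntimeError("no active mgr came back")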

Actions #12

Updated by Dhairya Parmar 17 days ago

This doesn't seem related to the test cases at all.

Time when the MGR_DOWN warning was seen:

2024-03-27T06:07:34.833 INFO:journalctl@ceph.mon.a.smithi184.stdout:Mar 27 06:07:34 smithi184 bash[21504]: cluster 2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN] Health check failed: no active mgr (MGR_DOWN)

Time when the first test case ran:

2024-03-27T06:07:52.025 INFO:tasks.cephfs_test_runner:Starting test: test_cephfs_export_update_at_non_dir_path (tasks.cephfs.test_nfs.TestNFS)

The failure reason also mentions the same timestamp:

failure_reason: '"2024-03-27T06:07:34.219228+0000 mon.a (mon.0) 323 : cluster [WRN]
  Health check failed: no active mgr (MGR_DOWN)" in cluster log

Actions #13

Updated by Venky Shankar 17 days ago

Venky Shankar wrote in #note-11:

That job has a single ceph-mgr daemon configured. test_exports_on_mgr_restart will fail the mgr for a jiffy - that might be causing the warning.

Needs to be ignore listed then - possibly a fallout from the recent clog changes :/

Actions #14

Updated by Dhairya Parmar 16 days ago

Venky Shankar wrote in #note-13:

Needs to be ignore listed then - possibly a fallout from the recent clog changes :/

Greg suggested we should go with 2 MGRs instead of 1; what do you think about this?

Actions #15

Updated by Venky Shankar 16 days ago

Dhairya Parmar wrote in #note-14:

Greg suggested we should go with 2 MGRs instead of 1; what do you think about this?

That works too +1

Actions #16

Updated by Dhairya Parmar 16 days ago

Venky Shankar wrote in #note-15:

That works too +1

Sure, thanks.

Actions #19

Updated by Patrick Donnelly 7 days ago

Dhairya Parmar wrote in #note-18:

I ran a couple of NFS jobs, no `MGR_DOWN` reported

https://pulpito.ceph.com/dparmar-2024-04-10_06:37:26-fs:nfs-wip-65265-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_07:39:13-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

https://pulpito.ceph.com/dparmar-2024-04-25_09:10:36-fs:nfs-dparmar-24-apr-main-distro-default-smithi/

This warning is generated when the mgr has somehow crashed, so no mgr is available, and `mgr fail` is run.

What code is running `mgr fail` and why is the `fs` suite unaffected?

Actions #20

Updated by Dhairya Parmar 7 days ago · Edited

Patrick Donnelly wrote in #note-19:

What code is running `mgr fail` and why is the `fs` suite unaffected?

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that makes use of the `MgrTestCase` class.
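
In other words, the test fixture itself stops and fails every mgr during setup, which is where the MGR_DOWN warnings outside the test window come from. A self-contained paraphrase of that behaviour (illustrative class and function names, not the real mgr_test_case.py code):

class FakeMgrCluster:
    """Stand-in for the test helper that wraps the real mgr daemons."""
    def __init__(self, mgr_ids):
        self.mgr_ids = mgr_ids

    def mgr_stop(self, mgr_id):
        print(f"stopping mgr.{mgr_id}")

    def mgr_fail(self, mgr_id):
        # Once the last mgr has been failed, the mon logs
        # "Health check failed: no active mgr (MGR_DOWN)".
        print(f"ceph mgr fail {mgr_id}")

def setup_mgrs(cluster):
    # The fixture stops and fails *every* mgr before restarting them, so the
    # warning is an expected side effect of test setup/teardown rather than
    # of anything test_nfs.py does while it is running.
    for mgr_id in cluster.mgr_ids:
        cluster.mgr_stop(mgr_id)
        cluster.mgr_fail(mgr_id)

setup_mgrs(FakeMgrCluster(["x"]))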

Actions #21

Updated by Dhairya Parmar 7 days ago

Dhairya Parmar wrote in #note-20:

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that makes use of the `MgrTestCase` class.

The solution I can think of is to check whether an mgr exists before running `cls.mgr_cluster.mgr_fail(mgr_id)` in the code linked above.
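
A minimal sketch of that guard (self-contained and illustrative; the callables stand in for the real helpers and this is not the actual patch):

def fail_mgrs_with_guard(mgr_ids, mgr_exists, mgr_fail):
    """Only issue "mgr fail" for daemons that are actually registered."""
    for mgr_id in mgr_ids:
        if not mgr_exists(mgr_id):
            # Nothing to fail; running "ceph mgr fail" here would only
            # produce a spurious MGR_DOWN health warning in the cluster log.
            continue
        mgr_fail(mgr_id)

# Example: only mgr.x is known to the cluster, so only mgr.x gets failed.
fail_mgrs_with_guard(["x", "y"],
                     mgr_exists=lambda m: m == "x",
                     mgr_fail=lambda m: print(f"ceph mgr fail {m}"))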

Actions #22

Updated by Patrick Donnelly 7 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56944

Dhairya Parmar wrote in #note-20:

The `fs` suite is unaffected because the code is run by `qa/tasks/mgr/mgr_test_case.py` (https://github.com/ceph/ceph/blob/befd8dce33758178d3b108219d73b7710f68b133/qa/tasks/mgr/mgr_test_case.py#L78-L86), and the reason this is prevalent only in the `fs:nfs` suite is that it is the only fs suite that makes use of the `MgrTestCase` class.

Yes! Does that change your "root cause analysis" in your PR/commit message?

Actions #23

Updated by Patrick Donnelly 7 days ago

Dhairya Parmar wrote in #note-21:

The solution I can think of is to check whether an mgr exists before running `cls.mgr_cluster.mgr_fail(mgr_id)` in the code linked above.

Ignoring the warning is correct. I want you to clean up your analysis in the commit/PR.

Actions #24

Updated by Dhairya Parmar 6 days ago

Patrick Donnelly wrote in #note-23:

Ignoring the warning is correct. I want you to clean up your analysis in the commit/PR.

done
