Bug #61907: api tests fail from "MDS_CLIENTS_LAGGY" warning
Status: Closed
% Done: 100%
Description
The api tests on main are failing. Upon investigation, I noticed that they are not all failing on the same api test each time, but every affected test fails in the same way: it times out waiting for the cluster health to clear.
Here is an example of what Jenkins says the failure is:
https://jenkins.ceph.com/job/ceph-api/57740/consoleFull
2023-07-05 21:39:37,871.871 INFO:__main__:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 21:39:37,871.871 INFO:__main__:That module-handled commands have appropriate behavior on
2023-07-05 21:39:37,871.871 INFO:__main__:----------------------------------------------------------------------
2023-07-05 21:39:37,871.871 INFO:__main__:Traceback (most recent call last):
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/test_module_selftest.py", line 204, in test_module_commands
2023-07-05 21:39:37,871.871 INFO:__main__: self.wait_for_health_clear(timeout=30)
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 194, in wait_for_health_clear
2023-07-05 21:39:37,871.871 INFO:__main__: self.wait_until_true(is_clear, timeout)
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-07-05 21:39:37,871.871 INFO:__main__: raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-07-05 21:39:37,871.871 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
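For context, the failing call is a simple polling loop: `wait_for_health_clear` wraps a condition in `wait_until_true`, which retries until a timeout. A minimal sketch of that pattern (a simplified illustration, not the actual code in qa/tasks/ceph_test_case.py):

```python
import time


class TestTimeoutError(RuntimeError):
    pass


def wait_until_true(condition, timeout, period=5):
    """Poll `condition` every `period` seconds until it returns True,
    raising TestTimeoutError once `timeout` seconds have elapsed."""
    elapsed = 0
    retry_count = 0
    while not condition():
        if elapsed >= timeout:
            raise TestTimeoutError(
                "Timed out after {0}s and {1} retries".format(elapsed, retry_count))
        time.sleep(period)
        elapsed += period
        retry_count += 1


def wait_for_health_clear(get_health_checks, timeout):
    # Health is "clear" only when no checks (e.g. MDS_CLIENTS_LAGGY) are
    # active, so a warning that never goes away guarantees a timeout here.
    def is_clear():
        return len(get_health_checks()) == 0
    wait_until_true(is_clear, timeout)
```

This is why any health warning that persists past the timeout (here 30s) fails the test, regardless of which api test happens to be running.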
Inspecting the cluster logs on failed jobs, I'm seeing a warning "MDS_CLIENTS_LAGGY" that seems to come up in response to laggy OSDs:
https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/cluster.mon.a.log
2023-07-05T19:39:13.174105+0000 mon.a (mon.0) 1069 : cluster 4 Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-07-05T19:39:13.174232+0000 mon.a (mon.0) 1070 : cluster 3 Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
...
2023-07-05T19:59:31.058574+0000 mon.a (mon.0) 4204 : cluster 3 Health check failed: 2 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
...
2023-07-05T20:01:12.644794+0000 mon.b (mon.1) 594 : audit 1 from='mgr.15715 172.21.2.15:0/302841377' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.647803+0000 mon.a (mon.0) 4335 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.664986+0000 mon.a (mon.0) 4336 : cluster 1 osd.1 marked itself down and dead
2023-07-05T20:01:12.872389+0000 mon.a (mon.0) 4337 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd='[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]': finished
2023-07-05T20:01:12.872796+0000 mon.a (mon.0) 4338 : cluster 0 osdmap e380: 4 total, 2 up, 4 in
2023-07-05T20:01:13.878271+0000 mon.a (mon.0) 4339 : cluster 0 osdmap e381: 4 total, 2 up, 4 in
2023-07-05T20:01:14.273062+0000 mon.a (mon.0) 4340 : cluster 0 osdmap e382: 4 total, 2 up, 4 in
2023-07-05T20:01:14.652110+0000 mgr.x (mgr.15715) 48 : cluster 0 pgmap v69: 209 pgs: 2 remapped+peering, 132 active+undersized, 75 active+undersized+degraded; 661 KiB data, 4.0 GiB used, 400 GiB / 404 GiB avail; 9.2 KiB/s wr, 7 op/s; 368/705 objects degraded (52.199%)
2023-07-05T20:01:14.807391+0000 mds.a (mds.0) 1 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13420), after 405.993 seconds
2023-07-05T20:01:14.807455+0000 mds.a (mds.0) 2 : cluster 1 Evicting (and blocklisting) client session 13420 (172.21.2.15:0/354370727)
2023-07-05T20:01:14.807710+0000 mds.a (mds.0) 3 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13436), after 405.993 seconds
2023-07-05T20:01:14.807749+0000 mds.a (mds.0) 4 : cluster 1 Evicting (and blocklisting) client session 13436 (172.21.2.15:0/4066908696)
2023-07-05T20:01:14.809392+0000 mon.a (mon.0) 4341 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/354370727"}]: dispatch
2023-07-05T20:01:14.811886+0000 mon.a (mon.0) 4342 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/4066908696"}]: dispatch
This warning was merged just last week in https://github.com/ceph/ceph/pull/49971. It has not yet been backported, which is consistent with these failures occurring only on main at the moment.
The warning itself might not indicate a real problem, but either way, in a testing environment it causes tests to fail when they wait for the cluster health to clear.
More examples:
https://jenkins.ceph.com/job/ceph-api/57741/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57723/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57699/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57685/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57681/artifact/build/out/cluster.mon.a.log
Updated by Laura Flores 11 months ago
Full logs are available for all api jobs, i.e. https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/.
Updated by Laura Flores 11 months ago
- Assignee set to Dhairya Parmar
Hey @Dhairya Parmar can you take a look?
Updated by Patrick Donnelly 11 months ago
- Related to Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Updated by Dhairya Parmar 11 months ago
It seems some OSDs get laggy in the testing environment. In that case we can either globally ignore the line (e.g. by adding MDS_CLIENTS_LAGGY to some global yaml's log-ignorelist?) or turn off the behaviour of deferring client eviction by default (I kept it turned on by default following recommendations in my PR).
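For reference, the second option would presumably be toggled via an mds config option. This is a hypothetical sketch assuming the option introduced in PR #49971 is named `defer_client_eviction_if_laggy`; check the PR for the actual name and default:

```shell
# Hypothetical: assumes PR #49971 added an mds option named
# defer_client_eviction_if_laggy; verify the real name before using.
ceph config set mds defer_client_eviction_if_laggy false
```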
Updated by Laura Flores 11 months ago
I think it makes sense to whitelist this warning in the testing environment since we expect laggy OSDs in some situations.
Updated by Laura Flores 11 months ago
Pretty sure this yaml file is what whitelists messages in the api tests: https://github.com/ceph/ceph/blob/main/qa/tasks/thrashosds-health.yaml
Worth trying to add the warning there.
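If that is the right file, the change would presumably be a one-line addition under its log-ignorelist. A hypothetical sketch following the usual qa yaml override shape (the real file's structure and existing entries may differ):

```yaml
# Sketch only: the actual thrashosds-health.yaml layout may differ.
overrides:
  ceph:
    log-ignorelist:
      - MDS_CLIENTS_LAGGY
```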
Updated by Laura Flores 11 months ago
Testing that out here: https://github.com/ceph/ceph/pull/52342
Marked as a draft since I want to check if the api test passes with this change.
Updated by Patrick Donnelly 11 months ago
Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.
I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".
Updated by Laura Flores 11 months ago
Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?
Updated by Ronen Friedman 11 months ago
Patrick Donnelly wrote:
Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.
I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".
I'd suggest ignoring it for now, allowing PRs to be merged, and continuing the debug effort in the background.
Updated by Patrick Donnelly 11 months ago
Laura Flores wrote:
Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?
Normally the MDS learns about this when the mon blocklists the registered clients during failover. This didn't happen, possibly because the session was just established and the mons did not yet record the registered client address to blocklist.
We can try ignoring it for now; if that's the case, then this actually isn't interesting.
Updated by Laura Flores 10 months ago
- Status changed from New to Fix Under Review
- Assignee changed from Dhairya Parmar to Laura Flores
- Pull request ID set to 52342
Updated by Nizamudeen A 10 months ago
- Status changed from Fix Under Review to Resolved
Updated by Patrick Donnelly 10 months ago
- Related to deleted (Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY)
Updated by Patrick Donnelly 10 months ago
- Has duplicate Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Updated by Laura Flores 9 months ago
- Status changed from Resolved to Pending Backport
- Backport set to reef,quincy,pacific
Updated by Backport Bot 9 months ago
- Copied to Backport #62441: quincy: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Backport Bot 9 months ago
- Copied to Backport #62442: pacific: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Backport Bot 9 months ago
- Copied to Backport #62443: reef: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Laura Flores 9 months ago
- Related to Bug #62557: rados: Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning added
Updated by Konstantin Shalygin 8 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100