Bug #61907: api tests fail from "MDS_CLIENTS_LAGGY" warning
Status: Closed
% Done: 100%
Description
The api tests on main are failing. Upon investigation, I noticed that they are not all failing on the same api test each time, but every affected test fails in the same way: it times out waiting for the cluster health to clear.
Here is an example of what Jenkins says the failure is:
https://jenkins.ceph.com/job/ceph-api/57740/consoleFull
2023-07-05 21:39:37,871.871 INFO:__main__:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 21:39:37,871.871 INFO:__main__:That module-handled commands have appropriate behavior on
2023-07-05 21:39:37,871.871 INFO:__main__:----------------------------------------------------------------------
2023-07-05 21:39:37,871.871 INFO:__main__:Traceback (most recent call last):
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/test_module_selftest.py", line 204, in test_module_commands
2023-07-05 21:39:37,871.871 INFO:__main__: self.wait_for_health_clear(timeout=30)
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 194, in wait_for_health_clear
2023-07-05 21:39:37,871.871 INFO:__main__: self.wait_until_true(is_clear, timeout)
2023-07-05 21:39:37,871.871 INFO:__main__: File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-07-05 21:39:37,871.871 INFO:__main__: raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-07-05 21:39:37,871.871 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries
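For context, the failing call is a simple polling loop: `wait_for_health_clear` wraps a condition in `wait_until_true`, which retries until a timeout. A minimal sketch of that pattern (a simplified illustration, not the actual code in qa/tasks/ceph_test_case.py):

```python
import time


class TestTimeoutError(RuntimeError):
    pass


def wait_until_true(condition, timeout, period=5):
    """Poll `condition` every `period` seconds until it returns True,
    raising TestTimeoutError once `timeout` seconds have elapsed."""
    elapsed = 0
    retry_count = 0
    while not condition():
        if elapsed >= timeout:
            raise TestTimeoutError(
                "Timed out after {0}s and {1} retries".format(elapsed, retry_count))
        time.sleep(period)
        elapsed += period
        retry_count += 1


def wait_for_health_clear(get_health_checks, timeout):
    # Health is "clear" only when no checks (e.g. MDS_CLIENTS_LAGGY) are
    # active, so a warning that never goes away guarantees a timeout here.
    def is_clear():
        return len(get_health_checks()) == 0
    wait_until_true(is_clear, timeout)
```

This is why any health warning that persists past the timeout (here 30s) fails the test, regardless of which api test happens to be running.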
Inspecting the cluster logs on failed jobs, I'm seeing a warning "MDS_CLIENTS_LAGGY" that seems to come up in response to laggy OSDs:
https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/cluster.mon.a.log
2023-07-05T19:39:13.174105+0000 mon.a (mon.0) 1069 : cluster 4 Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-07-05T19:39:13.174232+0000 mon.a (mon.0) 1070 : cluster 3 Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
...
2023-07-05T19:59:31.058574+0000 mon.a (mon.0) 4204 : cluster 3 Health check failed: 2 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
...
2023-07-05T20:01:12.644794+0000 mon.b (mon.1) 594 : audit 1 from='mgr.15715 172.21.2.15:0/302841377' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.647803+0000 mon.a (mon.0) 4335 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.664986+0000 mon.a (mon.0) 4336 : cluster 1 osd.1 marked itself down and dead
2023-07-05T20:01:12.872389+0000 mon.a (mon.0) 4337 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd='[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]': finished
2023-07-05T20:01:12.872796+0000 mon.a (mon.0) 4338 : cluster 0 osdmap e380: 4 total, 2 up, 4 in
2023-07-05T20:01:13.878271+0000 mon.a (mon.0) 4339 : cluster 0 osdmap e381: 4 total, 2 up, 4 in
2023-07-05T20:01:14.273062+0000 mon.a (mon.0) 4340 : cluster 0 osdmap e382: 4 total, 2 up, 4 in
2023-07-05T20:01:14.652110+0000 mgr.x (mgr.15715) 48 : cluster 0 pgmap v69: 209 pgs: 2 remapped+peering, 132 active+undersized, 75 active+undersized+degraded; 661 KiB data, 4.0 GiB used, 400 GiB / 404 GiB avail; 9.2 KiB/s wr, 7 op/s; 368/705 objects degraded (52.199%)
2023-07-05T20:01:14.807391+0000 mds.a (mds.0) 1 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13420), after 405.993 seconds
2023-07-05T20:01:14.807455+0000 mds.a (mds.0) 2 : cluster 1 Evicting (and blocklisting) client session 13420 (172.21.2.15:0/354370727)
2023-07-05T20:01:14.807710+0000 mds.a (mds.0) 3 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13436), after 405.993 seconds
2023-07-05T20:01:14.807749+0000 mds.a (mds.0) 4 : cluster 1 Evicting (and blocklisting) client session 13436 (172.21.2.15:0/4066908696)
2023-07-05T20:01:14.809392+0000 mon.a (mon.0) 4341 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/354370727"}]: dispatch
2023-07-05T20:01:14.811886+0000 mon.a (mon.0) 4342 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/4066908696"}]: dispatch
This warning was merged just last week in https://github.com/ceph/ceph/pull/49971. It has not yet been backported, which is consistent with these failures occurring only on main at the moment.
The warning itself might not indicate a real problem, but either way, in a testing environment it causes tests to fail when they wait for the cluster health to clear.
More examples:
https://jenkins.ceph.com/job/ceph-api/57741/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57723/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57699/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57685/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57681/artifact/build/out/cluster.mon.a.log
Updated by Laura Flores 11 months ago
Full logs are available for all api jobs, i.e. https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/.
Updated by Laura Flores 11 months ago
- Assignee set to Dhairya Parmar
Hey @Dhairya Parmar can you take a look?
Updated by Patrick Donnelly 11 months ago
- Related to Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Updated by Dhairya Parmar 11 months ago
It seems some OSDs get laggy in the testing environment. In that case we can either globally ignore the line (e.g. by adding MDS_CLIENTS_LAGGY to some global yaml's log-ignorelist?) or turn off the behaviour of deferring client eviction by default (I kept it turned on by default following recommendations in my PR).
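For reference, the second option would presumably be toggled via an mds config option. This is a hypothetical sketch assuming the option introduced in PR #49971 is named `defer_client_eviction_if_laggy`; check the PR for the actual name and default:

```shell
# Hypothetical: assumes PR #49971 added an mds option named
# defer_client_eviction_if_laggy; verify the real name before using.
ceph config set mds defer_client_eviction_if_laggy false
```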
Updated by Laura Flores 11 months ago
I think it makes sense to whitelist this warning in the testing environment since we expect laggy OSDs in some situations.
Updated by Laura Flores 11 months ago
Pretty sure this yaml file is what whitelists messages in the api tests: https://github.com/ceph/ceph/blob/main/qa/tasks/thrashosds-health.yaml
Worth trying to add the warning there.
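If that is the right file, the change would presumably be a one-line addition under its log-ignorelist. A hypothetical sketch following the usual qa yaml override shape (the real file's structure and existing entries may differ):

```yaml
# Sketch only: the actual thrashosds-health.yaml layout may differ.
overrides:
  ceph:
    log-ignorelist:
      - MDS_CLIENTS_LAGGY
```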
Updated by Laura Flores 11 months ago
Testing that out here: https://github.com/ceph/ceph/pull/52342
Marked as a draft since I want to check if the api test passes with this change.
Updated by Patrick Donnelly 11 months ago
Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.
I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".
Updated by Laura Flores 11 months ago
Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?
Updated by Ronen Friedman 11 months ago
Patrick Donnelly wrote:
Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.
I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".
I'd suggest ignoring it for now, allowing PRs to be merged, and continuing the debug effort in the background.
Updated by Patrick Donnelly 11 months ago
Laura Flores wrote:
Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?
Normally the MDS learns about this when the mon blocklists the registered clients during failover. This didn't happen, possibly because the session was just established and the mons did not yet record the registered client address to blocklist.
We can try ignoring it for now; if that's the case, then this actually isn't interesting.
Updated by Laura Flores 10 months ago
- Status changed from New to Fix Under Review
- Assignee changed from Dhairya Parmar to Laura Flores
- Pull request ID set to 52342
Updated by Nizamudeen A 10 months ago
- Status changed from Fix Under Review to Resolved
Updated by Patrick Donnelly 10 months ago
- Related to deleted (Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY)
Updated by Patrick Donnelly 10 months ago
- Has duplicate Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Updated by Laura Flores 9 months ago
- Status changed from Resolved to Pending Backport
- Backport set to reef,quincy,pacific
Updated by Backport Bot 9 months ago
- Copied to Backport #62441: quincy: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Backport Bot 9 months ago
- Copied to Backport #62442: pacific: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Backport Bot 9 months ago
- Copied to Backport #62443: reef: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Updated by Laura Flores 9 months ago
- Related to Bug #62557: rados: Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning added
Updated by Konstantin Shalygin 8 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100