Bug #61907 (closed)

api tests fail from "MDS_CLIENTS_LAGGY" warning

Added by Laura Flores 11 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee: Laura Flores
Category: -
Target version: -
% Done: 100%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID: 52342
Crash signature (v1):
Crash signature (v2):

Description

The api tests on main are failing. Upon investigation, I noticed that they are not all failing on the same api test each time, but every affected test fails in the same way: it times out waiting for the cluster health to clear.

Here is an example of what Jenkins says the failure is:
https://jenkins.ceph.com/job/ceph-api/57740/consoleFull

2023-07-05 21:39:37,871.871 INFO:__main__:ERROR: test_module_commands (tasks.mgr.test_module_selftest.TestModuleSelftest)
2023-07-05 21:39:37,871.871 INFO:__main__:That module-handled commands have appropriate  behavior on
2023-07-05 21:39:37,871.871 INFO:__main__:----------------------------------------------------------------------
2023-07-05 21:39:37,871.871 INFO:__main__:Traceback (most recent call last):
2023-07-05 21:39:37,871.871 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/mgr/test_module_selftest.py", line 204, in test_module_commands
2023-07-05 21:39:37,871.871 INFO:__main__:    self.wait_for_health_clear(timeout=30)
2023-07-05 21:39:37,871.871 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 194, in wait_for_health_clear
2023-07-05 21:39:37,871.871 INFO:__main__:    self.wait_until_true(is_clear, timeout)
2023-07-05 21:39:37,871.871 INFO:__main__:  File "/home/jenkins-build/build/workspace/ceph-api/qa/tasks/ceph_test_case.py", line 231, in wait_until_true
2023-07-05 21:39:37,871.871 INFO:__main__:    raise TestTimeoutError("Timed out after {0}s and {1} retries".format(elapsed, retry_count))
2023-07-05 21:39:37,871.871 INFO:__main__:tasks.ceph_test_case.TestTimeoutError: Timed out after 30s and 0 retries

Inspecting the cluster logs on failed jobs, I'm seeing a warning "MDS_CLIENTS_LAGGY" that seems to come up in response to laggy OSDs:
https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/cluster.mon.a.log

2023-07-05T19:39:13.174105+0000 mon.a (mon.0) 1069 : cluster 4 Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2023-07-05T19:39:13.174232+0000 mon.a (mon.0) 1070 : cluster 3 Health check failed: 1 filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)

...

2023-07-05T19:59:31.058574+0000 mon.a (mon.0) 4204 : cluster 3 Health check failed: 2 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)

...

2023-07-05T20:01:12.644794+0000 mon.b (mon.1) 594 : audit 1 from='mgr.15715 172.21.2.15:0/302841377' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.647803+0000 mon.a (mon.0) 4335 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd=[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]: dispatch
2023-07-05T20:01:12.664986+0000 mon.a (mon.0) 4336 : cluster 1 osd.1 marked itself down and dead
2023-07-05T20:01:12.872389+0000 mon.a (mon.0) 4337 : audit 1 from='mgr.15715 ' entity='mgr.x' cmd='[{"prefix": "osd pool set", "pool": "cephfs_metadata", "var": "pgp_num_actual", "val": "8"}]': finished
2023-07-05T20:01:12.872796+0000 mon.a (mon.0) 4338 : cluster 0 osdmap e380: 4 total, 2 up, 4 in
2023-07-05T20:01:13.878271+0000 mon.a (mon.0) 4339 : cluster 0 osdmap e381: 4 total, 2 up, 4 in
2023-07-05T20:01:14.273062+0000 mon.a (mon.0) 4340 : cluster 0 osdmap e382: 4 total, 2 up, 4 in
2023-07-05T20:01:14.652110+0000 mgr.x (mgr.15715) 48 : cluster 0 pgmap v69: 209 pgs: 2 remapped+peering, 132 active+undersized, 75 active+undersized+degraded; 661 KiB data, 4.0 GiB used, 400 GiB / 404 GiB avail; 9.2 KiB/s wr, 7 op/s; 368/705 objects degraded (52.199%)
2023-07-05T20:01:14.807391+0000 mds.a (mds.0) 1 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13420), after 405.993 seconds
2023-07-05T20:01:14.807455+0000 mds.a (mds.0) 2 : cluster 1 Evicting (and blocklisting) client session 13420 (172.21.2.15:0/354370727)
2023-07-05T20:01:14.807710+0000 mds.a (mds.0) 3 : cluster 3 evicting unresponsive client braggi15.front.sepia.ceph.com:z (13436), after 405.993 seconds
2023-07-05T20:01:14.807749+0000 mds.a (mds.0) 4 : cluster 1 Evicting (and blocklisting) client session 13436 (172.21.2.15:0/4066908696)
2023-07-05T20:01:14.809392+0000 mon.a (mon.0) 4341 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/354370727"}]: dispatch
2023-07-05T20:01:14.811886+0000 mon.a (mon.0) 4342 : audit 1 from='mds.? [v2:172.21.2.15:6834/2853370360,v1:172.21.2.15:6835/2853370360]' entity='mds.a' cmd=[{"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.2.15:0/4066908696"}]: dispatch

This warning was merged just last week in https://github.com/ceph/ceph/pull/49971. It has not been backported yet, which is consistent with these failures only happening on main at the moment.

The warning might not indicate a problem in itself, but in a testing environment it causes tests to fail when they wait for the cluster health to clear.
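
For context, the qa helper in the traceback essentially polls cluster health until no checks remain or the timeout expires. Below is a minimal standalone sketch of that kind of wait using the ceph CLI; the function names (health_is_clear, wait_for_clear_health) are illustrative and this is not the actual code in qa/tasks/ceph_test_case.py.

# Standalone sketch: poll "ceph health detail" until no health checks
# (e.g. MDS_CLIENTS_LAGGY) remain, or give up after a timeout.
import json
import subprocess
import time

def health_is_clear() -> bool:
    out = subprocess.check_output(["ceph", "health", "detail", "--format=json"])
    # An empty "checks" map means there are no outstanding health warnings.
    return not json.loads(out).get("checks")

def wait_for_clear_health(timeout: float = 30.0, interval: float = 5.0) -> None:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if health_is_clear():
            return
        time.sleep(interval)
    raise TimeoutError(f"cluster health did not clear within {timeout}s")

A lingering warning such as MDS_CLIENTS_LAGGY keeps the set of checks non-empty, so any test that waits like this will time out even though nothing else is wrong.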

More examples:
https://jenkins.ceph.com/job/ceph-api/57741/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57723/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57699/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57685/artifact/build/out/cluster.mon.a.log
https://jenkins.ceph.com/job/ceph-api/57681/artifact/build/out/cluster.mon.a.log


Related issues 5 (1 open, 4 closed)

Related to RADOS - Bug #62557: rados: Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning (New)
Has duplicate CephFS - Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY (Duplicate, Dhairya Parmar)
Copied to CephFS - Backport #62441: quincy: api tests fail from "MDS_CLIENTS_LAGGY" warning (Resolved, Laura Flores)
Copied to CephFS - Backport #62442: pacific: api tests fail from "MDS_CLIENTS_LAGGY" warning (Resolved, Laura Flores)
Copied to CephFS - Backport #62443: reef: api tests fail from "MDS_CLIENTS_LAGGY" warning (Resolved, Laura Flores)
Actions #1

Updated by Laura Flores 11 months ago

Full logs are available for all api jobs, e.g. https://jenkins.ceph.com/job/ceph-api/57735/artifact/build/out/.

Actions #2

Updated by Laura Flores 11 months ago

  • Assignee set to Dhairya Parmar

Hey @Dhairya Parmar, can you take a look?

Actions #3

Updated by Laura Flores 11 months ago

  • Priority changed from Normal to High
Actions #4

Updated by Patrick Donnelly 11 months ago

  • Related to Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Actions #5

Updated by Dhairya Parmar 11 months ago

It seems like some OSDs get laggy in the testing environment. In this case we can either globally ignore the warning (e.g. by adding MDS_CLIENTS_LAGGY to some global yaml's log-ignorelist), or turn the behaviour of deferring client eviction off by default (I kept it turned on by default following the recommendations in my PR).

Actions #6

Updated by Laura Flores 11 months ago

I think it makes sense to whitelist this warning in the testing environment since we expect laggy OSDs in some situations.

Actions #7

Updated by Laura Flores 11 months ago

I'm pretty sure this yaml file is what whitelists messages in the api tests: https://github.com/ceph/ceph/blob/main/qa/tasks/thrashosds-health.yaml

It's worth trying to add the warning there; see the sketch below.
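
A rough idea of the change, assuming the file keeps an overrides/ceph/log-ignorelist structure like its existing entries (the exact entry syntax, e.g. the regex escaping, may differ in the real file):

# Sketch: add MDS_CLIENTS_LAGGY to the qa log-ignorelist so the health
# warning from laggy OSDs does not fail the api tests. Only the last
# line is the proposed addition; the surrounding keys mirror the file's
# existing layout.
overrides:
  ceph:
    log-ignorelist:
      - \(MDS_CLIENTS_LAGGY\)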

Actions #8

Updated by Laura Flores 11 months ago

Testing that out here: https://github.com/ceph/ceph/pull/52342

Marked as a draft since I want to check if the api test passes with this change.

Actions #9

Updated by Patrick Donnelly 11 months ago

Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.

I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".

Actions #10

Updated by Laura Flores 11 months ago

Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?

Actions #11

Updated by Ronen Friedman 11 months ago

Patrick Donnelly wrote:

Ignoring this may not be what we want. The MDS would normally evict these clients but does not because the OSDs are "laggy". That's what the warning is really about.

I suspect this test (or the previous one) involves the mgr failing over or some other recovery situation. The MDS may not know it restarted and is sitting on those clients because OSDs are also "laggy".

I'd suggest ignoring it for now, allowing PRs to be merged, and continuing the debug effort in the background.

Actions #12

Updated by Patrick Donnelly 11 months ago

Laura Flores wrote:

Would it be better then to somehow let the MDS know about intentional restarts / recovery situations?

Normally the MDS learns about this when the mon blocklists the registered clients during failover. That didn't happen here, possibly because the session was just established and the mons had not yet recorded the registered client address to blocklist.

We can try ignoring it for now; if that's the case, then this actually isn't interesting.

Actions #13

Updated by Laura Flores 10 months ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Dhairya Parmar to Laura Flores
  • Pull request ID set to 52342
Actions #14

Updated by Nizamudeen A 10 months ago

  • Status changed from Fix Under Review to Resolved
Actions #15

Updated by Patrick Donnelly 10 months ago

  • Related to deleted (Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY)
Actions #16

Updated by Patrick Donnelly 10 months ago

  • Has duplicate Bug #61897: qa: rados:mgr fails with MDS_CLIENTS_LAGGY added
Actions #17

Updated by Laura Flores 9 months ago

  • Status changed from Resolved to Pending Backport
  • Backport set to reef,quincy,pacific
Actions #18

Updated by Backport Bot 9 months ago

  • Copied to Backport #62441: quincy: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Actions #19

Updated by Backport Bot 9 months ago

  • Copied to Backport #62442: pacific: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Actions #20

Updated by Backport Bot 9 months ago

  • Copied to Backport #62443: reef: api tests fail from "MDS_CLIENTS_LAGGY" warning added
Actions #21

Updated by Backport Bot 9 months ago

  • Tags set to backport_processed
Actions #22

Updated by Laura Flores 9 months ago

  • Related to Bug #62557: rados: Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning added
Actions #23

Updated by Konstantin Shalygin 8 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100