Bug #63105

open

mds: report clients laggy due laggy OSDs only after checking any OSD is laggy

Added by Dhairya Parmar 7 months ago. Updated 6 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
-
% Done:
0%

Source:
Development
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Currently, the code in mds/Beacon.cc that reports the health warning about clients being laggy due to laggy OSDs is buggy, since it reports:

Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 17676 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)
Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 28931 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)

This happens because the current code in Beacon.cc only checks whether the laggy_clients set is non-empty, without first checking whether any OSD is actually laggy. This is erroneous and must be fixed so that the reporting works as follows:
1) if any OSD is laggy and there are laggy_clients with defer_client_eviction_on_laggy_osds true: Client X is laggy; not evicted because some OSD is/are laggy
2) if any OSD is laggy and there are laggy_clients but defer_client_eviction_on_laggy_osds is unset: Client X is laggy because some OSD is/are laggy

I.e. we will continue reporting clients that are laggy due to laggy OSDs, but the message will not say "not evicted" when the config defer_client_eviction_on_laggy_osds is unset/false/off.
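
The two cases above can be sketched roughly as follows. This is only an illustration of the proposed decision logic, not the actual Beacon.cc code: the function name laggy_client_messages and its parameters are hypothetical, the "is any OSD laggy?" result is assumed to be computed elsewhere, and the message wording is taken from the log output quoted in this report.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not from the Ceph tree): builds the per-client
// health messages for cases 1) and 2) of the description.
std::vector<std::string> laggy_client_messages(
    bool any_osd_laggy,   // assumed result of an "is any OSD laggy?" check
    bool defer_eviction,  // value of defer_client_eviction_on_laggy_osds
    const std::set<std::uint64_t>& laggy_clients)
{
  std::vector<std::string> out;
  // Gate on the OSD state first: a non-empty laggy_clients set alone
  // is not enough to blame laggy OSDs.
  if (!any_osd_laggy || laggy_clients.empty()) {
    return out;
  }
  for (std::uint64_t id : laggy_clients) {
    std::string msg = "Client " + std::to_string(id) + " is laggy";
    // Only mention the deferred eviction when the config is enabled.
    msg += defer_eviction
      ? "; not evicted because some OSD(s) is/are laggy"
      : " because some OSD(s) is/are laggy";
    out.push_back(msg);
  }
  return out;
}
```

With this shape, a laggy client on a cluster with no laggy OSDs produces no MDS_CLIENTS_LAGGY message at all, whatever the config value is.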


Related issues 3 (2 open, 1 closed)

Copied to CephFS - Backport #63269: pacific: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy (Resolved, Dhairya Parmar)
Copied to CephFS - Backport #63270: quincy: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy (In Progress, Dhairya Parmar)
Copied to CephFS - Backport #63271: reef: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy (In Progress, Dhairya Parmar)
Actions #1

Updated by Dhairya Parmar 7 months ago

  • Subject changed from mds: report clients laggy due laggy OSDs only after checking if any OSD is actually laggy to mds: report clients laggy due laggy OSDs only after checking any OSD is laggy
Actions #2

Updated by Dhairya Parmar 7 months ago

  • Pull request ID set to 53839
Actions #3

Updated by Dhairya Parmar 7 months ago

  • Status changed from New to Fix Under Review
Actions #4

Updated by Venky Shankar 7 months ago

Doesn't the laggy clients list get cleared here: https://github.com/ceph/ceph/blob/main/src/mds/MDSRank.cc#L752 ?

Actions #5

Updated by Dhairya Parmar 7 months ago

This patch applies when we try to report laggy clients after finding them in find_idle_sessions and evict_cap_revoke_non_responders; see the detailed explanation at https://github.com/ceph/ceph/pull/53839#issuecomment-1748916319

Actions #6

Updated by Maximilian Stinsky 7 months ago

Hello.

We just upgraded one of our ceph clusters from 16.2.13 to 16.2.14. After the upgrade we have problems with our cephfs when we reboot servers that mount it.
The cluster often goes into a state like:

ceph --cluster es1 health detail
HEALTH_WARN 1 client(s) laggy due to laggy OSDs; 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENTS_LAGGY: 1 client(s) laggy due to laggy OSDs
    mds.DE-ES-001-03-09-05-2(mds.0): Client 3097440721 is laggy; not evicted because some OSD(s) is/are laggy
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.DE-ES-001-03-09-05-2(mds.0): Client DE-ES-001-03-07-01-8:cephfs failing to respond to capability release client_id: 3097440721
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.DE-ES-001-03-09-05-2(mds.0): 3 slow requests are blocked > 30 secs

We were only able to clear this by restarting the currently active mds. The cluster had no laggy osds anywhere at that point; only a server mounting the cephfs was rebooted.

After digging around we found that https://github.com/ceph/ceph/pull/52270 was included in the 16.2.14 release, so we disabled `defer_client_eviction_on_laggy_osds`, which fixes the issue for now in our environment.

Could it be that our problem is related to this bug report here or should we create a new one?

Actions #7

Updated by Venky Shankar 7 months ago

Hi Maximilian,

It's likely that you are running into this bug. For now, please run with the config disabled.

Actions #8

Updated by Venky Shankar 6 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to reef,quincy,pacific
Actions #9

Updated by Backport Bot 6 months ago

  • Copied to Backport #63269: pacific: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Actions #10

Updated by Backport Bot 6 months ago

  • Copied to Backport #63270: quincy: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Actions #11

Updated by Backport Bot 6 months ago

  • Copied to Backport #63271: reef: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Actions #12

Updated by Backport Bot 6 months ago

  • Tags set to backport_processed
Actions #13

Updated by Laura Flores 6 months ago

/a/yuriw-2023-10-25_14:34:26-rados-wip-yuri5-testing-2023-10-24-0737-pacific-distro-default-smithi/7436955
