Bug #63105: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy - CephFS - Ceph

Bug #63105

open

mds: report clients laggy due laggy OSDs only after checking any OSD is laggy

Added by Dhairya Parmar 7 months ago. Updated 6 months ago.

Status: Pending Backport
Priority: Normal
Assignee: Dhairya Parmar
Category: Correctness/Safety
Target version:
% Done:
Source: Development
Tags: backport_processed
Backport: reef,quincy,pacific
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID: 53839
Crash signature (v1):
Crash signature (v2):

Description

The code in mds/Beacon.cc that reports the health warning about clients that are laggy due to laggy OSDs is buggy. It reports:

Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 17676 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)
Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 28931 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)

This happens because the current code in Beacon.cc only checks whether the laggy_clients set is non-empty; it does not check whether any OSD is actually laggy. This is erroneous and must be fixed so that:
1) if any OSD is laggy and there are laggy_clients with defer_client_eviction_on_laggy_osds true, report: Client X is laggy; not evicted because some OSD is/are laggy
2) if any OSD is laggy and there are laggy_clients but defer_client_eviction_on_laggy_osds is unset, report: Client X is laggy because some OSD is/are laggy

I.e. we will continue reporting clients that are laggy due to laggy OSDs, but the message will not mention eviction when the config defer_client_eviction_on_laggy_osds is unset/false/off.
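
The two cases above can be sketched roughly as follows. This is only an illustration of the intended decision logic, not the code in mds/Beacon.cc or the change in the pull request: the function name laggy_client_messages, its parameters, and the exact message strings (which follow the wording of the log lines and the two cases above) are assumptions made for the example.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (NOT the actual Beacon.cc code): builds the per-client
// health messages for the two cases listed in the description.
std::vector<std::string> laggy_client_messages(
    const std::set<std::uint64_t>& laggy_clients,
    bool any_osd_laggy,
    bool defer_client_eviction_on_laggy_osds)
{
  std::vector<std::string> out;

  // A non-empty laggy_clients set alone is not enough: if no OSD is laggy,
  // nothing is attributed to laggy OSDs.
  if (!any_osd_laggy) {
    return out;
  }

  for (std::uint64_t id : laggy_clients) {
    std::string msg = "Client " + std::to_string(id) + " is laggy";
    if (defer_client_eviction_on_laggy_osds) {
      // Case 1: eviction is deferred, so say the client was not evicted.
      msg += "; not evicted because some OSD(s) is/are laggy";
    } else {
      // Case 2: config unset/false/off, so do not mention eviction at all.
      msg += " because some OSD(s) is/are laggy";
    }
    out.push_back(msg);
  }
  return out;
}
```

In this sketch the any-OSD-laggy check comes first, so a non-empty laggy_clients set with no laggy OSD produces no message, which is the situation the bug title asks for.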

Related issues 3 (2 open — 1 closed)

Updated by Dhairya Parmar 7 months ago

Subject changed from mds: report clients laggy due laggy OSDs only after checking if any OSD is actually laggy to mds: report clients laggy due laggy OSDs only after checking any OSD is laggy

Updated by Dhairya Parmar 7 months ago

Pull request ID set to 53839

Updated by Dhairya Parmar 7 months ago

Status changed from New to Fix Under Review

Updated by Venky Shankar 7 months ago

Doesn't the laggy clients list get cleared here: https://github.com/ceph/ceph/blob/main/src/mds/MDSRank.cc#L752 ?

Updated by Dhairya Parmar 7 months ago

This patch covers the case where we report laggy clients after finding them in find_idle_sessions and evict_cap_revoke_non_responders. See the detailed explanation at https://github.com/ceph/ceph/pull/53839#issuecomment-1748916319

Updated by Maximilian Stinsky 7 months ago

Hello.

We just upgraded one of our Ceph clusters from 16.2.13 to 16.2.14. Since the upgrade we have had problems with our CephFS whenever we reboot servers that mount it.
The cluster often goes into a state like:

ceph --cluster es1 health detail
HEALTH_WARN 1 client(s) laggy due to laggy OSDs; 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENTS_LAGGY: 1 client(s) laggy due to laggy OSDs
    mds.DE-ES-001-03-09-05-2(mds.0): Client 3097440721 is laggy; not evicted because some OSD(s) is/are laggy
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
    mds.DE-ES-001-03-09-05-2(mds.0): Client DE-ES-001-03-07-01-8:cephfs failing to respond to capability release client_id: 3097440721
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
    mds.DE-ES-001-03-09-05-2(mds.0): 3 slow requests are blocked > 30 secs

We were only able to clear this by restarting the currently active MDS. The cluster had no laggy OSDs anywhere at that point; only a server mounting the CephFS was rebooted.

After digging around we found that https://github.com/ceph/ceph/pull/52270 was included in the 16.2.14 release, so we disabled `defer_client_eviction_on_laggy_osds`, which fixes the issue for now in our environment.

Could our problem be related to this bug report, or should we create a new one?

Updated by Venky Shankar 7 months ago

Hi Maximilian,

It's likely that you are running into this bug. For now, please run with the config disabled.
