Bug #63105
mds: report clients laggy due laggy OSDs only after checking any OSD is laggy
Description
Currently the code to report health warning about laggy clients due to laggy OSDs in mds/Beacon.cc is buggy since it reports:
Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 17676 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)
Health check failed: 1 client(s) laggy due to laggy OSDs (MDS_CLIENTS_LAGGY)
MDS health message cleared (mds.?): Client 28931 is laggy; not evicted because some OSD(s) is/are laggy
Health check cleared: MDS_CLIENTS_LAGGY (was: 1 client(s) laggy due to laggy OSDs)
because the current code in Beacon.cc only checks whether the laggy_clients set is non-empty, not whether any OSD is actually laggy. This is erroneous and must be fixed:
1) if any osd is laggy and there are laggy_clients with defer_client_eviction_on_laggy_osds true: Client X is laggy; not evicted because some OSD is/are laggy
2) if any osd is laggy and there are laggy_clients but defer_client_eviction_on_laggy_osds is unset: Client X is laggy because some OSD is/are laggy
I.e., we will continue reporting clients that are laggy due to laggy OSDs, but the message will not mention eviction when the config defer_client_eviction_on_laggy_osds is unset/false/off.
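The two cases above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Beacon.cc code: the function name `laggy_client_messages` and its parameters are hypothetical, `any_osd_laggy` stands in for whatever check the MDS uses to decide that an OSD is laggy, and the message wording is illustrative.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper (not part of Ceph): builds the per-client health
// messages for the two cases described in this report.
std::vector<std::string> laggy_client_messages(
    bool any_osd_laggy,          // assumed outcome of an "is any OSD laggy?" check
    bool defer_client_eviction,  // value of defer_client_eviction_on_laggy_osds
    const std::set<uint64_t>& laggy_clients) {
  std::vector<std::string> msgs;

  // Report only when an OSD is actually laggy. A non-empty laggy_clients
  // set alone is not enough, which is the bug described above.
  if (!any_osd_laggy) {
    return msgs;
  }

  for (uint64_t id : laggy_clients) {
    std::string m = "Client " + std::to_string(id) + " is laggy";
    // Case 1: eviction is deferred, so say so.
    // Case 2: config unset/false, so do not mention eviction at all.
    m += defer_client_eviction
             ? "; not evicted because some OSD(s) is/are laggy"
             : " because some OSD(s) is/are laggy";
    msgs.push_back(m);
  }
  return msgs;
}
```

With no laggy OSD the function returns an empty list regardless of the client set, so no MDS_CLIENTS_LAGGY-style message would be produced in that situation.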
Updated by Dhairya Parmar 7 months ago
- Subject changed from mds: report clients laggy due laggy OSDs only after checking if any OSD is actually laggy to mds: report clients laggy due laggy OSDs only after checking any OSD is laggy
Updated by Dhairya Parmar 7 months ago
- Status changed from New to Fix Under Review
Updated by Venky Shankar 7 months ago
Dhairya Parmar wrote:
Currently the code to report health warning about laggy clients due to laggy OSDs in mds/Beacon.cc is buggy since it reports:
[...]
because the current code in Beacon.cc checks if the laggy_clients set is non-empty. This is erroneous and must be fixed:
1) if any osd is laggy and there are laggy_clients with defer_client_eviction_on_laggy_osds true: Client X is laggy; not evicted because some OSD is/are laggy
2) if any osd is laggy and there are laggy_clients but defer_client_eviction_on_laggy_osds is unset: Client X is laggy because some OSD is/are laggy
I.e. we will continue reporting clients that are laggy due to laggy osds but we will not say they are evicted when config defer_client_eviction_on_laggy_osds is unset/false/off.
Doesn't the laggy clients list get cleared here: https://github.com/ceph/ceph/blob/main/src/mds/MDSRank.cc#L752 ?
Updated by Dhairya Parmar 7 months ago
Venky Shankar wrote:
Dhairya Parmar wrote:
Currently the code to report health warning about laggy clients due to laggy OSDs in mds/Beacon.cc is buggy since it reports:
[...]
because the current code in Beacon.cc checks if the laggy_clients set is non-empty. This is erroneous and must be fixed:
1) if any osd is laggy and there are laggy_clients with defer_client_eviction_on_laggy_osds true: Client X is laggy; not evicted because some OSD is/are laggy
2) if any osd is laggy and there are laggy_clients but defer_client_eviction_on_laggy_osds is unset: Client X is laggy because some OSD is/are laggy
I.e. we will continue reporting clients that are laggy due to laggy osds but we will not say they are evicted when config defer_client_eviction_on_laggy_osds is unset/false/off.
Doesn't the laggy clients list get cleared here: https://github.com/ceph/ceph/blob/main/src/mds/MDSRank.cc#L752 ?
This patch applies to the point where we try to report laggy clients after finding them in find_idle_sessions and evict_cap_revoke_non_responders. See the detailed explanation at https://github.com/ceph/ceph/pull/53839#issuecomment-1748916319
Updated by Maximilian Stinsky 7 months ago
Hello.
We just upgraded one of our ceph clusters from 16.2.13 to 16.2.14. After the upgrade we have problems with our cephfs when we reboot servers that mount it.
The cluster often goes into a state like:
ceph --cluster es1 health detail
HEALTH_WARN 1 client(s) laggy due to laggy OSDs; 1 clients failing to respond to capability release; 1 MDSs report slow requests
[WRN] MDS_CLIENTS_LAGGY: 1 client(s) laggy due to laggy OSDs
mds.DE-ES-001-03-09-05-2(mds.0): Client 3097440721 is laggy; not evicted because some OSD(s) is/are laggy
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.DE-ES-001-03-09-05-2(mds.0): Client DE-ES-001-03-07-01-8:cephfs failing to respond to capability release client_id: 3097440721
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.DE-ES-001-03-09-05-2(mds.0): 3 slow requests are blocked > 30 secs
We were only able to clear this by restarting the currently active mds. The cluster has no laggy osds anywhere at that point; only a server mounting the cephfs was rebooted.
After digging around we found that https://github.com/ceph/ceph/pull/52270 was included in the 16.2.14 release so we disabled `defer_client_eviction_on_laggy_osds` which fixes the issue for now in our env.
Could it be that our problem is related to this bug report here or should we create a new one?
Updated by Venky Shankar 7 months ago
Hi Maximilian,
Maximilian Stinsky wrote:
Hello.
We just upgraded one of our ceph clusters from 16.2.13 to 16.2.14. After the upgrade we have problems with our cephfs when we reboot servers that mount it.
The cluster often goes into a state like: [...]
We were only able to clear this by restarting the currently active mds. The cluster has no laggy osds anywhere at that point; only a server mounting the cephfs was rebooted.
After digging around we found that https://github.com/ceph/ceph/pull/52270 was included in the 16.2.14 release so we disabled `defer_client_eviction_on_laggy_osds` which fixes the issue for now in our env.
Could it be that our problem is related to this bug report here or should we create a new one?
It's likely that you are running into this bug. For now, please run with the config disabled.
Updated by Venky Shankar 6 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to reef,quincy,pacific
Updated by Backport Bot 6 months ago
- Copied to Backport #63269: pacific: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Updated by Backport Bot 6 months ago
- Copied to Backport #63270: quincy: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Updated by Backport Bot 6 months ago
- Copied to Backport #63271: reef: mds: report clients laggy due laggy OSDs only after checking any OSD is laggy added
Updated by Laura Flores 6 months ago
/a/yuriw-2023-10-25_14:34:26-rados-wip-yuri5-testing-2023-10-24-0737-pacific-distro-default-smithi/7436955