Bug #64563

open

mds: enhance laggy clients detections due to laggy OSDs

Added by Dhairya Parmar 2 months ago. Updated 9 days ago.

Status:
Triaged
Priority:
Urgent
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Development
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Right now the code happily assumes that if there is any laggy OSD and a client goes laggy, it must be due to the OSD. The code is not able to differentiate between "this client isn't responding" and "this client is slow to release caps", meaning that if the client went off the grid AND we have any laggy OSD, the MDS will not evict that client and will instead mark it as "laggy due to laggy OSDs", which is completely wrong. There are five places in `Server.cc` where clients are added to the set `laggy_clients`; these need to be re-evaluated to make sure we do consider client eviction in cases like [0], where the last cap renew span exceeds the session autoclose duration (300 seconds by default), which is long enough to conclude that we have lost the client.
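The intended decision can be sketched as follows (a sketch only; `should_evict_stale_session` and the `Session` field are hypothetical names, and `session_autoclose` is the existing 300s default mentioned above):

```cpp
#include <cassert>
#include <chrono>

using Seconds = std::chrono::duration<double>;

struct Session {
    Seconds last_cap_renew_span;  // time since this client last renewed its caps
};

// Applies to a session that find_idle_sessions has already flagged as stale.
// Even when some OSD is laggy, a client that has not renewed its caps for
// longer than session_autoclose (default 300s) should be treated as lost and
// evicted, instead of being marked "laggy due to laggy OSDs".
bool should_evict_stale_session(const Session& s, bool any_osd_laggy,
                                Seconds session_autoclose = Seconds(300)) {
    if (s.last_cap_renew_span > session_autoclose)
        return true;       // client is gone; evict regardless of OSD state
    return !any_osd_laggy; // otherwise defer eviction only while OSDs are laggy
}
```

The key point is the first branch: past the autoclose deadline, OSD lagginess is no longer an acceptable excuse.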

Beyond this, there need to be sane/practical criteria for when an OSD is considered laggy, i.e. "laggy enough to make a client (or anything else) laggy". The current implementation is too naive: it just checks whether either laggy parameter (`osd_xinfo_t.laggy_interval` or `osd_xinfo_t.laggy_probability`) is non-zero. This makes the MDS skip evicting clients even when the OSD lagginess is too slight to matter. In other words, we need to make the helper `OSDMap::any_osd_laggy()` smarter and more fine-grained.

[0] https://github.com/ceph/ceph/blob/main/src/mds/Server.cc#L1184-L1190
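A threshold-based variant of the helper could look like this (a sketch; the struct only mirrors the two lagginess fields of `osd_xinfo_t`, and the threshold values are illustrative, not proposed defaults):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the two lagginess fields of osd_xinfo_t.
struct osd_xinfo {
    double   laggy_probability = 0.0;  // decaying probability the OSD is laggy
    uint32_t laggy_interval    = 0;    // average interval (s) of laggy episodes
};

// Current implementation: "laggy" if either field is non-zero. A
// finer-grained variant only counts an OSD once both fields cross
// configurable thresholds, so a trace of historical lagginess does not
// suppress client eviction forever.
bool any_osd_laggy(const std::vector<osd_xinfo>& osds,
                   double prob_threshold = 0.3,
                   uint32_t interval_threshold = 30) {
    for (const auto& x : osds) {
        if (x.laggy_probability >= prob_threshold &&
            x.laggy_interval >= interval_threshold)
            return true;
    }
    return false;
}
```

With thresholds like these, an OSD with residual values such as `laggy_probability = 0.01` would no longer block eviction.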


Related issues (1 open, 0 closed)

Related to CephFS - Fix #58023: mds: do not evict clients if OSDs are laggy (Pending Backport, Dhairya Parmar)

Actions #1

Updated by Venky Shankar 2 months ago

  • Priority changed from High to Urgent
  • Severity changed from 3 - minor to 2 - major

Dhairya,

Before we go into improving the lagginess detection infrastructure, let's verify that there isn't a (corner case) bug in the config implementation. If we find one, then we need to send out a PSA to our users to disable this config (since it's enabled by default!) in some releases. Let's start by adding a test case to our qa suite to mimic the bug workflow.

Actions #2

Updated by Venky Shankar 2 months ago

  • Status changed from New to Triaged
Actions #3

Updated by Dhairya Parmar 2 months ago

Venky Shankar wrote:

Dhairya,

Before we go into improving the lagginess detection infrastructure, let's verify if there isn't a (corner case) bug in the config implementation. If we found one, then we need to send out a PSA to our users to disable this config (since it's enabled by default!) in some releases. Let's start by adding a test case to our qa suite to mimic the bug workflow.

Test case https://github.com/ceph/ceph/pull/55784:

New client stuck at doing I/O:

2024-02-28T08:14:43.920 DEBUG:tasks.cephfs.mount:Creating test file 2024-02-28 08:14:43.920480 0
2024-02-28T08:14:43.920 DEBUG:teuthology.orchestra.run.smithi071:> touch '/home/ubuntu/cephtest/mnt.0/2024-02-28 08:14:43.920480 0'
2024-02-28T08:14:59.829 DEBUG:teuthology.orchestra.run.smithi071:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2024-02-28T08:14:59.833 DEBUG:teuthology.orchestra.run.smithi164:> sudo logrotate /etc/logrotate.d/ceph-test.conf
<same logrotate logs until I killed the job>

The MDS logs indicate that since the old client isn't evicted, it still holds the caps needed by the new client to do I/O, so the new client is blocked indefinitely on its I/O.

2024-03-01T09:07:16.545+0000 7fcac435f640  0 log_channel(cluster) log [WRN] : 1 slow requests, 0 included below; oldest blocked for > 3479.724215 secs
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.server find_idle_sessions. last cleared laggy state 5763.15s ago
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.server evicting session client.6834 192.168.0.1:0/1179488633 since autoclose has arrived
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.server laggiest active session is client.5270 192.168.0.1:0/4099907527 and renewed caps recently (13.0557s ago)
2024-03-01T09:07:16.545+0000 7fcac435f640  5 mds.0.server Detected 1 laggy clients, possibly due to laggy OSDs. Eviction is skipped until the OSDs return to normal.
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.locker scatter_tick
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.locker caps_tick 1 revoking caps
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.locker caps_tick age = 3479.723644 client.6834.0x1
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.locker caps_tick silencing log message (backoff) for client.6834.0x1
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.bal handle_export_pins export_pin_queue size=0
2024-03-01T09:07:16.545+0000 7fcac435f640 15 mds.0.bal tick tick last_sample now 5763.148389s
2024-03-01T09:07:16.545+0000 7fcac435f640 15 mds.0.cache show_subtrees
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.cache |__ 0    auth [dir 0x1 / [2,head] auth v=1 cv=1/1 REP dir_auth=0 ap=0+1 state=1073741825|complete f() n() hs=0+1,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x557719352480]
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.cache |__ 0    auth [dir 0x100 ~mds0/ [2,head] auth v=1 cv=1/1 dir_auth=0 state=1073741825|complete f(v0 10=0+10) n(v0 rc2024-03-01T08:00:35.817341+0000 10=0+10) hs=10+0,ss=0+0 | child=1 subtree=1 dirty=0 authpin=0 0x557719352900]
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.cache find_stale_fragment_freeze
2024-03-01T09:07:16.545+0000 7fcac435f640 10 mds.0.snap check_osd_map - version unchanged
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.0.30 updating export targets, currently 0 ranks are targets
2024-03-01T09:07:16.545+0000 7fcac435f640 20 mds.beacon.a 1 slow request found
2024-03-01T09:07:16.833+0000 7fcac335d640 10 mds.pinger is_rank_lagging: rank=0
2024-03-01T09:07:16.833+0000 7fcac335d640  1 -- [v2:172.21.15.120:6835/3304576025,v1:172.21.15.120:6837/3304576025] --> [v2:172.21.15.120:6832/569419180,v1:172.21.15.120:6833/569419180] -- mgrreport(unknown.a +0-0 packed 1598) v9 -- 0x5577195c2e00 con 0x557719311000
2024-03-01T09:07:16.925+0000 7fcac8b68640  1 -- [v2:172.21.15.120:6835/3304576025,v1:172.21.15.120:6837/3304576025] <== client.7026 192.168.0.1:0/4001938067 3659 ==== client_metrics [client_metric_type: READ_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: WRITE_LATENCY latency: 0.000000, avg_latency: 0.000000, sq_sum: 0, count=0][client_metric_type: METADATA_LATENCY latency: 0.000442, avg_latency: 0.000442, sq_sum: 0, count=1][client_metric_type: CAP_INFO cap_hits: 8 cap_misses: 2 num_caps: 0][client_metric_type: DENTRY_LEASE dlease_hits: 0 dlease_misses: 0 num_dentries: 0][client_metric_type: OPENED_FILES opened_files: 0 total_inodes: 1][client_metric_type: PINNED_ICAPS pinned_icaps: 1 total_inodes: 1][client_metric_type: OPENED_INODES opened_inodes: 0 total_inodes: 1][client_metric_type: READ_IO_SIZES total_ops: 0 total_size: 0][client_metric_type: WRITE_IO_SIZES total_ops: 0 total_size: 0] v1 ==== 328+0+0 (crc 0 0 0) 0x5577192656c0 con 0x5577183db000

2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=CAP_INFO, session=0x55771942a500, hits=8, misses=2
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=DENTRY_LEASE, session=0x55771942a500, hits=0, misses=0
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=OPENED_FILES, session=0x55771942a500, opened_files=0, total_inodes=1
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=PINNED_ICAPS, session=0x55771942a500, pinned_icaps=1, total_inodes=1
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=OPENED_INODES, session=0x55771942a500, opened_inodes=0, total_inodes=1
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=READ_IO_SIZES, session=0x55771942a500, total_ops=0, total_size=0
2024-03-01T09:07:16.925+0000 7fcac8b68640 20 handle_payload: mds.metrics: type=WRITE_IO_SIZES, session=0x55771942a500, total_ops=0, total_size=0

How about revoking the caps of the laggy client and handing them back only after it returns to normal (i.e. when the OSDs turn normal)? If a new client joins, cap provisioning then happens as intended, and if the old client recovers we can renew its caps or perhaps ask it to reconnect.

Actions #4

Updated by Venky Shankar about 2 months ago

Actions #5

Updated by Venky Shankar about 2 months ago

  • Assignee set to Dhairya Parmar
  • Target version set to v20.0.0
  • Backport set to quincy,reef,squid
Actions #6

Updated by Dhairya Parmar about 2 months ago

I'm waiting for Patrick's/Venky's response on this, since they had discussed an approach involving changes to some protocol: https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Actions #7

Updated by Patrick Donnelly about 2 months ago

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.
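A minimal sketch of what such a session-message extension could carry (everything here beyond the existing `CEPH_SESSION_RENEWCAPS` op is hypothetical; the enum value and field names are made up for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Existing session op (value illustrative); RENEWCAPS is sent periodically
// by clients to renew their caps.
enum class SessionOp : uint8_t { RENEWCAPS = 3 /* , ... */ };

// Hypothetical extension: a RENEWCAPS message additionally reports caps the
// client cannot release because dirty-data flushes are still outstanding,
// and the primary OSDs those flushes are directed at.
struct ClientSessionMsg {
    SessionOp op = SessionOp::RENEWCAPS;
    uint64_t  unreleasable_caps = 0;   // cap bits held back by pending flushes
    std::set<int32_t> implicated_osds; // primary OSDs the flushes depend on
};

// Only clients speaking the extended protocol populate the new fields; an
// old client's message simply carries an empty OSD set.
bool reports_flush_dependency(const ClientSessionMsg& m) {
    return m.op == SessionOp::RENEWCAPS && !m.implicated_osds.empty();
}
```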

Actions #8

Updated by Dhairya Parmar about 2 months ago

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

Actions #9

Updated by Venky Shankar about 2 months ago

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

Actions #10

Updated by Venky Shankar about 2 months ago

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Hmmm.. the mds would still need to infer the set of OSDs in the data pool(s) that are laggy and then skip evicting clients which are flushing data to this (sub)set of laggy OSDs, right? IIRC somewhere in the original PR, Greg mentioned that this could be a bit computationally expensive. Are we clear on that approach?

Actions #11

Updated by Venky Shankar about 2 months ago

Venky Shankar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Hmmm.. the mds would still need to infer the set of OSDs in the data pool(s) that are laggy and then skip evicting clients which are flushing data to this (sub)set of laggy OSDs, right? IIRC somewhere in the original PR, Greg mentioned that this could be a bit computationally expensive. Are we clear on that approach?

Maybe the computationally expensive part involved the MDS trying to infer this all on its own. I need to revisit that discussion.

Actions #12

Updated by Dhairya Parmar about 2 months ago

The old client can simply be put on hold: revoke its caps and pause its I/O. Wait until autoclose arrives (default 300s); if the client comes back alive within that window (< 300s), renew its caps, else evict it.
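That proposal boils down to a small decision rule (a sketch; the function and enum names are made up, and `session_autoclose` is the existing 300s default):

```cpp
#include <cassert>

enum class LaggyClientAction { HOLD, RENEW_CAPS, EVICT };

// While a client is laggy, revoke its caps and pause its I/O instead of
// deciding immediately. If it comes back within session_autoclose (default
// 300s), renew its caps; once the deadline passes without contact, evict.
LaggyClientAction decide(double secs_since_last_contact,
                         bool client_responded,
                         double session_autoclose = 300.0) {
    if (client_responded && secs_since_last_contact < session_autoclose)
        return LaggyClientAction::RENEW_CAPS;
    if (secs_since_last_contact >= session_autoclose)
        return LaggyClientAction::EVICT;
    return LaggyClientAction::HOLD;  // still within the grace window
}
```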

Actions #13

Updated by Dhairya Parmar about 2 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

whats the point of the config then if it has to be kept disabled. i believe fixes are possible keeping this config turned on.

Actions #14

Updated by Venky Shankar about 2 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

whats the point of the config then if it has to be kept disabled. i believe fixes are possible keeping this config turned on.

For backporting - we cannot force everyone to upgrade their clusters, can we?

Actions #15

Updated by Dhairya Parmar about 2 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

whats the point of the config then if it has to be kept disabled. i believe fixes are possible keeping this config turned on.

for backporting - we cannot force everyone to upgrade their clusters, isn't it?

we can backport the fixes too

Actions #16

Updated by Venky Shankar about 2 months ago

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

whats the point of the config then if it has to be kept disabled. i believe fixes are possible keeping this config turned on.

for backporting - we cannot force everyone to upgrade their clusters, isn't it?

we can backport the fixes too

Dhairya, the fixes can take a while before they land and get backported. Would you agree? Until then, we have to ask users to disable the config. Also, to be good citizens we should not ship subsequent releases with the default enabled when we know it's prone to cause problems.

Actions #17

Updated by Venky Shankar about 2 months ago

Dhairya Parmar wrote:

the old client can simply be put on hold - revoke caps and pause I/O. Wait for the time autoclose arrives(def 300s); if it comes back live (< 300s), renew the caps else evict it.

I have to think through the details. I'm revisiting the discussion on the PR again today. Will update in a while...

EDIT: revisit the discussion on the original PR.

Actions #18

Updated by Dhairya Parmar about 2 months ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Patrick Donnelly wrote:

Dhairya Parmar wrote:

I'm waiting for patrick/venky's response on this since they had discussed some approach regarding changes to some protocol https://github.com/ceph/ceph/pull/55784#issuecomment-1978214384

Basically we've learned that we cannot really use the OSDMap to decide whether a client is sitting on cap revokes because it cannot flush data. The solution proposed was to enhance MClientSession with op == CEPH_SESSION_RENEWCAPS (which is used to renew caps) to indicate which caps cannot be released due to dirty data flushes outstanding and possibly which OSDs are implicated.

Obviously only new clients could communicate this to the MDS.

what about old client?

The config should be disabled by default so that old clients are evicted as they were done before this change was introduced.

whats the point of the config then if it has to be kept disabled. i believe fixes are possible keeping this config turned on.

for backporting - we cannot force everyone to upgrade their clusters, isn't it?

we can backport the fixes too

Dhairya, the fix can take a while before they land and get backported. Would you agree?

I was looking at this from the perspective of the new changes, i.e. my understanding was that allowing client eviction was the way ahead. I think I conflated two things here: disabling the config and the workaround for the problem.

Actions #19

Updated by Venky Shankar about 2 months ago

  • Related to Fix #58023: mds: do not evict clients if OSDs are laggy added
Actions #20

Updated by Dhairya Parmar about 1 month ago

Greg and I had a discussion on this today. I think the first step should be to remove the laggy-client detection from Server::find_idle_sessions, since we all agreed that it is totally incorrect. I remember Venky had a commit in https://github.com/ceph/ceph/pull/53578 that removed the detection code from that function; I think it would be great to have that commit back (sorry Venky, you were right at the time) AND a better way of saying "an OSD is laggy". I'll try to get in touch with someone from RADOS and come up with more sensible parameters for OSDMap::any_osd_laggy(). So, in summary: removing the laggy-client detection code from Server::find_idle_sessions plus enhancements to OSDMap::any_osd_laggy() is what I see as the first step.

As for next steps, I'm really looking forward to Patrick's idea of making protocol changes so that the client can report that it is communicating with an OSD whose comms appear laggy. That would also mean trusting the client to transmit correct information, though. To double down on this: since we're making OSDMap::any_osd_laggy() better, the MDS can use it to check whether any OSD really is laggy before making the decision. But there are caveats here too - e.g. if the client is flushing huge buffers, it might not respond to a cap revoke in time, and if another OSD happens to go laggy in the cluster, we could falsely conclude that it is because of "some laggy OSD" that the client is unable to release caps.

Actions #21

Updated by Venky Shankar about 1 month ago

Dhairya Parmar wrote:

Me and greg had a discussion on this today, I think the first step should be to remove the laggy client detection from Server::find_idle_sesisons since we all agreed that it is totally incorrect, I remember venky had a commit in https://github.com/ceph/ceph/pull/53578 that removed the detection code from the func. I think it would be great to have that commit (sorry venky, you were right at that time) AND better way of saying "an OSD is laggy".

I've been in and out making changes to that routine that I've lost track of what all was done :O

Anyway, now it does look like the only part we need to be bothered about is evict_cap_revoke_non_responders().

I'll try to get in touch with someone from RADOS and come up with more sensible parameters for OSDMap::any_osd_laggy(). So in summary the removal of laggy clients detection code from the Server::find_idle_sessions + enhancements to OSDMap::any_osd_laggy() is what I'm looking as the first step.

As far as next steps are concerned, I'm really looking forward to patrick's idea of making protocol changes to client code in a way that it says it is communicating with an OSD which it seems the comms are laggy, but it would also mean we need to trust client that it is actually transmitting us correct info, to double down on this I think since we're making OSDMap::any_osd_laggy() better,

So, this is what the flow would look like:

- client initiates a write out of buffered data to OSDs as part of cap revoke by the MDS
- objecter call is made to flush the data to a set of (file data) objects in the data pool
- for this set of objects, the client would need to know the ids of the primary OSDs (is there a way to fetch this from the objecter?)
- include this information as part of RENEW_CAPS messages periodically sent to the MDS

Now, should this information always be sent by the clients to the MDS irrespective of whether the MDS is revoking caps or not? IMO, it should be conditional. Also, the list of primary OSDs that the client is flushing out the data to can be sent in an interval_set.
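The client-side steps above could look roughly like this (a sketch; the primary-OSD lookup is stubbed out with a callback, since whether the objecter exposes such a query is exactly the open question in the list above):

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

// Stub for "which primary OSD serves this object"; in reality this would
// have to come from the objecter / OSDMap, if such a lookup is exposed.
using PrimaryOsdLookup = std::function<int32_t(uint64_t object_id)>;

// Collect the primary OSD of every object a pending flush targets, to be
// attached to a RENEW_CAPS message. Per the note above, this should only
// be sent while the MDS is actually revoking caps.
std::set<int32_t> flush_target_osds(const std::vector<uint64_t>& dirty_objects,
                                    bool caps_being_revoked,
                                    const PrimaryOsdLookup& primary_of) {
    std::set<int32_t> osds;            // dedupes; an interval_set would be
    if (!caps_being_revoked)           // more compact for large clusters
        return osds;
    for (uint64_t obj : dirty_objects)
        osds.insert(primary_of(obj));
    return osds;
}
```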

MDS can make use of it and if it sees any OSD laggy there, and then make the decision but there are caveats to this as well - e.g. if the client is flushing huge buffers, it might not respond back to cap revoke in time and if another OSD got laggy in the cluster, it could falsely conclude that it is because of "some laggy OSD" that the client is unable to revoke caps.

The MDS should know the list of OSDs that are laggy to compare it against the OSDs set sent by the client, isn't it?

Actions #22

Updated by Dhairya Parmar about 1 month ago

Venky Shankar wrote:

Dhairya Parmar wrote:

Me and greg had a discussion on this today, I think the first step should be to remove the laggy client detection from Server::find_idle_sesisons since we all agreed that it is totally incorrect, I remember venky had a commit in https://github.com/ceph/ceph/pull/53578 that removed the detection code from the func. I think it would be great to have that commit (sorry venky, you were right at that time) AND better way of saying "an OSD is laggy".

I've been in and out making changes to that routine that I've lost track of what all was done :O

Anyway, now it does look like the only part we need to be bothered about is evict_cap_revoke_non_responders().

I'll try to get in touch with someone from RADOS and come up with more sensible parameters for OSDMap::any_osd_laggy(). So in summary the removal of laggy clients detection code from the Server::find_idle_sessions + enhancements to OSDMap::any_osd_laggy() is what I'm looking as the first step.

As far as next steps are concerned, I'm really looking forward to patrick's idea of making protocol changes to client code in a way that it says it is communicating with an OSD which it seems the comms are laggy, but it would also mean we need to trust client that it is actually transmitting us correct info, to double down on this I think since we're making OSDMap::any_osd_laggy() better,

So, this is what the flow would look like:

- client initiates a write out of buffered data to OSDs as part of cap revoke by the MDS
- objecter call is made to flush the data to a set of (file data) objects in the data pool
- for these set of objects, the client would need to know the id of the primary OSDs (is there an way to fetch this from the objecter?)
- include this information as part of RENEW_CAPS messages periodically sent to the MDS

Now, should this information always be sent by the clients to the MDS irrespective of whether the MDS is revoking caps or not? IMO, it should be conditional. Also, the list of primary OSDs that the client is flushing out the data to can be sent in an interval_set.

MDS can make use of it and if it sees any OSD laggy there, and then make the decision but there are caveats to this as well - e.g. if the client is flushing huge buffers, it might not respond back to cap revoke in time and if another OSD got laggy in the cluster, it could falsely conclude that it is because of "some laggy OSD" that the client is unable to revoke caps.

The MDS should know the list of OSDs that are laggy to compare it against the OSDs set sent by the client, isn't it?

Oh right - is there any chance we can streamline this? I.e., let's say the client does not send the full list of primary OSDs but only those it believes are laggy; the MDS can then peek into the list it gathered from the OSDMap and say "okay, you're right, these OSDs really are laggy" or "you lied to me, these OSDs are fine, you need to be evicted"?
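A sketch of that cross-check on the MDS side (names hypothetical; the MDS's own laggy set would come from the OSDMap):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>

// The client reports only the OSDs it believes are laggy; the MDS verifies
// every reported OSD against the laggy set it derived from the OSDMap. If
// any reported OSD is in fact healthy, the claim is rejected and the client
// becomes a candidate for eviction.
bool client_claim_holds(const std::set<int32_t>& client_reported,
                        const std::set<int32_t>& osdmap_laggy) {
    if (client_reported.empty())
        return false;  // no laggy-OSD excuse offered at all
    // std::includes requires sorted ranges; std::set iterates in order.
    return std::includes(osdmap_laggy.begin(), osdmap_laggy.end(),
                         client_reported.begin(), client_reported.end());
}
```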

Actions #23

Updated by Venky Shankar 29 days ago

Dhairya Parmar wrote:

Venky Shankar wrote:

Dhairya Parmar wrote:

Me and greg had a discussion on this today, I think the first step should be to remove the laggy client detection from Server::find_idle_sesisons since we all agreed that it is totally incorrect, I remember venky had a commit in https://github.com/ceph/ceph/pull/53578 that removed the detection code from the func. I think it would be great to have that commit (sorry venky, you were right at that time) AND better way of saying "an OSD is laggy".

I've been in and out making changes to that routine that I've lost track of what all was done :O

Anyway, now it does look like the only part we need to be bothered about is evict_cap_revoke_non_responders().

I'll try to get in touch with someone from RADOS and come up with more sensible parameters for OSDMap::any_osd_laggy(). So in summary the removal of laggy clients detection code from the Server::find_idle_sessions + enhancements to OSDMap::any_osd_laggy() is what I'm looking as the first step.

As far as next steps are concerned, I'm really looking forward to patrick's idea of making protocol changes to client code in a way that it says it is communicating with an OSD which it seems the comms are laggy, but it would also mean we need to trust client that it is actually transmitting us correct info, to double down on this I think since we're making OSDMap::any_osd_laggy() better,

So, this is what the flow would look like:

- client initiates a write out of buffered data to OSDs as part of cap revoke by the MDS
- objecter call is made to flush the data to a set of (file data) objects in the data pool
- for these set of objects, the client would need to know the id of the primary OSDs (is there an way to fetch this from the objecter?)
- include this information as part of RENEW_CAPS messages periodically sent to the MDS

Now, should this information always be sent by the clients to the MDS irrespective of whether the MDS is revoking caps or not? IMO, it should be conditional. Also, the list of primary OSDs that the client is flushing out the data to can be sent in an interval_set.

MDS can make use of it and if it sees any OSD laggy there, and then make the decision but there are caveats to this as well - e.g. if the client is flushing huge buffers, it might not respond back to cap revoke in time and if another OSD got laggy in the cluster, it could falsely conclude that it is because of "some laggy OSD" that the client is unable to revoke caps.

The MDS should know the list of OSDs that are laggy to compare it against the OSDs set sent by the client, isn't it?

Oh right, is there any chance we can streamline this? I.e. lets say client is not sending the full list of the primary OSDs but only those which it feels are laggy, then the MDS can peek into the list it has which it gathered from OSDMap and say "okay you're right, these OSDs are really laggy" or "you lied to me, these OSDs are fine, you need to be evicted"?

If the client has to know that it's flushing data to a laggy OSD, then there isn't much gained by having it send a filtered list of osd-ids to the MDS. IMO, the client should just send the primary osd-ids it is flushing data to. This set might be large, though.

Actions #24

Updated by Venky Shankar 18 days ago

Dhairya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Actions #25

Updated by Dhairya Parmar 17 days ago

Venky Shankar wrote in #note-24:

Dhariya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Should there be some mechanism (on the MDS side) to cross-check whether clients are telling the truth? Patrick mentioned we shouldn't; what's your take? I'll move ahead with the consensus.

Actions #26

Updated by Venky Shankar 17 days ago

Dhairya Parmar wrote in #note-25:

Venky Shankar wrote in #note-24:

Dhairya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Should there be some mechanism (on the MDS side) to cross-check whether clients are telling the truth? Patrick mentioned we shouldn't; what's your take? I'll move ahead with the consensus.

Trust the list of OSDs they are flushing the data out to and are sending to the MDSs? I don't think we need to incorporate that.

Actions #27

Updated by Dhairya Parmar 17 days ago

Venky Shankar wrote in #note-26:

Dhairya Parmar wrote in #note-25:

Venky Shankar wrote in #note-24:

Dhairya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Should there be some mechanism (on the MDS side) to cross-check whether clients are telling the truth? Patrick mentioned we shouldn't; what's your take? I'll move ahead with the consensus.

Trust the list of OSDs they are flushing the data out to and are sending to the MDSs? I don't think we need to incorporate that.

The MDS should not trust the client? I didn't get whether you're for or against it.

Actions #28

Updated by Dhairya Parmar 17 days ago

As mentioned in yesterday's standup - some of the PRs (https://github.com/ceph/ceph/pull/49971, https://github.com/ceph/ceph/pull/53839) need to be reverted, but there is also this PR that made changes to the client detection code, https://github.com/ceph/ceph/pull/53578, which might also need to be reverted. I'll wait for a response on this before moving ahead.

Actions #29

Updated by Dhairya Parmar 17 days ago

Dhairya Parmar wrote in #note-27:

Venky Shankar wrote in #note-26:

Dhairya Parmar wrote in #note-25:

Venky Shankar wrote in #note-24:

Dhairya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Should there be some mechanism (on the MDS side) to cross-check whether clients are telling the truth? Patrick mentioned we shouldn't; what's your take? I'll move ahead with the consensus.

Trust the list of OSDs they are flushing the data out to and are sending to the MDSs? I don't think we need to incorporate that.

The MDS should not trust the client? I didn't get whether you're for or against it.

And yes, this too is pending a decision.

Actions #30

Updated by Venky Shankar 16 days ago

Dhairya Parmar wrote in #note-27:

Venky Shankar wrote in #note-26:

Dhairya Parmar wrote in #note-25:

Venky Shankar wrote in #note-24:

Dhairya,

Anything blocking w.r.t. the design for this enhancement? The laggy OSD list is obviously something that needs to be constructed, but apart from that, from the client/mds POV, is there anything else pending?

Should there be some mechanism (on the MDS side) to cross-check whether clients are telling the truth? Patrick mentioned we shouldn't; what's your take? I'll move ahead with the consensus.

Trust the list of OSDs they are flushing the data out to and are sending to the MDSs? I don't think we need to incorporate that.

The MDS should not trust the client? I didn't get whether you're for or against it.

OK. I'll elaborate. Generally, clients are not trustable - someone can hook up a patched-up client that can trick the server into functioning inappropriately. For this case, IMO, it's fine for the MDS to use the list of OSDs sent by the client to cross-check whether some of those OSDs are laggy and therefore delay eviction. Now, someone could deliberately always send an osd-id that's laggy and thereby evade client eviction, but inferring that accurately isn't trivial.
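Tying this back to the eviction path referenced in the tracker description (where a last cap renew span beyond session_autoclose, 300 seconds by default, means the client is lost), the overall decision could be sketched as follows. All names here are hypothetical illustrations, not actual Ceph code:

```python
SESSION_AUTOCLOSE = 300.0  # seconds; Ceph's default session_autoclose

def evict_or_defer(last_cap_renew_span, reported_osds, laggy_osds):
    """Decide what to do with a non-responsive client.
    Evict if the client has been silent past session_autoclose, even
    when laggy OSDs exist; otherwise defer only if an OSD the client
    reports is actually laggy per the OSDMap. Hypothetical sketch."""
    if last_cap_renew_span > SESSION_AUTOCLOSE:
        return "evict"  # client is lost; laggy OSDs are no excuse
    if set(reported_osds) & set(laggy_osds):
        return "defer"  # OSD lagginess plausibly explains the slowness
    return "evict"      # no confirmed laggy OSD to blame
```

This captures both halves of the proposed fix: eviction is never suppressed indefinitely just because some OSD somewhere is laggy, and deferral requires overlap between the client's report and the MDS's own view.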

Actions #31

Updated by Dhairya Parmar 15 days ago

Apart from the discussion about the MDS verifying the client's OSD set, https://tracker.ceph.com/issues/64563#note-28 also needs to be addressed.

Actions #32

Updated by Venky Shankar 15 days ago

Dhairya Parmar wrote in #note-28:

As mentioned in yesterday's standup - some of the PRs (https://github.com/ceph/ceph/pull/49971, https://github.com/ceph/ceph/pull/53839) need to be reverted, but there is also this PR that made changes to the client detection code, https://github.com/ceph/ceph/pull/53578, which might also need to be reverted. I'll wait for a response on this before moving ahead.

Yes, those need to be reverted. You can have those reverted in the PR you create for this feature.

Actions #33

Updated by Greg Farnum 9 days ago

Venky Shankar wrote in #note-30:

OK. I'll elaborate. Generally, clients are not trustable - someone can hook up a patched-up client that can trick the server into functioning inappropriately. For this case, IMO, it's fine for the MDS to use the list of OSDs sent by the client to cross-check whether some of those OSDs are laggy and therefore delay eviction. Now, someone could deliberately always send an osd-id that's laggy and thereby evade client eviction, but inferring that accurately isn't trivial.

CephFS clients are trusted by default — they own part of our cache. Yes, we try to minimize that surface area where we can, but not trusting them on this seems far to the side of where we normally err. It's just denying data access to things the client owns!

But really my main concern is that verifying a client's report, and doing it well, is hard. We only have remote checks like if we happen to have our own IO to the same OSD, or the lagginess values in the OSDMap. There are lots of ways an OSD could be slowing down a client without showing up in those metrics, and I think we should trust the client rather than doing a more complicated thing to test for malicious code?

Actions #34

Updated by Patrick Donnelly 9 days ago

Venky Shankar wrote in #note-32:

Dhairya Parmar wrote in #note-28:

As mentioned in yesterday's standup - some of the PRs (https://github.com/ceph/ceph/pull/49971, https://github.com/ceph/ceph/pull/53839) need to be reverted, but there is also this PR that made changes to the client detection code, https://github.com/ceph/ceph/pull/53578, which might also need to be reverted. I'll wait for a response on this before moving ahead.

Yes, those need to be reverted. You can have those reverted in the PR you create for this feature.

+1

Actions #35

Updated by Patrick Donnelly 9 days ago

Greg Farnum wrote in #note-33:

Venky Shankar wrote in #note-30:

OK. I'll elaborate. Generally, clients are not trustable - someone can hook up a patched-up client that can trick the server into functioning inappropriately. For this case, IMO, it's fine for the MDS to use the list of OSDs sent by the client to cross-check whether some of those OSDs are laggy and therefore delay eviction. Now, someone could deliberately always send an osd-id that's laggy and thereby evade client eviction, but inferring that accurately isn't trivial.

CephFS clients are trusted by default — they own part of our cache. Yes, we try to minimize that surface area where we can, but not trusting them on this seems far to the side of where we normally err. It's just denying data access to things the client owns!

+1

But really my main concern is that verifying a client's report, and doing it well, is hard. We only have remote checks like if we happen to have our own IO to the same OSD, or the lagginess values in the OSDMap. There are lots of ways an OSD could be slowing down a client without showing up in those metrics, and I think we should trust the client rather than doing a more complicated thing to test for malicious code?

I agree. It's not productive to check if the client is correct (at this time).
