Bug #61947
mds: enforce a limit on the size of a session in the sessionmap
Status: Closed
Description
If the session's "completed_requests" vector grows too large, the session can reach a size where the MDS goes read-only because the OSD rejects the sessionmap object update with "Message too long".
    2023-07-10 13:53:30.529 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.529 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.530 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:35.635 7f8fe687e700 -1 mds.0.2679609 unhandled write error (90) Message too long, force readonly...
    2023-07-10 13:53:35.635 7f8fe687e700 1 mds.0.cache force file system read-only
    2023-07-10 13:53:35.635 7f8fe687e700 0 log_channel(cluster) log [WRN] : force file system read-only
If a session exceeds some configurable encoded size (maybe 16MB), then evict it.
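A minimal sketch of the kind of guard proposed here, assuming a configurable size threshold; the names Session, estimate_encoded_size and kSessionSizeThreshold below are hypothetical stand-ins for illustration, not the actual MDS code:

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <iostream>

    // Hypothetical stand-in for an MDS session: its encoded size is dominated
    // by the completed_requests list when a client never advances its
    // oldest_client_tid.
    struct Session {
        uint64_t client_id = 0;
        std::deque<uint64_t> completed_requests;  // completed request tids
    };

    // Rough estimate of the session's encoded size: fixed overhead plus one
    // entry per completed request (illustrative numbers only).
    static std::size_t estimate_encoded_size(const Session& s) {
        constexpr std::size_t kFixedOverhead = 512;
        constexpr std::size_t kBytesPerEntry = sizeof(uint64_t);
        return kFixedOverhead + s.completed_requests.size() * kBytesPerEntry;
    }

    int main() {
        // Assumed configurable limit, e.g. the 16 MB suggested above.
        constexpr std::size_t kSessionSizeThreshold = 16u * 1024 * 1024;

        Session s;
        s.client_id = 744507717;
        // Simulate a client that recorded ~5.9M completed requests without
        // advancing its oldest_client_tid, as in the log above.
        s.completed_requests.resize(5905929);

        if (estimate_encoded_size(s) > kSessionSizeThreshold) {
            // In the real MDS the remedy would be to evict the client before
            // the sessionmap object update exceeds the OSD message limit.
            std::cout << "client." << s.client_id
                      << " session exceeds threshold; evict it\n";
        }
        return 0;
    }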
Updated by Venky Shankar 10 months ago
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
Patrick, was the client ceph-mgr in this case?
Updated by Patrick Donnelly 10 months ago
Venky Shankar wrote:
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
There is a genuine bug somewhere that needs to be tracked down, but the MDS shouldn't fail like this if a client is buggy.
Patrick, was the client ceph-mgr in this case?
No, it was Ganesha.
Updated by Venky Shankar 10 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
There is a genuine bug somewhere that needs to be tracked down, but the MDS shouldn't fail like this if a client is buggy.
Yeh. For now, maybe blocklist the client if the completed_requests count shoots above a limit.
It could be a buggy client or a bug in the MDS - we've seen reports where the client is ceph-mgr (libcephfs) and even the kclient. I suspect a certain code path is failing to mark the session as dirty (in LogSegment::touched_sessions).
EDIT: That is, not just delaying persisting the session map, but also accumulating it in memory.
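A count-based variant of that suggestion, sketched under the same assumptions; kWarnAt and kMaxCompletedRequests below are illustrative values, not real config options:

    #include <cstddef>
    #include <iostream>

    // Hypothetical per-session check: warn while completed_requests grows past
    // a soft limit (the client is not advancing oldest_client_tid) and
    // blocklist once it crosses a hard limit, instead of letting the
    // sessionmap update grow until the OSD rejects it.
    enum class Action { None, Warn, Blocklist };

    static Action check_completed_requests(std::size_t count) {
        constexpr std::size_t kWarnAt = 100000;                 // soft limit (illustrative)
        constexpr std::size_t kMaxCompletedRequests = 1000000;  // hard limit (illustrative)
        if (count > kMaxCompletedRequests) return Action::Blocklist;
        if (count > kWarnAt) return Action::Warn;
        return Action::None;
    }

    int main() {
        std::size_t recorded = 5905929;  // value seen in the log above
        switch (check_completed_requests(recorded)) {
        case Action::Blocklist:
            std::cout << "blocklist the client before the session becomes unwritable\n";
            break;
        case Action::Warn:
            std::cout << "client does not advance its oldest_client_tid; warn\n";
            break;
        case Action::None:
            break;
        }
        return 0;
    }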
Updated by Venky Shankar 9 months ago
Leonid, I forgot to update the tracker assignee after our sync. I'm about 50% done with the implementation.
Updated by Venky Shankar 9 months ago
- Assignee changed from Leonid Usov to Venky Shankar
Updated by Venky Shankar 9 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 52944
Updated by Venky Shankar 8 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 8 months ago
- Copied to Backport #62583: reef: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Backport Bot 8 months ago
- Copied to Backport #62584: pacific: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Backport Bot 8 months ago
- Copied to Backport #62585: quincy: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Xiubo Li 6 months ago
- Related to Bug #63364: MDS_CLIENT_OLDEST_TID: 15 clients failing to advance oldest client/flush tid added
Updated by Konstantin Shalygin 5 months ago
- Status changed from Pending Backport to Resolved
Updated by Niklas Hambuechen about 2 months ago
Potentially related issue:
- https://tracker.ceph.com/issues/64852 - MDS hangs on "joining batch getattr" when client does statx