Bug #61947
mds: enforce a limit on the size of a session in the sessionmap
Status: Closed
Description
If the session's "completed_requests" vector grows too large, the session can reach a size where the MDS goes read-only because the OSD rejects the sessionmap object update with "Message too long".
    2023-07-10 13:53:30.529 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.529 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.530 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:30.534 7f8fed08b700 0 log_channel(cluster) log [WRN] : client.744507717 does not advance its oldest_client_tid (3221389957), 5905929 completed requests recorded in session
    2023-07-10 13:53:35.635 7f8fe687e700 -1 mds.0.2679609 unhandled write error (90) Message too long, force readonly...
    2023-07-10 13:53:35.635 7f8fe687e700 1 mds.0.cache force file system read-only
    2023-07-10 13:53:35.635 7f8fe687e700 0 log_channel(cluster) log [WRN] : force file system read-only
If a session exceeds some configurable encoded size (maybe 16MB), then evict it.
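A minimal sketch of the kind of guard proposed here, assuming a configurable size threshold; the names Session, estimate_encoded_size and kSessionSizeThreshold below are hypothetical stand-ins for illustration, not the actual MDS code:

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <iostream>

    // Hypothetical stand-in for an MDS session: its encoded size is dominated
    // by the completed_requests list when a client never advances its
    // oldest_client_tid.
    struct Session {
        uint64_t client_id = 0;
        std::deque<uint64_t> completed_requests;  // completed request tids
    };

    // Rough estimate of the session's encoded size: fixed overhead plus one
    // entry per completed request (illustrative numbers only).
    static std::size_t estimate_encoded_size(const Session& s) {
        constexpr std::size_t kFixedOverhead = 512;
        constexpr std::size_t kBytesPerEntry = sizeof(uint64_t);
        return kFixedOverhead + s.completed_requests.size() * kBytesPerEntry;
    }

    int main() {
        // Assumed configurable limit, e.g. the 16 MB suggested above.
        constexpr std::size_t kSessionSizeThreshold = 16u * 1024 * 1024;

        Session s;
        s.client_id = 744507717;
        // Simulate a client that recorded ~5.9M completed requests without
        // advancing its oldest_client_tid, as in the log above.
        s.completed_requests.resize(5905929);

        if (estimate_encoded_size(s) > kSessionSizeThreshold) {
            // In the real MDS the remedy would be to evict the client before
            // the sessionmap object update exceeds the OSD message limit.
            std::cout << "client." << s.client_id
                      << " session exceeds threshold; evict it\n";
        }
        return 0;
    }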
Updated by Venky Shankar 10 months ago
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
Patrick, was the client ceph-mgr in this case?
Updated by Patrick Donnelly 10 months ago
Venky Shankar wrote:
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
There is a genuine bug somewhere that needs to be tracked down, but the MDS shouldn't fail like this if a client is buggy.
Patrick, was the client ceph-mgr in this case?
No, it was Ganesha.
Updated by Venky Shankar 10 months ago
Patrick Donnelly wrote:
Venky Shankar wrote:
This one's interesting. I did mention in the standup yesterday that I've seen this earlier, and that cluster too had NFS Ganesha. However, I dug up the BZ and, surprisingly, the client was ceph-mgr, which was building up lots of completed_requests, and that resulted in journal I/O affecting performance. One thing suspected back then was SELinux relabeling on the PVC that somehow caused lots of unacknowledged ops to build up for ceph-mgr.
There is a genuine bug somewhere that needs to be tracked down, but the MDS shouldn't fail like this if a client is buggy.
Yeh. For now, maybe blocklist the client if the completed_requests count shoots above a limit.
It could be a buggy client or a bug in the MDS - we've seen reports where the client is ceph-mgr (libcephfs) and even the kclient. I suspect a certain code path is failing to mark the session as dirty (in LogSegment::touched_sessions).
EDIT: That is, not just delaying persisting the session map, but also accumulating it in memory.
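A count-based variant of that suggestion, sketched under the same assumptions; kWarnAt and kMaxCompletedRequests below are illustrative values, not real config options:

    #include <cstddef>
    #include <iostream>

    // Hypothetical per-session check: warn while completed_requests grows past
    // a soft limit (the client is not advancing oldest_client_tid) and
    // blocklist once it crosses a hard limit, instead of letting the
    // sessionmap update grow until the OSD rejects it.
    enum class Action { None, Warn, Blocklist };

    static Action check_completed_requests(std::size_t count) {
        constexpr std::size_t kWarnAt = 100000;                 // soft limit (illustrative)
        constexpr std::size_t kMaxCompletedRequests = 1000000;  // hard limit (illustrative)
        if (count > kMaxCompletedRequests) return Action::Blocklist;
        if (count > kWarnAt) return Action::Warn;
        return Action::None;
    }

    int main() {
        std::size_t recorded = 5905929;  // value seen in the log above
        switch (check_completed_requests(recorded)) {
        case Action::Blocklist:
            std::cout << "blocklist the client before the session becomes unwritable\n";
            break;
        case Action::Warn:
            std::cout << "client does not advance its oldest_client_tid; warn\n";
            break;
        case Action::None:
            break;
        }
        return 0;
    }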
Updated by Venky Shankar 9 months ago
Leonid, I forgot to update the tracker assignee after our sync. I'm about 50% done with the implementation.
Updated by Venky Shankar 9 months ago
- Assignee changed from Leonid Usov to Venky Shankar
Updated by Venky Shankar 9 months ago
- Status changed from New to Fix Under Review
- Pull request ID set to 52944
Updated by Venky Shankar 8 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 8 months ago
- Copied to Backport #62583: reef: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Backport Bot 8 months ago
- Copied to Backport #62584: pacific: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Backport Bot 8 months ago
- Copied to Backport #62585: quincy: mds: enforce a limit on the size of a session in the sessionmap added
Updated by Xiubo Li 6 months ago
- Related to Bug #63364: MDS_CLIENT_OLDEST_TID: 15 clients failing to advance oldest client/flush tid added
Updated by Konstantin Shalygin 5 months ago
- Status changed from Pending Backport to Resolved
Updated by Niklas Hambuechen about 2 months ago
Potentially related issue:
- https://tracker.ceph.com/issues/64852 - MDS hangs on "joining batch getattr" when client does statx