Bug #52494 (closed)

Client servers lock up if Ceph cluster has problems

Added by David B over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: fs/ceph
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hey,

We have a Ceph cluster and a Kubernetes cluster whose pods use CephFS for storing files. Whenever our Ceph cluster ran into performance or other problems, some of our Kubernetes nodes would become unresponsive, and the only way to recover them was a hard reset. This looked like bug https://tracker.ceph.com/issues/46284.

We are running Debian 10 with a 5.10 kernel from backports, which is based on 5.10.46 and doesn't include the fix for #46284. So I pulled down the upstream kernel git repo, found the commit with the fix in the linux-5.10.y branch, and made a patch from it. I then applied this patch to the Debian 5.10 kernel source, compiled it, and installed the resulting kernel on half of our Kubernetes nodes.
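
Roughly, the backport looked like the following. This is a sketch rather than the exact commands: <fix-commit-sha> is a placeholder for the #46284 fix commit on linux-5.10.y, the Debian source path is hypothetical, and Debian's official packages are normally built through debian/rules rather than bindeb-pkg:

    # Sketch of the backport workflow; <fix-commit-sha> is a placeholder
    # for the commit that fixes #46284 on the linux-5.10.y stable branch.
    git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
    cd linux
    git checkout linux-5.10.y

    # Find the fix among the CephFS changes since 5.10.46 and export it.
    git log --oneline v5.10.46..linux-5.10.y -- fs/ceph
    git format-patch -1 <fix-commit-sha> -o /tmp

    # Apply the patch to the Debian 5.10 kernel source and rebuild.
    cd /path/to/debian-linux-source
    patch -p1 < /tmp/0001-*.patch
    make -j"$(nproc)" bindeb-pkg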

Recently we had jobs running on Kubernetes that made a lot of file changes on CephFS and put a high load on the MDSs. During that time 2 of our Kubernetes nodes (out of 200 in total) locked up, both of them running the kernel with the #46284 patch. This doesn't seem to be the same bug: I couldn't find any mention of ceph_check_delayed_caps in the logs; instead they usually mention ceph_flush_dirty_caps. I've attached a file with some logs from /var/log/messages of both nodes.
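
In case it helps with triage, something like the following is how the two code paths can be compared in the logs (a minimal sketch; the log location is from our Debian 10 setup and may differ elsewhere):

    # Count kernel log lines mentioning each cap-flushing function.
    # /var/log/messages is where our Debian 10 nodes log kernel messages.
    grep -c 'ceph_check_delayed_caps' /var/log/messages
    grep -c 'ceph_flush_dirty_caps' /var/log/messages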

Now I don't know whether this is a problem with our configuration, whether I made a mistake compiling the kernel, or whether we have hit a new bug.

Any insight would be appreciated.


Files

ceph-errors (38.4 KB) David B, 09/02/2021 01:11 PM
ceph-errors-node039 (94.8 KB) David B, 09/07/2021 12:47 PM