Bug #57082
Status: Closed
MDS services failing without error.
Description
So, just to start, this is a training cluster, so if my data is gone, no big deal.
The other night I was trying to optimize my writeback cache settings and set the cache max bytes to 0 by mistake, and it used up all my SSD capacity, which also held my metadata pools (95% full). DOH! Lesson learned for prod!
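For anyone reproducing this: I believe the setting involved is the cache-tier size limit, which in Ceph is `target_max_bytes` on the cache pool, where 0 means "no limit" and lets the tier fill the backing OSDs. A sketch (the pool name `cache-pool` is just a placeholder):

```shell
# target_max_bytes = 0 disables the size cap on a cache tier,
# which is effectively the mistake described above.
ceph osd pool set cache-pool target_max_bytes 0

# Safer: cap the tier well below the SSD capacity, e.g. ~200 GB.
ceph osd pool set cache-pool target_max_bytes 200000000000

# Verify the current value.
ceph osd pool get cache-pool target_max_bytes
```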
This of course broke things. On top of that, one of the nodes with SSDs died due to an onboard NIC issue (I didn't know this yet, as it showed healthy), but I still had my other node with a valid, healthy copy (2x replication).
The cluster was almost completely unresponsive to commands: rados df would just hang, and so would cache eviction.
I did a:
cephfs-data-scan scan_links
After that, rados df started working, but evict still didn't. At this point I figured out that one node was having issues and shut it down (I found errors in its logs). As soon as I did, the eviction started going.
YAY! Plenty of free space.
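For reference, the eviction I was running was along these lines (pool name hypothetical; the try-flush variant skips objects that are in use instead of blocking on them):

```shell
# Flush dirty objects and evict clean ones from the cache tier.
# cache-try-flush-evict-all is the non-blocking variant;
# cache-flush-evict-all will block on in-use objects.
rados -p cache-pool cache-try-flush-evict-all
rados -p cache-pool cache-flush-evict-all
```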
All but one of my MDSs were down, however. Restarting the nodes made no difference, and I could not mount CephFS. A service would start and then crash several minutes later...
I tried redeploying them, no change. I even tried completely removing the MDSs and redeploying them, no change. I tried redeploying them under a different name, no change. The real problem is that there is nothing in the logs explaining why the service is crashing, other than "laggy or crashed"...
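One thing that might surface more detail (a sketch; log paths vary by deployment method) is cranking up MDS debug logging before the next crash:

```shell
# Raise MDS log verbosity cluster-wide (revert afterwards).
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Reproduce the crash, then inspect the MDS log on the host, e.g.
# /var/log/ceph/ceph-mds.<name>.log (or under /var/log/ceph/<fsid>/
# for cephadm deployments).

# Put logging back to normal.
ceph config set mds debug_mds 1
ceph config set mds debug_ms 0
```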
Sample log here:
https://pastebin.com/TpMAcuHF
I then tried:
ceph mds fail
And:
ceph fs reset
No change. I could delete the CephFS filesystem, but I am not sure that would fix it, and I would also like to learn how to recover from this.
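In case it helps anyone following along, the documented CephFS disaster-recovery sequence I was working from looks roughly like this (a hedged sketch, assuming the filesystem is named "cephfs" and has a single rank 0; exact flags vary by Ceph release, and the journal should be backed up before anything destructive):

```shell
# Back up the MDS journal before touching it.
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Recover what dentries we can from the journal into the metadata pool.
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# Wipe the (possibly corrupt) journal and stale session table.
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool all reset session

# Mark the rank failed and reset the filesystem map.
ceph mds fail cephfs:0
ceph fs reset cephfs --yes-i-really-mean-it
```

I'd only run the reset steps on a throwaway cluster like this one; on prod they can discard metadata.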
All pools are healthy, and again, there are no errors that I can see in the logs, so IDK where to go from here.
Thoughts? Again, this is for learning, so destructive testing is totally okay!