Bug #57082 (closed): MDS Services failing without error.

Added by Brian Woods over 1 year ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

So, just to start: this is a training cluster, and if my data is gone, no big deal.

The other night I was trying to optimize my writeback cache settings and set cache_max_bytes to 0 by mistake, which let the cache tier use up nearly all (95%) of the SSD capacity that also holds my metadata pools. DOH! Lesson learned for prod!
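(For reference, the cache-tier size limit is normally set per pool; a rough sketch, using a made-up cache pool name "hot-cache", would be:

ceph osd pool set hot-cache target_max_bytes 107374182400   # cap the cache tier at ~100 GiB
ceph osd pool get hot-cache target_max_bytes                 # verify the value took effect

As far as I can tell, 0 disables the byte limit entirely, which would explain the tier filling the SSDs.)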

That mistake of course broke things. On top of it, one of the nodes with SSDs died due to an onboard NIC issue (I didn't know this yet, as it showed healthy), but I still had my other node with a valid copy healthy (2x replication).

The cluster was almost completely unresponsive to commands: rados df would just hang, and so would the cache evict.
I ran:
cephfs-data-scan scan_links
After that, rados df started working, but the evict still would not. At this point I figured out that one node was having issues and shut it down (I found errors in its log). As soon as I did, the evict started going.
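(To be clear, the "evict" here is the normal cache-tier flush/evict, roughly like the following, with "hot-cache" again standing in for whatever the cache pool is actually named:
rados -p hot-cache cache-flush-evict-all   # flush dirty objects to the backing pool and evict clean ones)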

YAY! Plenty of free space.

All but one of my MDSs were down, however. Restarting the nodes made no difference and I could not mount CephFS. A service will start and then crash several minutes later...
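(For anyone trying to follow along, the MDS/daemon state can be checked with something like:
ceph fs status      # which MDS ranks are active and which standbys exist
ceph mds stat       # one-line summary of the MDS map
ceph health detail  # any MDS-related warnings in more detail)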

I tried redeploying them, no change. I even tried completely removing the MDSs and redeploying them, no change. I tried redeploying them with a different name, no change. The real problem I am having is that there is nothing in the logs as to why the service is crashing, other than “laggy or crashed”...

Sample log here:
https://pastebin.com/TpMAcuHF
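(If it would help, I can also crank up the MDS logging; on a recent release with the centralized config store that should be something like:
ceph config set mds debug_mds 20   # very verbose MDS logging, revert once done
ceph config set mds debug_ms 1     # messenger-level logging
and then grab the MDS log from /var/log/ceph/ on the node running the daemon, or from journalctl for containerized deployments.)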

I then tried:
ceph mds fail
And:
ceph fs reset

No change. I could delete the Ceph FS, but I am not sure if that would fix it, and I would also like to learn how to recover from this.
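(For completeness, the full forms of those two commands are roughly the following, with "cephfs" standing in for the actual filesystem name:
ceph mds fail cephfs:0                        # mark rank 0 failed so a standby can take over
ceph fs reset cephfs --yes-i-really-mean-it   # last-resort reset of the MDS map down to a single rank)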

All pools are healthy and, again, there are no errors that I can see in the logs, so IDK where to go from here.

Thoughts? Again, this is for learning, so destructive testing is totally okay!

#1 - Updated by Greg Farnum over 1 year ago

  • Status changed from New to Duplicate