Support #57134

closed

MDS Services failing without error.

Added by Brian Woods over 1 year ago. Updated over 1 year ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Tags:
Reviewed:
Affected Versions:
Pull request ID:

Description

Note: This is a duplicate of #57082 as I created it in the wrong location and now can't change it.

So, just to start, this is a training cluster, so if my data is gone, no big deal.

The other night I was trying to optimize my writeback cache settings and set cache_max_bytes to 0 by mistake, which let the cache consume nearly all of the SSD capacity (95%) that also holds my metadata pools. DOH! Lesson learned for prod!
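
For reference, assuming a standard writeback cache tier, the sizing limits are set per cache pool with commands along these lines; "hot-cache" is a placeholder pool name and the values are examples only:
ceph osd pool set hot-cache target_max_bytes 107374182400    # absolute size cap for the cache pool (placeholder value)
ceph osd pool set hot-cache cache_target_full_ratio 0.8      # start evicting clean objects at 80% of the target size
ceph osd pool set hot-cache cache_target_dirty_ratio 0.4     # start flushing dirty objects at 40% of the target size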

This of course broke things. On top of this, I had one of the nodes with SSDs die due to an onboard NIC issue (I didn't know this yet, as it showed healthy), but I still had my other node with a valid copy healthy (2x replication).

The cluster was almost completely unresponsive to commands; rados df would just hang, and so would cache eviction.
I did a:
cephfs-data-scan scan_links
After that, rados df started working, but eviction still would not. At this point I figured out that one node was having issues (I found errors in its log) and shut it down. As soon as I did, the eviction started going.
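
For context, by "eviction" I mean the standard rados cache flush/evict calls, something like the following; again, "hot-cache" is a placeholder pool name:
rados -p hot-cache cache-try-flush-evict-all    # flush/evict whatever can be done without blocking
rados -p hot-cache cache-flush-evict-all        # blocking variant that tries to flush and evict everything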

YAY! Plenty of free space.

All but one of my MDSs was down, however. Restarting the nodes made no difference, and I could not mount CephFS. A service would start and then crash several minutes later...

I then tried redeploying them, no change. I even tried completely removing the MDSs and redeploying them, no change. I tried redeploying them with a different name, no change. The real problem is that there is nothing in the logs as to why the service is crashing, other than "laggy or crashed"...
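
One way to get more detail than "laggy or crashed" might be to raise the MDS debug levels before the next crash, something along these lines:
ceph config set mds debug_mds 20    # very verbose MDS logging
ceph config set mds debug_ms 1      # log messenger traffic, to see what IO the MDS is waiting on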

Sample log here:
https://pastebin.com/TpMAcuHF

I then tried:
ceph mds fail
And:
ceph fs reset

No change. I could delete the CephFS, but I am not sure if that would fix it, and I would also like to learn how to recover from this.
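
For reference, the full forms of those commands take a filesystem/rank and a confirmation flag, roughly:
ceph mds fail <fs_name>:<rank>                  # a GID or MDS name also works
ceph fs reset <fs_name> --yes-i-really-mean-it  # destructive reset of the FS map for single-active-MDS recovery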

All pools are healthy and, again, there are no errors that I can see in the logs, so IDK where to go from here.

Thoughts? Again, this is for learning, so destructive testing is totally okay!

Actions #1

Updated by Brian Woods over 1 year ago

I need to move on with my testing soon. If no one wants to debug this, I will be purging everything...

Actions #2

Updated by Greg Farnum over 1 year ago

  • Tracker changed from Bug to Support
  • Status changed from New to Closed

It sounds like the underlying RADOS cluster is still not healthy, and the MDS is waiting to be able to do IO. That will cause it to fail heartbeating and get failed over, as you're seeing.

We could try to be a little more informative, since any MDS will be stuck and failing over doesn't help, but that requires quite a lot of intelligence in awkward places, so it's not there now.
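
A few generic checks that usually show whether RADOS is blocking IO, for example:
ceph health detail                           # full/nearfull warnings, slow ops, inactive PGs
ceph osd df tree                             # per-OSD utilization; look for full or nearfull SSD OSDs
ceph pg dump_stuck inactive unclean stale    # PGs that cannot currently serve IO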

Actions #3

Updated by Brian Woods over 1 year ago

Greg Farnum wrote:

It sounds like the underlying RADOS cluster is still not healthy, and the MDS is waiting to be able to do IO. That will cause it to fail heartbeating and get failed over, as you're seeing.

Could this also be causing the issues in #57135? It seems like a 0 or -1 is being returned to some portion of the code, causing placement issues.

From all the checks I can think to run (I'm still new to this), it looks healthy. Are there any tests I can perform to validate that RADOS is happy and/or in a bad state?

Thanks!!!

Actions #4

Updated by Brian Woods over 1 year ago

Brian Woods wrote:

it looks healthy

To be clear, that was before attempting to create new pools. Now, newly created pools instantly have placement issues.
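
If it helps narrow this down, the placement of a new pool can be inspected with something like the following; "testpool" is a placeholder name:
ceph osd pool get testpool crush_rule    # which CRUSH rule the pool uses
ceph osd crush rule dump                 # how that rule maps to the remaining devices/hosts
ceph pg ls-by-pool testpool              # per-PG state and acting set for the new pool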
