Bug #57082
Status: Closed
MDS services failing without error.
Description
So, just to start, this is a training cluster, so if my data is gone, no big deal.
The other night I was trying to optimize my writeback cache settings and set the cache max bytes to 0 by mistake, and it used up all my SSD capacity, which also held my metadata pools (95% full). DOH! Lesson learned for prod!
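For anyone reproducing this: I believe the setting involved is the cache-tier size limit, which in Ceph is `target_max_bytes` on the cache pool, where 0 means "no limit" and lets the tier fill the backing OSDs. A sketch (the pool name `cache-pool` is just a placeholder):

```shell
# target_max_bytes = 0 disables the size cap on a cache tier,
# which is effectively the mistake described above.
ceph osd pool set cache-pool target_max_bytes 0

# Safer: cap the tier well below the SSD capacity, e.g. ~200 GB.
ceph osd pool set cache-pool target_max_bytes 200000000000

# Verify the current value.
ceph osd pool get cache-pool target_max_bytes
```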
This of course broke things. On top of that, one of the nodes with SSDs died due to an onboard NIC issue (I didn't know this yet, as it showed healthy), but I still had my other node with a valid, healthy copy (2x replication).
The cluster was almost completely unresponsive to commands: rados df would just hang, and so would cache eviction.
I did a:
cephfs-data-scan scan_links
After that, rados df started working, but evict still didn't. At this point I figured out that one node was having issues and shut it down (I found errors in its logs). As soon as I did, the eviction started going.
YAY! Plenty of free space.
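For reference, the eviction I was running was along these lines (pool name hypothetical; the try-flush variant skips objects that are in use instead of blocking on them):

```shell
# Flush dirty objects and evict clean ones from the cache tier.
# cache-try-flush-evict-all is the non-blocking variant;
# cache-flush-evict-all will block on in-use objects.
rados -p cache-pool cache-try-flush-evict-all
rados -p cache-pool cache-flush-evict-all
```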
All but one of my MDSs were down, however. Restarting the nodes made no difference, and I could not mount CephFS. A service would start and then crash several minutes later...
I tried redeploying them, no change. I even tried completely removing the MDSs and redeploying them, no change. I tried redeploying them under a different name, no change. The real problem is that there is nothing in the logs explaining why the service is crashing, other than "laggy or crashed"...
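One thing that might surface more detail (a sketch; log paths vary by deployment method) is cranking up MDS debug logging before the next crash:

```shell
# Raise MDS log verbosity cluster-wide (revert afterwards).
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Reproduce the crash, then inspect the MDS log on the host, e.g.
# /var/log/ceph/ceph-mds.<name>.log (or under /var/log/ceph/<fsid>/
# for cephadm deployments).

# Put logging back to normal.
ceph config set mds debug_mds 1
ceph config set mds debug_ms 0
```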
Sample log here:
https://pastebin.com/TpMAcuHF
I then tried:
ceph mds fail
And:
ceph fs reset
No change. I could delete the CephFS filesystem, but I am not sure that would fix it, and I would also like to learn how to recover from this.
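In case it helps anyone following along, the documented CephFS disaster-recovery sequence I was working from looks roughly like this (a hedged sketch, assuming the filesystem is named "cephfs" and has a single rank 0; exact flags vary by Ceph release, and the journal should be backed up before anything destructive):

```shell
# Back up the MDS journal before touching it.
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin

# Recover what dentries we can from the journal into the metadata pool.
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary

# Wipe the (possibly corrupt) journal and stale session table.
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-table-tool all reset session

# Mark the rank failed and reset the filesystem map.
ceph mds fail cephfs:0
ceph fs reset cephfs --yes-i-really-mean-it
```

I'd only run the reset steps on a throwaway cluster like this one; on prod they can discard metadata.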
All pools are healthy, and again, there are no errors that I can see in the logs, so IDK where to go from here.
Thoughts? Again, this is for learning, so destructive testing is totally okay!