Bug #6141
closed
Added by Niklas Goerke over 10 years ago.
Updated almost 10 years ago.
Description
After (mistakenly) executing "echo 2 > /proc/sys/vm/drop_caches" instead of "echo 1 > /proc/sys/vm/drop_caches" to clear the filesystem caches for performance testing, my (test) cluster crashed. I finally got it back online about 24 hours later, after discovering that my machines did not have enough PIDs for the 1.4 million threads that Ceph spawned (of which 33k seem to still be running).
(No real Ceph problem up to this point.)
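For reference, the values understood by /proc/sys/vm/drop_caches are documented in the kernel's sysctl/vm documentation; a read-only sketch (the actual writes, shown as comments, require root):

```shell
#!/bin/sh
# Semantics of /proc/sys/vm/drop_caches (Documentation/sysctl/vm.txt):
#   echo 1 > /proc/sys/vm/drop_caches  -> free the page cache only (the intended command)
#   echo 2 > /proc/sys/vm/drop_caches  -> free reclaimable slab objects, i.e. dentries
#                                         and inodes (the command run by mistake)
#   echo 3 > /proc/sys/vm/drop_caches  -> free both
# Running 'sync' first flushes dirty pages so more of the page cache is reclaimable.
# Read-only check that the knob exists on this kernel:
if [ -e /proc/sys/vm/drop_caches ]; then
    echo "drop_caches available"
fi
```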
But now my OSDs are failing one after another (see the log files attached). Of my 180 OSDs, roughly one fails every 10 minutes. A failed OSD won't come back online on its own, but will when started manually.
The machine from which this log file and objdump were taken hosts 15 OSDs, of which no. 160 was down when the objdump was created.
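A likely reason the nodes ran out of PIDs: every thread consumes a PID, and kernel.pid_max defaults to 32768 on most systems, far below 1.4 million. A hedged sketch for inspecting (and, as root, raising) the relevant limits:

```shell
#!/bin/sh
# Each thread consumes a PID, so kernel.pid_max caps the total thread count;
# kernel.threads-max is a separate system-wide thread limit.
cat /proc/sys/kernel/pid_max       # current PID limit (often 32768)
cat /proc/sys/kernel/threads-max   # system-wide thread limit

# Raising the PID limit requires root; 4194304 is the kernel maximum on 64-bit:
#   sysctl -w kernel.pid_max=4194304
```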
What was happening on your cluster at the time you dropped the caches? There are internal and external limits well below a million threads, although 33k across 15 OSDs is possible due to our network architecture. And dropping the dentry cache should definitely do nasty things to performance (especially under load), but I can't think of how that would inflate the thread count by much of anything.
Dropping the caches itself was not the problem. Freeing the dentries and inodes took about 30 minutes, and I guess Ceph was unable to access its files during that time, ran into a timeout, and crashed. I did not record any load data, though.
But that is only the story that led to the crash of my cluster in the first place; I don't want to blame Ceph for crashing that time. I'll still attach the log file of that crash, though.
The log files attached are NOT (directly) related to this bug.
The bug is that my OSDs now keep crashing even though there is no unusual activity on my hosts (except for a recovering Ceph).
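For context, the timeout that marks a stalled OSD down is governed by the OSD heartbeat settings. A hedged ceph.conf sketch (option names as in the Ceph configuration reference; the values shown are the documented defaults, illustrative only):

```ini
[osd]
; How often an OSD pings its peers, in seconds (Ceph default: 6).
osd heartbeat interval = 6
; How long a peer may go unresponsive before it is reported down
; to the monitors, in seconds (Ceph default: 20).
osd heartbeat grace = 20
```

Raising the grace period can paper over slow disks or cache-starvation stalls, at the cost of slower failure detection.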
- Assignee set to Greg Farnum
- Assignee deleted (Greg Farnum)
- Status changed from New to Can't reproduce