Bug #6141
closed
OSDs crash on recovery
0%
Description
After (mistakenly) executing "echo 2 > /proc/sys/vm/drop_caches" instead of "echo 1 > /proc/sys/vm/drop_caches" to clear the filesystem caches for performance testing, my (test) cluster crashed. I finally got it back online about 24 hours later, after discovering that my machines did not have enough PIDs for the 1.4 million threads that Ceph spawned (of which 33k still seem to be running).
(No real ceph problem until now)
But now my OSDs are failing one after another (see attached log files). Of my 180 OSDs, roughly one fails every 10 minutes. A failed OSD won't come back online on its own, but it will when started manually.
The machine from which this logfile and objdump were taken hosts 15 OSDs, of which No. 160 was down when the objdump was created.
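For reference, the Linux kernel documentation describes three values for /proc/sys/vm/drop_caches: 1 frees the page cache, 2 frees reclaimable slab objects such as dentries and inodes, and 3 frees both. The following is only a minimal sketch of a guarded wrapper that makes the chosen mode explicit before writing it; the function name `drop_caches` and the `DROP_CACHES_MODES` table are hypothetical helpers, not part of Ceph or any existing tool, and writing to the real path requires root.

```python
import os

# Documented values for /proc/sys/vm/drop_caches (see the kernel's vm sysctl docs).
DROP_CACHES_MODES = {
    1: "page cache",
    2: "dentries and inodes",
    3: "page cache plus dentries and inodes",
}


def drop_caches(mode, path="/proc/sys/vm/drop_caches"):
    """Write `mode` to the drop_caches file and return a description of what it frees.

    Hypothetical helper: it rejects anything other than 1, 2 or 3 so that a
    mistyped value fails loudly instead of dropping the wrong caches.
    """
    if mode not in DROP_CACHES_MODES:
        raise ValueError(f"unsupported drop_caches mode: {mode!r}")
    # Dirty objects are not freeable, so flush them first.
    os.sync()
    with open(path, "w") as f:
        f.write(f"{mode}\n")
    return DROP_CACHES_MODES[mode]


if __name__ == "__main__":
    # Mode 1 is what the reporter intended: drop only the page cache.
    print("dropped:", drop_caches(1))
```

Spelling out the mode this way would have distinguished the intended "echo 1" from the "echo 2" that was actually run.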
Files
Updated by Greg Farnum over 10 years ago
What was happening on your cluster at the time you dropped the caches? There are internal and external limits well below a million threads, although 33k across 15 OSDs is possible due to our network architecture. And dropping the dentry cache should definitely do nasty things to performance (especially under load), but I can't think of how that would inflate the thread count by much of anything.
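One way to put numbers on the thread counts discussed here is to read them from /proc on the affected host. The sketch below sums the "Threads:" field of /proc/<pid>/status for every process whose comm is "ceph-osd" and prints it next to kernel.pid_max and kernel.threads-max. It assumes a Linux host and that the daemons run under the process name "ceph-osd"; the helper names are made up for illustration and are not part of Ceph.

```python
import os


def parse_threads(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status (0 if absent)."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return 0


def read_int(path):
    """Read a single integer sysctl value such as /proc/sys/kernel/pid_max."""
    with open(path) as f:
        return int(f.read().split()[0])


def osd_thread_counts(proc_root="/proc", name="ceph-osd"):
    """Map PID -> thread count for every process whose comm equals `name`."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() != name:
                    continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                counts[int(entry)] = parse_threads(f.read())
        except OSError:
            continue  # process exited while we were scanning
    return counts


if __name__ == "__main__":
    counts = osd_thread_counts()
    for pid, n in sorted(counts.items()):
        print(f"pid {pid}: {n} threads")
    print("total OSD threads:", sum(counts.values()))
    print("kernel.pid_max:", read_int("/proc/sys/kernel/pid_max"))
    print("kernel.threads-max:", read_int("/proc/sys/kernel/threads-max"))
```

Since every thread consumes an ID from the PID space, comparing the total against kernel.pid_max shows how close a host is to running out.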
Updated by Niklas Goerke over 10 years ago
- File ceph-osd.120.log.1 added
- File ceph-osd.160.log.1 added
Dropping the caches itself was not the problem. Freeing dentries and inodes took about 30 minutes, and I guess Ceph was not able to access its files during that time, so it ran into a timeout and crashed. I did not record any load data, though.
But this is only the story of what led to the crash of my cluster in the first place. I don't want to blame Ceph for crashing that time. I'll still attach the log file of that crash, though.
The attached logfiles are NOT (directly) related to the bug.
The bug is that my OSDs keep crashing now, while there is no unusual activity on my hosts (except for a recovering Ceph).
Updated by Samuel Just almost 10 years ago
- Status changed from New to Can't reproduce