Bug #6141

OSDs crash on recovery

Added by Niklas Goerke over 10 years ago. Updated over 9 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After (mistakenly) executing "echo 2 > /proc/sys/vm/drop_caches" instead of "echo 1 > /proc/sys/vm/drop_caches" to clear filesystem caches for performance testing, my (test) cluster crashed. I finally got it back online about 24 hours later, after discovering that my machines did not have enough PIDs for the 1.4 million threads that Ceph spawned (of which 33k seem to still be running).
(No real Ceph problem up to this point.)
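For reference, the drop_caches semantics from the kernel documentation, plus a sketch of checking and raising the PID limit I ran into (the limit value below is only an illustration, not a recommendation):

    # drop_caches values:
    #   1 - free the page cache only (what was intended)
    #   2 - free reclaimable slab objects, i.e. dentries and inodes (what was run)
    #   3 - free both
    sync                               # flush dirty pages first
    echo 1 > /proc/sys/vm/drop_caches

    # Check and (temporarily) raise the maximum number of PIDs/threads:
    cat /proc/sys/kernel/pid_max
    sysctl -w kernel.pid_max=4194303   # illustrative value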

But now my OSDs start failing one after another (see log files attached). Of my 180 OSDs, roughly one fails every 10 minutes. A failed OSD won't come back online on its own, but it will when started manually, as shown below.
The machine from which this log file and objdump were taken hosts 15 OSDs, of which no. 160 was down when the objdump was created.
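For completeness, this is how I bring a failed OSD back by hand; a sketch assuming the sysvinit scripts shipped with this release, using osd.160 from above:

    # On the host that carries the OSD (sysvinit):
    service ceph start osd.160
    # equivalently:
    /etc/init.d/ceph start osd.160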

ceph-osd.160.log - osd logfile (3.15 MB) Niklas Goerke, 08/28/2013 09:39 AM

objdump - #objdump -rdS /usr/bin/ceph-osd >objdump (66.3 MB) Niklas Goerke, 08/28/2013 09:39 AM

ceph-osd.120.log.1 - OSD that crashed initially - NOT directly bug related! (2.63 MB) Niklas Goerke, 08/28/2013 10:16 AM

ceph-osd.160.log.1 - OSD that did not crash initially - NOT directly bug related! (5.89 MB) Niklas Goerke, 08/28/2013 10:16 AM

History

#1 Updated by Greg Farnum over 10 years ago

What was happening on your cluster at the time you dropped the caches? There are internal and external limits well below a million threads, although 33k across 15 OSDs is possible due to our network architecture. And dropping the dentry cache should definitely do nasty things to performance (especially under load), but I can't think of how that would inflate the thread count by much of anything.
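A quick way to check the per-daemon thread counts being discussed here (plain procps, nothing Ceph-specific; a sketch, not a diagnostic procedure):

    # Threads per ceph-osd process (NLWP = number of lightweight processes):
    ps -C ceph-osd -o pid=,nlwp=
    # Total threads across all OSD daemons on the host:
    ps -C ceph-osd -o nlwp= | awk '{s+=$1} END {print s}'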

#2 Updated by Niklas Goerke over 10 years ago

Dropping the caches itself was not the problem. Freeing dentries and inodes took about 30 minutes, and I guess Ceph was not able to access its files during that time, so it ran into a timeout and crashed. I did not record any load data, though.
But this is only the story of how my cluster came to crash in the first place. I don't want to blame Ceph for crashing that time. I'll still attach the log files of that crash, though.
The log files attached are NOT (directly) related to the bug.

The bug is that my OSDs now keep crashing, even though there is no unusual activity on my hosts (apart from the recovering Ceph cluster).

#3 Updated by Ian Colle over 10 years ago

  • Assignee set to Greg Farnum

#4 Updated by Greg Farnum almost 10 years ago

  • Assignee deleted (Greg Farnum)

#5 Updated by Samuel Just over 9 years ago

  • Status changed from New to Can't reproduce
