Bug #6141 (closed): OSDs crash on recovery

Added by Niklas Goerke over 10 years ago. Updated almost 10 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Description

After (mistakenly) executing "echo 2 > /proc/sys/vm/drop_caches" instead of "echo 1 > /proc/sys/vm/drop_caches" to clear filesystem caches for performance testing, my (test) cluster crashed. I finally got it back online about 24 hours later, after discovering that my machines did not have enough PIDs for the 1.4 million threads that Ceph spawned (of which 33k still seem to be running).
(No real Ceph problem until now.)
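
For context, the drop_caches values have documented kernel semantics, and the thread explosion ran into the kernel's PID limit. The following is a generic illustration of both; the pid_max value shown is an example, not something taken from this report:

    # drop_caches semantics (see the kernel's sysctl/vm documentation):
    echo 1 > /proc/sys/vm/drop_caches   # free the page cache only
    echo 2 > /proc/sys/vm/drop_caches   # free reclaimable slab objects (dentries and inodes)
    echo 3 > /proc/sys/vm/drop_caches   # free both

    # Inspect and, if needed, raise the PID limit that ~1.4M threads exhausted:
    cat /proc/sys/kernel/pid_max        # commonly defaults to 32768
    sysctl -w kernel.pid_max=4194304    # example: the 64-bit maximum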

But now my OSDs start failing one after another (see log files attached). Of my 180 OSDs, roughly one fails every 10 minutes. A failed OSD won't come back online on its own, but it will when started manually.
The machine from which this logfile and objdump were taken hosts 15 OSDs, of which no. 160 was down when the objdump was created.
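
For reference, restarting a single OSD by hand on a 2013-era (Dumpling) sysvinit deployment would look roughly like this; osd.160 is taken from the attached logs, and the exact service invocation varies by distribution:

    /etc/init.d/ceph start osd.160   # or: service ceph start osd.160
    ceph osd tree | grep osd.160     # confirm the OSD reports up/in again
    ceph -s                          # watch overall cluster health during recovery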


Files

ceph-osd.160.log (3.15 MB): OSD logfile. Niklas Goerke, 08/28/2013 09:39 AM
objdump (66.3 MB): output of "objdump -rdS /usr/bin/ceph-osd > objdump". Niklas Goerke, 08/28/2013 09:39 AM
ceph-osd.120.log.1 (2.63 MB): OSD that crashed initially; NOT directly bug-related. Niklas Goerke, 08/28/2013 10:16 AM
ceph-osd.160.log.1 (5.89 MB): OSD that did not crash initially; NOT directly bug-related. Niklas Goerke, 08/28/2013 10:16 AM