Bug #10435 (Closed): ceph-osd stops with "Caught signal (Aborted)" or "osd/PG.cc: 2683: FAILED assert(values.size() == 1)"

Added by Jamin Collins over 9 years ago. Updated over 9 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While my production Ceph cluster was recovering from a power outage, a few of my OSDs started flapping and eventually went down. Previously, I've simply removed the affected OSDs entirely, re-added them fresh, and allowed the cluster to recover. However, the cluster is currently reporting a few objects as "unfound" (3/939435 unfound (0.000%)), and I'm leery of completely removing OSDs in this state as I don't want to incur any data loss.

Digging through the archives and bug reports, I've found a similar case [1] with a request for reproduction at increased logging levels. I believe I've managed to gather the requested level of detail and will attach the logs to this report.

[1] - https://www.mail-archive.com/ceph-users@lists.ceph.com/msg01034.html
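
The linked thread asks for logs gathered at raised debug levels. A minimal sketch of the kind of ceph.conf settings typically used for this (the exact values and section name are assumptions, not quoted from the thread):

# Hedged example: raise debug logging for the affected OSD before restarting it.
# The section name and levels are assumptions; adjust to the OSD in question.
[osd.6]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

Restarting the ceph-osd daemon with settings like these produces logs of the kind attached below.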


Files

ceph-osd.6.log.lzma (14.2 MB) -- attempted ceph-osd startup with debug options -- Caught signal (Aborted) -- Jamin Collins, 12/27/2014 12:30 PM
ceph-osd.11.log.lzma (13.7 MB) -- attempted ceph-osd startup with debug options -- osd/PG.cc: 2683: FAILED assert(values.size() == 1) -- Jamin Collins, 12/27/2014 12:33 PM
ceph-locate-unfound (419 Bytes) -- script used to check storage node for unfound objects -- Jamin Collins, 12/27/2014 01:17 PM
#2 - Updated by Jamin Collins over 9 years ago

Near as I can tell, all the unfound objects reside on osd.6:

$ ./ceph-locate-unfound
/var/lib/ceph/osd/ceph-6/current/3.2ba_head/DIR_A/DIR_B/DIR_2/rb.0.1da2e.238e1f29.000000000178__head_F23D22BA__3
/var/lib/ceph/osd/ceph-6/current/3.25f_head/DIR_F/DIR_5/DIR_E/rb.0.1175.2ae8944a.0000000024e0__head_B0B2CE5F__3
/var/lib/ceph/osd/ceph-6/current/3.199_head/DIR_9/DIR_9/DIR_D/rb.0.1da2e.238e1f29.0000000000b3__head_76DA7D99__3
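
The attached ceph-locate-unfound script isn't reproduced in this report; a rough sketch of what a script like it might do (an assumption, not the actual attachment) is:

#!/bin/bash
# Hedged sketch, not the attached script: enumerate the object IDs the cluster
# reports as unfound, then search the local OSD filestore for matching files.

# PGs that report unfound objects, taken from the health summary.
pgs=$(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}')

for pg in $pgs; do
    # list_missing prints the missing/unfound objects of a PG as JSON;
    # the inner "oid" fields carry names like rb.0.1da2e.238e1f29.000000000178.
    ceph pg "$pg" list_missing
done | grep -o '"oid": "[^"]*"' | cut -d'"' -f4 | sort -u |
while read -r oid; do
    [ -n "$oid" ] || continue
    # Look for matching object files under the local OSD data directories.
    find /var/lib/ceph/osd/ceph-*/current -name "${oid}__*" 2>/dev/null
done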

Is there any way to move these objects to a working OSD or get osd.6 back to a point where ceph-osd can start on it?

#3 - Updated by Jamin Collins over 9 years ago

I've removed, erased, and re-added osd.11 to the ceph cluster.
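
For reference, removing an OSD before re-adding it fresh is normally a sequence along these lines (a sketch of the standard procedure, not necessarily the exact commands used here):

# Hedged sketch of the usual OSD removal steps, using osd.11 from this comment.
# Stop the ceph-osd daemon first, then:
ceph osd out 11                 # mark the OSD out so its data is remapped
ceph osd crush remove osd.11    # remove it from the CRUSH map
ceph auth del osd.11            # delete its authentication key
ceph osd rm 11                  # remove the OSD id from the cluster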

#4 - Updated by Jamin Collins over 9 years ago

Having determined which RBD volumes these unfound OIDs belonged to, I've decided to remove osd.6, zero the drive, and re-add it.
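
The mapping from an unfound object name back to an RBD volume can be made by matching the object prefix (e.g. rb.0.1da2e.238e1f29) against each image's block_name_prefix as shown by rbd info. A minimal sketch, assuming the images live in a pool named "rbd":

# Hedged sketch: find which RBD image owns objects with a given prefix.
# The pool name "rbd" is an assumption; the prefix comes from the paths above.
prefix="rb.0.1da2e.238e1f29"
for img in $(rbd ls rbd); do
    if rbd info "rbd/$img" | grep -q "block_name_prefix: $prefix"; then
        echo "$prefix belongs to image $img"
    fi
done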

#5 - Updated by Sage Weil over 9 years ago

  • Status changed from New to Closed

In certain cases it is possible to move the file, but in general, no. We're working on a tool to move entire PGs at a time.
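
For context, PG-level export/import later became available through ceph-objectstore-tool. A hedged sketch of that workflow (the target OSD, journal paths, and export file name are assumptions; the source OSD and PG id come from this report):

# On the node holding osd.6, with the osd.6 daemon stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
    --journal-path /var/lib/ceph/osd/ceph-6/journal \
    --pgid 3.2ba --op export --file /tmp/pg.3.2ba.export

# On the node holding the target OSD (osd.2 here, hypothetical), also stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
    --journal-path /var/lib/ceph/osd/ceph-2/journal \
    --op import --file /tmp/pg.3.2ba.export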
