Bug #5723

closed

OSD seemingly loses objects during crash

Added by Mike Lowe almost 11 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had some VMs using the qemu-rbd driver doing a trim operation. One of my OSDs crashed and now has an inconsistent PG after restart. OSD 14 was the primary and is the one that crashed; OSD 6 is the secondary. The missing objects look like this:

2.37d osd.14 missing f30b0f7d/rb.0.105b.238e1f29.000000000ff4/head//2

However, the object appears to be on disk on both the primary and the secondary, just in different places in the directory tree depending on the OSD:

find /data/osd.14/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.14/current/2.37d_head/DIR_D/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2

find /data/osd.6/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.6/current/2.37d_head/DIR_D/DIR_7/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2
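The two paths above are consistent with an index that nests directories by the hex digits of the object hash, taken from the least significant digit first: F30B0F7D ends in 7D, giving DIR_D at depth 1 and DIR_D/DIR_7 at depth 2. The following is a minimal sketch of that mapping, inferred only from the paths in this ticket; `hashindex_path` is a hypothetical helper name, not a Ceph function.

```python
def hashindex_path(hash_hex: str, depth: int) -> str:
    """Build the DIR_x/DIR_y/... prefix for an object hash at a given depth.

    Assumption (inferred from the paths in this ticket): each directory level
    consumes one hex digit of the hash, starting from the last digit.
    """
    nibbles = hash_hex.upper()[::-1]  # least significant hex digit first
    return "/".join(f"DIR_{n}" for n in nibbles[:depth])


print(hashindex_path("F30B0F7D", 1))  # DIR_D        (where osd.14 holds the object)
print(hashindex_path("F30B0F7D", 2))  # DIR_D/DIR_7  (where osd.6 holds the object)
```

Under that assumption, the two OSDs disagree only about how deep the directory tree for this hash prefix currently is.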

Actions #1

Updated by Mike Lowe almost 11 years ago

The OSD crashed without writing any logs.

Actions #2

Updated by Samuel Just almost 11 years ago

HashIndex merge needs to verify the collection contents before merging. In the meantime, you can recover by adjusting the cephos.phash.contents for DIR_D/DIR_7 from

(02:34:44 PM) jmlowe1: cephos.phash.contents
(02:34:44 PM) jmlowe1: 0000000: 0109 0000 0000 0000 0000 0000 0002 0000 ................
(02:34:44 PM) jmlowe1: 0000010: 00

to

(03:34:13 PM) sjust: 0000000: 01dc 0000 0000 0000 0002 0000 0002 0000
(03:34:13 PM) sjust: 0000010: 00
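To see what the two 17-byte dumps above differ in, they can be decoded field by field. The sketch below assumes the value is laid out as a 1-byte version, a 64-bit object count, a 32-bit subdirectory count, and a 32-bit hash level, all little-endian. That layout is an assumption about FileStore's HashIndex and is not stated in this ticket; the function names are hypothetical.

```python
import struct


def parse_xxd_bytes(hex_groups: str) -> bytes:
    """Turn the hex columns of an xxd dump (offsets removed) into raw bytes."""
    return bytes.fromhex(hex_groups.replace(" ", ""))


def decode_phash_contents(raw: bytes) -> dict:
    """Decode the value under the ASSUMED layout: u8 version, u64 objs,
    u32 subdirs, u32 hash_level, little-endian, no padding (17 bytes)."""
    version, objs, subdirs, hash_level = struct.unpack("<BQII", raw)
    return {
        "version": version,
        "objs": objs,
        "subdirs": subdirs,
        "hash_level": hash_level,
    }


# Hex columns copied from the two dumps in this comment.
before = parse_xxd_bytes("0109 0000 0000 0000 0000 0000 0002 0000 00")
after = parse_xxd_bytes("01dc 0000 0000 0000 0002 0000 0002 0000 00")

print(decode_phash_contents(before))
# {'version': 1, 'objs': 9, 'subdirs': 0, 'hash_level': 2}
print(decode_phash_contents(after))
# {'version': 1, 'objs': 220, 'subdirs': 2, 'hash_level': 2}
```

If the assumed layout is right, the suggested change raises the recorded object count from 9 to 220 and the subdirectory count from 0 to 2, while the hash level stays at 2, which matches a directory two levels deep such as DIR_D/DIR_7.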

Actions #3

Updated by Samuel Just almost 11 years ago

  • Assignee set to Samuel Just
  • Priority changed from High to Urgent

Actions #4

Updated by Samuel Just almost 11 years ago

You also need to move all of the objects from DIR_D to DIR_D/DIR_7 again.

Actions #5

Updated by Mike Lowe almost 11 years ago

Removing cephos.phash.in_progress_op, setting cephos.phash.contents, moving the files, and restarting seems to have resolved the missing objects.
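The file-moving step can be previewed before touching anything. The sketch below is a dry run only: it lists the regular files sitting directly in a parent directory and pairs each with its destination in a child directory, following the "move all of the objects from DIR_D to DIR_D/DIR_7" instruction literally. It does not move files or edit extended attributes; `plan_moves` is a hypothetical helper, and the demo runs against a temporary directory rather than a real OSD data path.

```python
import os
import tempfile


def plan_moves(parent_dir: str, child_name: str) -> list:
    """Return (source, destination) pairs for every regular file directly in
    parent_dir, targeting parent_dir/child_name. Nothing is moved."""
    child_dir = os.path.join(parent_dir, child_name)
    moves = []
    for name in sorted(os.listdir(parent_dir)):
        src = os.path.join(parent_dir, name)
        if os.path.isfile(src):  # skip DIR_* subdirectories
            moves.append((src, os.path.join(child_dir, name)))
    return moves


# Demo on a throwaway tree that mimics the layout from this ticket.
obj = "rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2"
with tempfile.TemporaryDirectory() as root:
    dir_d = os.path.join(root, "DIR_D")
    os.makedirs(os.path.join(dir_d, "DIR_7"))
    open(os.path.join(dir_d, obj), "w").close()
    demo_plan = [
        (os.path.relpath(src, root), os.path.relpath(dst, root))
        for src, dst in plan_moves(dir_d, "DIR_7")
    ]

for src, dst in demo_plan:
    print(f"{src} -> {dst}")
```

Reviewing the printed plan first makes it easier to confirm that only object files, and not the existing DIR_* subdirectories, would be relocated.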

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from New to Fix Under Review

Actions #7

Updated by Samuel Just almost 11 years ago

  • Status changed from Fix Under Review to Pending Backport

Actions #8

Updated by Sage Weil almost 11 years ago

  • Status changed from Pending Backport to Resolved