Bug #5723

OSD seemingly loses objects during crash

Added by Mike Lowe over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I had some VMs using the qemu-rbd driver performing a trim operation. One of my OSDs crashed and now has an inconsistent PG after restart. OSD 14 was the primary and crashed; OSD 6 is the secondary. The missing objects look like this:

2.37d osd.14 missing f30b0f7d/rb.0.105b.238e1f29.000000000ff4/head//2

However, the object appears to be on disk on both the primary and the secondary, just at different locations in the directory tree depending on the OSD:

find /data/osd.14/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.14/current/2.37d_head/DIR_D/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2

find /data/osd.6/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.6/current/2.37d_head/DIR_D/DIR_7/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2
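The two paths above differ only in how deep the object sits in the hashed directory tree. A minimal sketch of that relationship, assuming (based only on the paths shown here) that each directory level is named after one hex digit of the object hash, taken from the least significant digit upward. The helper names `object_hash` and `hash_dir_path` are hypothetical and are not part of Ceph:

```python
def object_hash(filename: str) -> str:
    """Extract the 8-digit hex hash from an on-disk object file name.

    Assumes the name__snap_HASH__pool layout seen in the listing above.
    """
    fields = filename.split("__")  # [name, "head_F30B0F7D", pool]
    return fields[-2].rsplit("_", 1)[-1].upper()


def hash_dir_path(hash_hex: str, depth: int) -> str:
    """Build the DIR_x/DIR_y/... path for a given tree depth.

    Assumption: level N uses the N-th hex digit counted from the end
    of the hash, which matches DIR_D/DIR_7 for hash F30B0F7D.
    """
    digits = hash_hex.upper()[::-1][:depth]
    return "/".join("DIR_" + d for d in digits)


name = "rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2"
print(hash_dir_path(object_hash(name), 1))  # location seen on osd.14
print(hash_dir_path(object_hash(name), 2))  # location seen on osd.6
```

Under this assumption, osd.14 holds the object one level up (depth 1) while osd.6 holds it at depth 2, which fits an index that treats DIR_D/DIR_7 as the object's home and therefore does not find the copy left in DIR_D.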

Associated revisions

Revision 0dc3efdd (diff)
Added by Samuel Just over 10 years ago

HashIndex: reset attr upon split or merge completion

A replay of an in progress merge or split might make
our counts unreliable.

Fixes: #5723
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>

Revision b0535fcf (diff)
Added by Samuel Just over 10 years ago

HashIndex: reset attr upon split or merge completion

A replay of an in progress merge or split might make
our counts unreliable.

Fixes: #5723
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>
(cherry picked from commit 0dc3efdd885377a07987d868af5bb7a38245c90b)

History

#1 Updated by Mike Lowe over 10 years ago

The OSD crashed without writing any logs.

#2 Updated by Samuel Just over 10 years ago

HashIndex merge needs to verify the collection contents before merging. In the meantime, you can recover by adjusting the cephos.phash.contents attribute for DIR_D/DIR_7 from

(02:34:44 PM) jmlowe1: cephos.phash.contents
(02:34:44 PM) jmlowe1: 0000000: 0109 0000 0000 0000 0000 0000 0002 0000 ................
(02:34:44 PM) jmlowe1: 0000010: 00

to

(03:34:13 PM) sjust: 0000000: 01dc 0000 0000 0000 0002 0000 0002 0000
(03:34:13 PM) sjust: 0000010: 00
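To make the two 17-byte dumps above easier to compare, here is a small decoding sketch. The field layout is an assumption inferred from the length and values of the dumps, not something stated in this thread: a 1-byte version, a 64-bit object count, a 32-bit subdirectory count, and a 32-bit hash level, all little-endian. The function name `decode_subdir_info` is hypothetical:

```python
import struct

# Assumed layout: u8 version, u64 objs, u32 subdirs, u32 hash_level
# (little-endian, 17 bytes total). This is inferred, not confirmed.
SUBDIR_INFO = struct.Struct("<BQII")


def decode_subdir_info(raw: bytes) -> dict:
    """Decode a cephos.phash.contents value under the assumed layout."""
    version, objs, subdirs, hash_level = SUBDIR_INFO.unpack(raw)
    return {
        "version": version,
        "objs": objs,
        "subdirs": subdirs,
        "hash_level": hash_level,
    }


# Hex copied from the two dumps above (offsets and ASCII column removed).
before = bytes.fromhex("0109 0000 0000 0000 0000 0000 0002 0000 00")
after = bytes.fromhex("01dc 0000 0000 0000 0002 0000 0002 0000 00")

print(decode_subdir_info(before))
print(decode_subdir_info(after))
```

Read this way, the original value records 9 objects and 0 subdirectories at hash level 2, and the replacement records 220 objects and 2 subdirectories at the same level. That interpretation is consistent with the fix description ("our counts unreliable"), but it rests on the assumed layout.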

#3 Updated by Samuel Just over 10 years ago

  • Assignee set to Samuel Just
  • Priority changed from High to Urgent

#4 Updated by Samuel Just over 10 years ago

You also need to move all of the objects from DIR_D to DIR_D/DIR_7 again.

#5 Updated by Mike Lowe over 10 years ago

Removing cephos.phash.in_progress_op, setting cephos.phash.contents, moving the files, and restarting seems to have resolved the missing objects.

#6 Updated by Sage Weil over 10 years ago

  • Status changed from New to Fix Under Review

#7 Updated by Samuel Just over 10 years ago

  • Status changed from Fix Under Review to Pending Backport

#8 Updated by Sage Weil over 10 years ago

  • Status changed from Pending Backport to Resolved
