Bug #5723
OSD seemingly loses objects during crash
Description
I had some VMs using the qemu-rbd driver performing trim operations. One of my OSDs crashed and, after restarting, now has an inconsistent PG. OSD 14 was the primary and is the one that crashed; OSD 6 is the secondary. The missing objects are reported like this:
2.37d osd.14 missing f30b0f7d/rb.0.105b.238e1f29.000000000ff4/head//2
However, the object appears to be on disk on both the primary and the secondary, just in different places in the directory tree depending on the OSD:
find /data/osd.14/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.14/current/2.37d_head/DIR_D/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2
find /data/osd.6/current/2.37d_head/ -name 'rb.0.105b.238e1f29.000000000ff4*'
/data/osd.6/current/2.37d_head/DIR_D/DIR_7/rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2
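The two locations differ only in how deep the object sits in the hashed directory tree. The following is a minimal sketch, assuming that each directory level is named after one hex digit of the object hash, taken from the least significant digit upward (which is consistent with F30B0F7D landing under DIR_D and then DIR_D/DIR_7). The helper names and the filename parsing are hypothetical and are not taken from the Ceph source.

```python
def hash_from_filename(filename):
    """Extract the hex hash from a name like <object>__<snap>_<HASH>__<pool>.

    Assumption: the last two '__' separators delimit the snap/hash field
    and the pool id, as in the filenames shown in this report.
    """
    _obj, snap_and_hash, _pool = filename.rsplit("__", 2)
    return snap_and_hash.rsplit("_", 1)[1]


def hash_path_components(hash_hex, depth):
    """Return the DIR_<digit> components for the first `depth` levels.

    Assumption: level N uses the N-th hex digit counted from the end
    of the hash.
    """
    reversed_digits = hash_hex.upper()[::-1]
    return ["DIR_" + digit for digit in reversed_digits[:depth]]


name = "rb.0.105b.238e1f29.000000000ff4__head_F30B0F7D__2"
obj_hash = hash_from_filename(name)

# Depth 1 matches where osd.14 holds the file, depth 2 where osd.6 holds it.
print("/".join(hash_path_components(obj_hash, 1)))
print("/".join(hash_path_components(obj_hash, 2)))
```

Under these assumptions, osd.14 holds the file one level shallower than osd.6 does, which is what an interrupted split or merge of DIR_D/DIR_7 would leave behind.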
Associated revisions
HashIndex: reset attr upon split or merge completion
A replay of an in progress merge or split might make
our counts unreliable.
Fixes: #5723
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
HashIndex: reset attr upon split or merge completion
A replay of an in progress merge or split might make
our counts unreliable.
Fixes: #5723
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 0dc3efdd885377a07987d868af5bb7a38245c90b)
History
#1 Updated by Mike Lowe over 10 years ago
OSD crashed without writing any logs
#2 Updated by Samuel Just over 10 years ago
HashIndex merge needs to verify the collection contents before merging. In the meantime, you can recover by changing the cephos.phash.contents attribute for DIR_D/DIR_7 from
(02:34:44 PM) jmlowe1: cephos.phash.contents
(02:34:44 PM) jmlowe1: 0000000: 0109 0000 0000 0000 0000 0000 0002 0000 ................
(02:34:44 PM) jmlowe1: 0000010: 00
to
(03:34:13 PM) sjust: 0000000: 01dc 0000 0000 0000 0002 0000 0002 0000
(03:34:13 PM) sjust: 0000010: 00
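To make the two 17-byte dumps above easier to compare, here is a minimal decoding sketch. It assumes the attribute is laid out as a 1-byte version, a little-endian 64-bit object count, a 32-bit subdirectory count, and a 32-bit hash level; that layout matches the 17-byte length of both dumps but is an assumption here, not something stated in this ticket. The function and field names are hypothetical.

```python
import struct
from collections import namedtuple

# Assumed layout: u8 version, u64 objs, u32 subdirs, u32 hash_level (little-endian).
PHASH_CONTENTS = struct.Struct("<BQII")
SubdirInfo = namedtuple("SubdirInfo", "version objs subdirs hash_level")


def decode_phash_contents(raw):
    """Decode a raw cephos.phash.contents value under the assumed layout."""
    if len(raw) != PHASH_CONTENTS.size:
        raise ValueError("expected %d bytes, got %d" % (PHASH_CONTENTS.size, len(raw)))
    return SubdirInfo(*PHASH_CONTENTS.unpack(raw))


# Byte strings transcribed from the two hexdumps above.
before = bytes.fromhex("01" "0900000000000000" "00000000" "02000000")
after = bytes.fromhex("01" "dc00000000000000" "02000000" "02000000")

print(decode_phash_contents(before))
print(decode_phash_contents(after))
```

Read this way, the change raises the recorded object count from 9 to 220 and the subdirectory count from 0 to 2, while the hash level stays at 2.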
#3 Updated by Samuel Just over 10 years ago
- Assignee set to Samuel Just
- Priority changed from High to Urgent
#4 Updated by Samuel Just over 10 years ago
You also need to move all of the objects from DIR_D to DIR_D/DIR_7 again.
#5 Updated by Mike Lowe over 10 years ago
Removing cephos.phash.in_progress_op, setting cephos.phash.contents, moving the files, and restarting seems to have resolved the missing objects.
#6 Updated by Sage Weil over 10 years ago
- Status changed from New to Fix Under Review
#7 Updated by Samuel Just over 10 years ago
- Status changed from Fix Under Review to Pending Backport
#8 Updated by Sage Weil over 10 years ago
- Status changed from Pending Backport to Resolved