Bug #54547 (closed): Deferred writes might cause "rocksdb: Corruption: Bad table magic number"

Added by Igor Fedotov about 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Target version: -
% Done: 0%
Source:
Tags: backport_processed
Backport: pacific, quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

It looks like, under some circumstances, a deferred write op may persist in the DB longer than the allocated extents it targets.
Hence this could cause data corruption if those extents are reallocated and overwritten, e.g. by RocksDB/BlueFS on OSD startup. The subsequent deferred write commit would then overwrite and invalidate the new data in that case...


Files

squeezed_log_for_ticket.txt (9.86 KB): log snippet showing issue evolution with comments. Igor Fedotov, 03/14/2022 12:54 PM

Related issues (0 open, 3 closed)

Has duplicate: bluestore - Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error (Duplicate)

Copied to: bluestore - Backport #56668: quincy: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" (Resolved, Adam Kupczyk)
Copied to: bluestore - Backport #56669: pacific: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" (Resolved)
Actions #2

Updated by Igor Fedotov about 2 years ago

  • Is duplicate of Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added
Actions #3

Updated by Igor Fedotov about 2 years ago

  • Status changed from New to Duplicate
Actions #4

Updated by Igor Fedotov about 2 years ago

  • Is duplicate of deleted (Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error)
Actions #5

Updated by Igor Fedotov about 2 years ago

  • Status changed from Duplicate to New
Actions #6

Updated by Igor Fedotov about 2 years ago

  • Status changed from New to Triaged
Actions #7

Updated by Igor Fedotov about 2 years ago

  • Subject changed from Deferred writes might cause data corruption to Deferred writes might cause "rocksdb: Corruption: Bad table magic number"
Actions #8

Updated by Igor Fedotov about 2 years ago

  • Has duplicate Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added
Actions #9

Updated by Igor Fedotov about 2 years ago

I managed to reproduce the bug (to a major degree) with a vstart-ed cluster:
- osd_fast_shutdown = true
- an rbd image on top of an EC 2+1 pool
- dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
  dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
  dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
- kill <ceph-osd-pid>
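
For convenience, the steps above can be scripted roughly as follows. This is only a sketch of my setup; the pool/image names, sizes and exact vstart.sh flags are assumptions rather than part of the original report (on non-rotational test disks see also the note about deferred writes in the next comment):

# start a fresh vstart cluster with fast shutdown enabled (run from the build dir)
MON=1 OSD=3 MGR=1 ../src/vstart.sh -n -d -o 'osd_fast_shutdown = true'

# EC 2+1 data pool with overwrites, plus a replicated pool for rbd metadata
./bin/ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=osd
./bin/ceph osd pool create ecdata 32 32 erasure ec21
./bin/ceph osd pool set ecdata allow_ec_overwrites true
./bin/ceph osd pool create rbdmeta 32 32
./bin/rbd pool init rbdmeta

# rbd image whose data lands on the EC pool
./bin/rbd create --size 1G --data-pool ecdata rbdmeta/img0
sudo ./bin/rbd map rbdmeta/img0     # maps to /dev/rbd0 on a typical setup

# small writes that end up as deferred writes, then kill the OSD uncleanly
sudo dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
sudo dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
sudo dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
kill $(pgrep -f 'ceph-osd -i 0')    # <ceph-osd-pid> in the original notes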

As a result, on a subsequent ceph-osd start one can see the following output (with debug_bluestore = 20):
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x42000~3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x42000 length 0x3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x51000~6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x51000 length 0x6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x5a000~18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x5a000 length 0x18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next end

2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 4 0x51000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 5 0x54000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 6 0x57000~3000

Please note that extents 0x51000~3000 and 0x54000~3000 are marked as free in the freelist (within the 0x51000~6000 range above) while still being referenced by the deferred write procedure.
If BlueFS happens to allocate these extents on startup, the original DB corruption occurs. That is a rather rare coincidence, though...
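
To spot the conflict in a captured startup log, one can simply compare the ranges returned to the allocator against the extents still queued for deferred replay. The log file name below is just an assumption (use whatever file the OSD wrote with debug_bluestore = 20):

# extents returned to the allocator during startup
grep 'AvlAllocator init_add_free' osd.0.log
# extents still being written by deferred replay
grep '_deferred_submit_unlock' osd.0.log
# in the run above, init_add_free reports offset 0x51000 length 0x6000 as free,
# while deferred replay still targets 0x51000~3000 and 0x54000~3000 inside that range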

Actions #10

Updated by Igor Fedotov about 2 years ago

A couple of addenda to the previous comment:
- the vstart cluster above should use spinning drives, or set "bluestore debug enforce settings = hdd", to get deferred writes at all
- it is the Pacific release I have been using; Quincy generally suffers from the same issue, but the NCB stuff might hide the allocation conflicts
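
For reference, one way to combine both settings when starting the test cluster; the exact vstart.sh invocation is a sketch based on my assumptions, only the two config options themselves come from this ticket:

# from the build directory; -o appends options to the generated ceph.conf
MON=1 OSD=3 MGR=1 ../src/vstart.sh -n -d \
    -o 'osd_fast_shutdown = true' \
    -o 'bluestore debug enforce settings = hdd'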

Actions #11

Updated by Adam Kupczyk almost 2 years ago

https://github.com/ceph/ceph/pull/46856 is a consistent reproducer for deferred writes corrupting RocksDB.

Actions #12

Updated by Adam Kupczyk almost 2 years ago

  • Backport set to octopus, pacific, quincy
  • Pull request ID set to 46890
Actions #14

Updated by Neha Ojha almost 2 years ago

  • Status changed from Triaged to Pending Backport
  • Backport changed from octopus, pacific, quincy to pacific, quincy
Actions #15

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #56668: quincy: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added
Actions #16

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #56669: pacific: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added
Actions #17

Updated by Igor Fedotov almost 2 years ago

  • Assignee set to Adam Kupczyk
Actions #18

Updated by Backport Bot almost 2 years ago

  • Tags set to backport_processed
Actions #19

Updated by Igor Fedotov almost 2 years ago

  • Status changed from Pending Backport to Resolved