Project

General

Profile

Bug #54547

Deferred writes might cause "rocksdb: Corruption: Bad table magic number"

Added by Igor Fedotov 7 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
pacific, quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Looks like under some circumstances deferred write op might persist in DB longer then allocated extents it's targeted to.
Hence this could cause data corruption if those extents are reallocated and overwritten by e.g. RocksDB/BlueFS on OSD startup. Following deferred write commit would invalidate new data in that case...

squeezed_log_for_ticket.txt View - log snippet showing issue evolution with comments (9.86 KB) Igor Fedotov, 03/14/2022 12:54 PM


Related issues

Duplicated by bluestore - Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error Duplicate
Copied to bluestore - Backport #56668: quincy: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" Resolved
Copied to bluestore - Backport #56669: pacific: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" Resolved

History

#2 Updated by Igor Fedotov 7 months ago

  • Duplicates Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added

#3 Updated by Igor Fedotov 7 months ago

  • Status changed from New to Duplicate

#4 Updated by Igor Fedotov 7 months ago

  • Duplicates deleted (Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error)

#5 Updated by Igor Fedotov 7 months ago

  • Status changed from Duplicate to New

#6 Updated by Igor Fedotov 7 months ago

  • Status changed from New to Triaged

#7 Updated by Igor Fedotov 7 months ago

  • Subject changed from Deferred writes might cause data corruption to Deferred writes might cause "rocksdb: Corruption: Bad table magic number"

#8 Updated by Igor Fedotov 7 months ago

  • Duplicated by Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added

#9 Updated by Igor Fedotov 7 months ago

I managed to reproduce (to a major degree) the bug with vstart-ed cluster:
- osd_fast_shutdown = true
- rbd above ec 2+1 pool
- dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd count=5 bs=4096
kill <ceph-osd-pid>

as a result on a subsequent ceph-osd start one can see the following output (with debug-bluestore = 20)
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x42000~3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x42000 length 0x3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x51000~6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x51000 length 0x6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x5a000~18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x5a000 length 0x18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next end

022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 4 0x51000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 5 0x54000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 6 0x57000~3000

Please note extents 0x51000~3000 and 0x54000~3000 are marked as free in freelist while still being accessed by deferred write procedure.
If bluefs allocates these extents on startup - original DB corruption would occur. Which is rather a seldom case though...

#10 Updated by Igor Fedotov 7 months ago

a couple addendums to the previous comment:
- vstart cluster above should use spinning drives or benefit from setting bluestore debug enforce settings = hdd to get deferred writes
- it's pacific release which I've been using. Quincy one generally suffers from the same but NCB stuff might hide the allocation conflicts

#11 Updated by Adam Kupczyk 3 months ago

https://github.com/ceph/ceph/pull/46856 is a consistent replicator for deferred writes corrupting RocksDB.

#12 Updated by Adam Kupczyk 3 months ago

  • Backport set to octopus, pacific, quincy
  • Pull request ID set to 46890

#14 Updated by Neha Ojha 2 months ago

  • Status changed from Triaged to Pending Backport
  • Backport changed from octopus, pacific, quincy to pacific, quincy

#15 Updated by Backport Bot 2 months ago

  • Copied to Backport #56668: quincy: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added

#16 Updated by Backport Bot 2 months ago

  • Copied to Backport #56669: pacific: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added

#17 Updated by Igor Fedotov 2 months ago

  • Assignee set to Adam Kupczyk

#18 Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed

#19 Updated by Igor Fedotov about 1 month ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF