Bug #54547
closedDeferred writes might cause "rocksdb: Corruption: Bad table magic number"
0%
Description
Looks like under some circumstances deferred write op might persist in DB longer then allocated extents it's targeted to.
Hence this could cause data corruption if those extents are reallocated and overwritten by e.g. RocksDB/BlueFS on OSD startup. Following deferred write commit would invalidate new data in that case...
Files
Updated by Igor Fedotov about 2 years ago
Updated by Igor Fedotov about 2 years ago
- Is duplicate of Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added
Updated by Igor Fedotov about 2 years ago
- Is duplicate of deleted (Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error)
Updated by Igor Fedotov about 2 years ago
- Subject changed from Deferred writes might cause data corruption to Deferred writes might cause "rocksdb: Corruption: Bad table magic number"
Updated by Igor Fedotov about 2 years ago
- Has duplicate Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added
Updated by Igor Fedotov about 2 years ago
I managed to reproduce (to a major degree) the bug with vstart-ed cluster:
- osd_fast_shutdown = true
- rbd above ec 2+1 pool
- dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd count=5 bs=4096
kill <ceph-osd-pid>
as a result on a subsequent ceph-osd start one can see the following output (with debug-bluestore = 20)
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x42000~3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x42000 length 0x3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x51000~6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x51000 length 0x6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x5a000~18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x5a000 length 0x18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next end
022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 4 0x51000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 5 0x54000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 6 0x57000~3000
Please note extents 0x51000~3000 and 0x54000~3000 are marked as free in freelist while still being accessed by deferred write procedure.
If bluefs allocates these extents on startup - original DB corruption would occur. Which is rather a seldom case though...
Updated by Igor Fedotov about 2 years ago
a couple addendums to the previous comment:
- vstart cluster above should use spinning drives or benefit from setting bluestore debug enforce settings = hdd to get deferred writes
- it's pacific release which I've been using. Quincy one generally suffers from the same but NCB stuff might hide the allocation conflicts
Updated by Adam Kupczyk almost 2 years ago
https://github.com/ceph/ceph/pull/46856 is a consistent replicator for deferred writes corrupting RocksDB.
Updated by Adam Kupczyk almost 2 years ago
- Backport set to octopus, pacific, quincy
- Pull request ID set to 46890
Updated by Yuri Weinstein almost 2 years ago
Updated by Neha Ojha almost 2 years ago
- Status changed from Triaged to Pending Backport
- Backport changed from octopus, pacific, quincy to pacific, quincy
Updated by Backport Bot almost 2 years ago
- Copied to Backport #56668: quincy: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added
Updated by Backport Bot almost 2 years ago
- Copied to Backport #56669: pacific: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" added
Updated by Igor Fedotov almost 2 years ago
- Status changed from Pending Backport to Resolved