Bug #54547: Deferred writes might cause "rocksdb: Corruption: Bad table magic number" - bluestore - Ceph

Actions

Copy link

Bug #54547

closed

Deferred writes might cause "rocksdb: Corruption: Bad table magic number"

Added by Igor Fedotov about 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Adam Kupczyk

Target version:

% Done:

Source:

Tags:

backport_processed

Backport:

pacific, quincy

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v16.2.7

ceph-qa-suite:

Pull request ID:

46890

Crash signature (v1):

Crash signature (v2):

Description

Looks like under some circumstances deferred write op might persist in DB longer then allocated extents it's targeted to.
Hence this could cause data corruption if those extents are reallocated and overwritten by e.g. RocksDB/BlueFS on OSD startup. Following deferred write commit would invalidate new data in that case...

Files

squeezed_log_for_ticket.txt (9.86 KB) squeezed_log_for_ticket.txt

log snippet showing issue evolution with comments

Igor Fedotov, 03/14/2022 12:54 PM

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

File squeezed_log_for_ticket.txt squeezed_log_for_ticket.txt added

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Is duplicate of Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Status changed from New to Duplicate

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Is duplicate of deleted (Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error)

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Status changed from Duplicate to New

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Status changed from New to Triaged

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Subject changed from Deferred writes might cause data corruption to Deferred writes might cause "rocksdb: Corruption: Bad table magic number"

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

Has duplicate Bug #54409: OSD fails to start up with "rocksdb: Corruption: Bad table magic number" error added

Actions

Copy link

Updated by Igor Fedotov about 2 years ago

I managed to reproduce (to a major degree) the bug with vstart-ed cluster:
- osd_fast_shutdown = true
- rbd above ec 2+1 pool
- dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd0 count=5 bs=4096
dd if=/dev/random of=/dev/rbd count=5 bs=4096
kill <ceph-osd-pid>

as a result on a subsequent ceph-osd start one can see the following output (with debug-bluestore = 20)
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x42000~3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x42000 length 0x3000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x51000~6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x51000 length 0x6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next 0x5a000~18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 AvlAllocator init_add_free offset 0x5a000 length 0x18fffa6000
2022-03-14T18:45:58.127+0300 7fc1cf4e1e40 10 freelist enumerate_next end

022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 4 0x51000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 5 0x54000~3000
2022-03-14T18:45:58.663+0300 7fc1cf4e1e40 20 bluestore(/home/if/pacific/build/dev/osd0) _deferred_submit_unlock seq 6 0x57000~3000

Please note extents 0x51000~3000 and 0x54000~3000 are marked as free in freelist while still being accessed by deferred write procedure.
If bluefs allocates these extents on startup - original DB corruption would occur. Which is rather a seldom case though...

Actions

Copy link

#10

Updated by Igor Fedotov about 2 years ago

a couple addendums to the previous comment:
- vstart cluster above should use spinning drives or benefit from setting bluestore debug enforce settings = hdd to get deferred writes
- it's pacific release which I've been using. Quincy one generally suffers from the same but NCB stuff might hide the allocation conflicts

Actions

Copy link

#11