Bug #17545
Status: Closed
Data corruption using RBD with caching enabled
Description
This was reported on launchpad, but I think it's better suited to be reported here: https://bugs.launchpad.net/mos/+bug/1627775
The situation is that when running Windows on top of RBD with caching enabled, Windows 2012R2 complains about page corruptions.
Tested with both Firefly and Hammer; it only happens on RBD-backed volumes with caching enabled. When the writeback cache is disabled, the problem does NOT occur.
The issue is not reproducible on LVM/file-based storage.
Steps to reproduce: run SQL Server on Windows 2012R2, or run SQLIOSim (a stress-test utility that emulates SQL Server I/O patterns)
Expected results: no errors
Actual result:
Expected FileId: 0x0
Received FileId: 0x0
Expected PageId: 0xCB19C
Received PageId: 0xCB19A (does not match expected)
Received CheckSum: 0x9F444071
Calculated CheckSum: 0x89603EC9 (does not match expected)
Received Buffer Length: 0x2000
Reproducibility: consistently reproducible with SQLIOSim
As mentioned, the current workaround is to disable RBD caching, but that completely kills the performance of the system.
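For reference, the workaround described above amounts to turning off the librbd writeback cache on the client side. A minimal ceph.conf fragment is sketched below; this assumes the standard `rbd cache` client option, and note that QEMU/libvirt can also override caching behavior through the disk's `cache=` setting, so both layers may need checking:

```ini
[client]
# Disable the librbd writeback cache entirely (the workaround above).
# Guests must be restarted (or live-migrated) for this to take effect.
rbd cache = false
```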
The issue has been reproduced using OpenStack on Ubuntu 12.04 and 14.04, and also on Proxmox. This points towards an RBD issue rather than a Qemu issue.
We still have to test this with the Jewel client (librbd) on these systems, but so far Firefly and Hammer show the same result.
Updated by Wido den Hollander over 7 years ago
- Release set to firefly
- Release set to hammer
Updated by Wido den Hollander over 7 years ago
Seems like it has been fixed by #16002
Tests have been running with that fix applied on a Hammer client, and after 24 hours the issue has not reappeared.
Updated by Greg Farnum over 7 years ago
- Is duplicate of Backport #16546: hammer: ObjectCacher doesn't correctly handle read replies on split BufferHeads added