Bug #17545
Status: Closed
Data corruption using RBD with caching enabled
Description
This was reported on launchpad, but I think it's better suited to be reported here: https://bugs.launchpad.net/mos/+bug/1627775
The situation is that when running Windows on top of RBD with caching enabled, Windows 2012R2 complains about page corruptions.
Tested with both Firefly and Hammer; it only happens on RBD-backed volumes with caching enabled. When the writeback cache is disabled, the problem does NOT occur.
The issue is not reproducible on LVM/file-based storage.
Steps to reproduce: run SQL Server on Windows 2012R2, or run SQLIOSim (a stress-test utility that emulates SQL Server I/O patterns)
Expected results: no errors
Actual result:
Expected FileId: 0x0
Received FileId: 0x0
Expected PageId: 0xCB19C
Received PageId: 0xCB19A (does not match expected)
Received CheckSum: 0x9F444071
Calculated CheckSum: 0x89603EC9 (does not match expected)
Received Buffer Length: 0x2000
Reproducibility: consistently reproducible with SQLIOSim
As mentioned, the current workaround is to disable RBD caching, but that completely kills the performance of the system.
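For reference, the workaround described above amounts to turning off the librbd writeback cache on the client side. A minimal ceph.conf fragment is sketched below; this assumes the standard `rbd cache` client option, and note that QEMU/libvirt can also override caching behavior through the disk's `cache=` setting, so both layers may need checking:

```ini
[client]
# Disable the librbd writeback cache entirely (the workaround above).
# Guests must be restarted (or live-migrated) for this to take effect.
rbd cache = false
```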
The issue has been reproduced using OpenStack on Ubuntu 12.04 and 14.04, and also on Proxmox. This points towards an RBD issue rather than a Qemu issue.
We still have to test this with the Jewel client (librbd) on these systems, but so far Firefly and Hammer show the same result.
Updated by Wido den Hollander over 7 years ago
- Release set to firefly
- Release set to hammer
Updated by Wido den Hollander over 7 years ago
Seems like it has been fixed by #16002
Tests have been running with that fix applied on a Hammer client, and after 24 hours the issue has not reappeared.
Updated by Greg Farnum over 7 years ago
- Is duplicate of Backport #16546: hammer: ObjectCacher doesn't correctly handle read replies on split BufferHeads added