Bug #24901
closed
Client reads fail due to bad CRC under high memory pressure on OSDs
Description
I've seen problems with read failures due to CRC mismatches on two completely independent clusters with different hardware and software.
The only thing they had in common was that they were running low on memory with Bluestore.
One cluster also triggered the very similar issue http://tracker.ceph.com/issues/22464 before, so this is probably related.
Seen on both 12.2.2 and 12.2.5 with kernel versions 4.9, 4.13, and 4.15. The issue appeared on both Debian and CentOS.
This is what it looked like to a kernel cephfs client. A librbd client just hangs when trying to read data from the affected OSD.
2018-04-06 14:14:58 XXX kernel libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58 XXX kernel libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58 XXX kernel libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58 XXX kernel libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58 XXX kernel libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
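For anyone trying to work out which OSD is affected from a client's kernel log, something like the following might help. It is only a rough sketch in Python that assumes the log format shown above. The function name `count_bad_crc` and the regular expression are made up for illustration and are not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches kernel libceph lines of the form seen in this report, e.g.
#   libceph: osd128 172.27.212.112:6864 bad crc/signature
# (pattern is an assumption based on the excerpt above)
BAD_CRC = re.compile(r"libceph: (osd\d+) (\S+) bad crc/signature")

def count_bad_crc(log_text):
    """Count 'bad crc/signature' reports per (OSD, address) pair."""
    return Counter(m.groups() for m in BAD_CRC.finditer(log_text))

# Sample input copied from the log excerpt in this report.
sample = (
    "2018-04-06 14:14:58 XXX kernel libceph: read_partial_message "
    "ffff885c8d93c900 data crc 456837530 != exp. 1757683133\n"
    "2018-04-06 14:14:58 XXX kernel libceph: osd128 172.27.212.112:6864 "
    "bad crc/signature\n"
    "2018-04-06 14:14:58 XXX kernel libceph: read_partial_message "
    "ffff885c8d93c900 data crc 456837530 != exp. 1757683133\n"
    "2018-04-06 14:14:58 XXX kernel libceph: osd128 172.27.212.112:6864 "
    "bad crc/signature\n"
)

# Print each OSD with its address and the number of bad-CRC reports.
for (osd, addr), n in count_bad_crc(sample).most_common():
    print(osd, addr, n)
```

Since the pattern is searched across the whole text, it should also work when the log lines have been joined onto a single line.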
The "fix" is to restart the affected OSD and reduce the bluestore cache size (or otherwise reduce memory usage).
Neither system had swap configured, so I would have expected an OOM killer crash.