Bug #24901

Client reads fail due to bad CRC under high memory pressure on OSDs

Added by Paul Emmerich about 1 year ago. Updated 4 months ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
Start date: 07/13/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:

Description

I've seen problems with read failures due to CRC mismatches on two completely independent clusters with different hardware and software.
The only thing they had in common was that they were running low on memory with Bluestore.
One cluster also triggered the very similar issue http://tracker.ceph.com/issues/22464 before, so this is probably related.

Seen on both 12.2.2 and 12.2.5 with kernel versions 4.9, 4.13 and 4.15; the issue appeared on both Debian and CentOS.

This is what it looked like to a kernel cephfs client. A librbd client just hangs when trying to read data from the affected OSD.

2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
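
To see which OSDs a client is complaining about, the kernel log can be scanned for these messages. This is only a minimal sketch, assuming log lines in the format shown above; the function name `count_bad_crc` and the regular expression are illustrative and not part of Ceph.

```python
import re
from collections import Counter

# Matches kernel client lines such as:
#   libceph: osd128 172.27.212.112:6864 bad crc/signature
# The pattern is an assumption based on the log excerpt in this report.
BAD_CRC_RE = re.compile(r"libceph: osd(\d+) (\S+) bad crc/signature")


def count_bad_crc(lines):
    """Count bad crc/signature reports per (osd_id, address) pair."""
    hits = Counter()
    for line in lines:
        match = BAD_CRC_RE.search(line)
        if match:
            hits[(int(match.group(1)), match.group(2))] += 1
    return hits


# Sample input copied from the log excerpt in this report.
SAMPLE = """\
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
"""

if __name__ == "__main__":
    for (osd_id, addr), n in count_bad_crc(SAMPLE.splitlines()).most_common():
        print(f"osd.{osd_id} at {addr}: {n} bad crc/signature report(s)")
```

If every report points at the same OSD, as in the excerpt, that OSD is the likely candidate for the restart described below.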

The "fix" is to restart the affected OSD and reduce the bluestore cache size (or otherwise reduce memory usage).
Neither system had swap configured, so I would have expected the OOM killer to kill the OSD instead.


Related issues

Related to bluestore - Bug #25006: bad csum during upgrade test (Can't reproduce, 07/19/2018)

History

#1 Updated by Nathan Cutler about 1 year ago

  • Related to Bug #25006: bad csum during upgrade test added

#2 Updated by Sage Weil 9 months ago

  • Status changed from New to Need More Info

#3 Updated by Josh Durgin 4 months ago

Has anyone else seen this issue? Is it still occurring in your environment, Paul?

#4 Updated by Paul Emmerich 4 months ago

The work-around for http://tracker.ceph.com/issues/22464 also fixed this.
