Bug #24901

closed

Client reads fail due to bad CRC under high memory pressure on OSDs

Added by Paul Emmerich almost 6 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've seen problems with read failures due to CRC mismatches on two completely independent clusters with different hardware and software.
The only thing they had in common was that they were running low on memory with Bluestore.
One cluster also triggered the very similar issue http://tracker.ceph.com/issues/22464 before, so this is probably related.

Seen on both 12.2.2 and 12.2.5 with kernel versions 4.9, 4.13, and 4.15; the issue appeared on both Debian and CentOS.

This is what it looked like to a kernel cephfs client. A librbd client just hangs when trying to read data from the affected OSD.

2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature
2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133
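
To see which OSDs a client is complaining about, the kernel messages above can be tallied per OSD. This is only a rough sketch: the regular expression is written against the log excerpt shown here, and the function name `count_bad_crc` is made up for illustration, not part of any Ceph tooling.

```python
import re
from collections import Counter

# Matches the "bad crc/signature" lines emitted by the kernel libceph client,
# e.g. "libceph: osd128 172.27.212.112:6864 bad crc/signature".
# The pattern is an assumption derived from the excerpt above.
BAD_CRC = re.compile(r"libceph: (osd\d+) (\S+) bad crc/signature")


def count_bad_crc(lines):
    """Return a Counter mapping (osd, address) to the number of bad-CRC reports."""
    hits = Counter()
    for line in lines:
        match = BAD_CRC.search(line)
        if match:
            hits[(match.group(1), match.group(2))] += 1
    return hits


# The five lines from the excerpt above, used as sample input.
sample = [
    "2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133",
    "2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature",
    "2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133",
    "2018-04-06 14:14:58  XXX  kernel  libceph: osd128 172.27.212.112:6864 bad crc/signature",
    "2018-04-06 14:14:58  XXX  kernel  libceph: read_partial_message ffff885c8d93c900 data crc 456837530 != exp. 1757683133",
]

# Print the OSDs with the most bad-CRC reports first.
for (osd, addr), n in count_bad_crc(sample).most_common():
    print(osd, addr, n)
```

In practice the input would be the client's kernel log (for example the output of `dmesg`) rather than a hard-coded list.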

The "fix" is to restart the affected OSD and reduce the bluestore cache size (or otherwise reduce memory usage).
Neither system had swap configured, so I would have expected an OOM killer crash instead.
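
For reference, a reduced cache size might be set roughly like this in ceph.conf. The values below are illustrative assumptions, not the ones used on the affected clusters; as far as I know the Luminous defaults are 1 GiB for HDD-backed and 3 GiB for SSD-backed OSDs, and the OSD has to be restarted for the change to take effect.

```
[osd]
# Illustrative values only: 512 MiB for HDD-backed OSDs, 1 GiB for SSD-backed OSDs.
bluestore_cache_size_hdd = 536870912
bluestore_cache_size_ssd = 1073741824
```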


Related issues 1 (0 open, 1 closed)

Related to bluestore - Bug #25006: bad csum during upgrade test (Can't reproduce, 07/19/2018)

Actions #1

Updated by Nathan Cutler almost 6 years ago

  • Related to Bug #25006: bad csum during upgrade test added
Actions #2

Updated by Sage Weil over 5 years ago

  • Status changed from New to Need More Info
Actions #3

Updated by Josh Durgin almost 5 years ago

Has anyone else seen this issue? Is it still occurring in your environment, Paul?

Actions #4

Updated by Paul Emmerich almost 5 years ago

The workaround for http://tracker.ceph.com/issues/22464 also fixed this.

Actions #5

Updated by Neha Ojha over 4 years ago

  • Status changed from Need More Info to Resolved

Marking this "Resolved" since the workaround fixes this issue.
