Bug #20863 (Closed): CRC error does not mark PG as inconsistent or queue for repair

Added by Dmitry Glushenok over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 5 - suggestion
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While testing bitrot detection, it was found that even when the OSD process detects a CRC mismatch and returns an error to the client, the cluster state remains HEALTH_OK.

Steps to reproduce:

  1. ceph -v
     ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
  2. cat testobject
     somedata
  3. rados --cluster mn --pool mn_test01 put testobject ./testobject
  4. ceph osd map mn_test01 testobject
     osdmap e22984 pool 'mn_test01' (16) object 'testobject' -> pg 16.98824931 (16.31) -> up ([20,44], p20) acting ([20,44], p20)
  5. systemctl stop ceph-osd@20
  6. echo CORRUPTED > /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
  7. getfattr -d -e hex var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
     # file: var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
     user.ceph._=0x0f08ef00000004032b000000000000000a000000746573746f626a656374feffffffffffffff314982980000000000100000000000000006031c0000001000000000000000ffffffff0000000000000000ffffffffffffffff000000000100000000000000c859000000000000000000000000000002021500000008663a0600000000000100000000000000000000000900000000000000e4437f59c293ca04020215000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000034000000e4437f591b89e204e08fe5f3ffffffff
     user.ceph.snapset=0x02021900000000000000000000000100000000000000000000000000000000
     user.cephos.spill_out=0x3000
  8. systemctl start ceph-osd@20
  9. rados --cluster mn --pool mn_test01 get testobject testobject
     error getting mn_test01/testobject: (5) Input/output error
  10. grep ERR /var/log/ceph/mn-osd.20.log
      2017-07-31 18:23:48.437679 7f496d418700 -1 log_channel(cluster) log [ERR] : 16.31 full-object read crc 0x2259dfb0 != expected 0xf3e58fe0 on 16:8c924119:::testobject:head
  11. ceph -s
      ...
      health HEALTH_OK
      ...

Running a deep-scrub on the PG afterwards triggers HEALTH_ERR, and a PG repair successfully repairs the damaged file.
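
A minimal sketch of how the scrub and repair can be triggered, assuming the pg id 16.31 from the ceph osd map output above (not necessarily the exact invocations used here):

  ceph pg deep-scrub 16.31   # deep-scrub re-reads and re-checksums the objects; the PG is marked inconsistent and the cluster goes HEALTH_ERR
  ceph pg repair 16.31       # repair rewrites the damaged replica from the authoritative copy
  ceph -s                    # health returns to HEALTH_OK once the repair completes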

Other strange issues:
  • Removing the xattrs on /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 (by rewriting the file with vi) returns "error getting mn_test01/testobject: (2) No such file or directory" to the client, without any errors in the OSD log files.
  • Appending garbage to the end of /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 does not trigger a CRC checksum error (the xattr records the object size, which apparently is not checked against the real file size? see the sketch below for one way to compare the two).
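
A minimal sketch of how the size recorded in the object metadata could be compared against the on-disk file size, assuming ceph-dencoder from the same release is installed on the OSD host (the object_info_t blob is the user.ceph._ xattr shown in step 7):

  # dump the raw object_info_t blob from the user.ceph._ xattr
  getfattr --only-values -n user.ceph._ /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 > /tmp/oi.bin
  # decode it and inspect the recorded "size" field
  ceph-dencoder type object_info_t import /tmp/oi.bin decode dump_json | grep '"size"'
  # compare with the actual on-disk file size
  stat -c %s /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10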

Related issues (1 total: 0 open, 1 closed)

Is duplicate of Ceph - Feature #19657: An EIO from a single device should not be a client-visible failure (Resolved, David Zafman, 04/18/2017)

#1

Updated by Greg Farnum over 6 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from CRC error while reading an object does not mark PG as inconsistent to CRC error does not mark PG as inconsistent or queue for repair
  • Category changed from OSD to Administration/Usability
  • Component(RADOS) OSD added
#2

Updated by David Zafman over 6 years ago

  • Is duplicate of Feature #19657: An EIO from a single device should not be a client-visible failure. added
#3

Updated by David Zafman over 6 years ago

This will be available in Luminous; see http://tracker.ceph.com/issues/19657

#4

Updated by Greg Farnum over 6 years ago

  • Status changed from New to Duplicate
