Bug #20863 (Closed): CRC error does not mark PG as inconsistent or queue for repair

Added by Dmitry Glushenok over 6 years ago. Updated over 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: Administration/Usability
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 5 - suggestion
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While testing bitrot detection, it was found that even when the OSD process detects a CRC mismatch and returns an error to the client, the cluster state remains HEALTH_OK.

Steps to reproduce:

  1. ceph -v
     ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
  2. cat testobject
     somedata
  3. rados --cluster mn --pool mn_test01 put testobject ./testobject
  4. ceph osd map mn_test01 testobject
     osdmap e22984 pool 'mn_test01' (16) object 'testobject' -> pg 16.98824931 (16.31) -> up ([20,44], p20) acting ([20,44], p20)
  5. systemctl stop ceph-osd@20
  6. echo CORRUPTED > /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
  7. getfattr -d -e hex var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
     # file: var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10
     user.ceph._=0x0f08ef00000004032b000000000000000a000000746573746f626a656374feffffffffffffff314982980000000000100000000000000006031c0000001000000000000000ffffffff0000000000000000ffffffffffffffff000000000100000000000000c859000000000000000000000000000002021500000008663a0600000000000100000000000000000000000900000000000000e4437f59c293ca04020215000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000034000000e4437f591b89e204e08fe5f3ffffffff
     user.ceph.snapset=0x02021900000000000000000000000100000000000000000000000000000000
     user.cephos.spill_out=0x3000
  8. systemctl start ceph-osd@20
  9. rados --cluster mn --pool mn_test01 get testobject testobject
     error getting mn_test01/testobject: (5) Input/output error
  10. grep ERR /var/log/ceph/mn-osd.20.log
      2017-07-31 18:23:48.437679 7f496d418700 -1 log_channel(cluster) log [ERR] : 16.31 full-object read crc 0x2259dfb0 != expected 0xf3e58fe0 on 16:8c924119:::testobject:head
  11. ceph -s
      ...
      health HEALTH_OK
      ...

Running a deep-scrub on the PG afterwards triggers HEALTH_ERR, and a PG repair successfully repairs the damaged file.
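
A minimal sketch of how the scrub and repair can be triggered, assuming the pg id 16.31 from the ceph osd map output above (not necessarily the exact invocations used here):

  ceph pg deep-scrub 16.31   # deep-scrub re-reads and re-checksums the objects; the PG is marked inconsistent and the cluster goes HEALTH_ERR
  ceph pg repair 16.31       # repair rewrites the damaged replica from the authoritative copy
  ceph -s                    # health returns to HEALTH_OK once the repair completes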

Other strange issues:
  • Removing the xattrs on /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 (by rewriting the file with vi) returns "error getting mn_test01/testobject: (2) No such file or directory" to the client, without any errors in the OSD log files.
  • Appending garbage to the end of /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 does not trigger a CRC checksum error (the xattr records the object size, which apparently is not checked against the real file size? see the sketch below for one way to compare the two).
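
A minimal sketch of how the size recorded in the object metadata could be compared against the on-disk file size, assuming ceph-dencoder from the same release is installed on the OSD host (the object_info_t blob is the user.ceph._ xattr shown in step 7):

  # dump the raw object_info_t blob from the user.ceph._ xattr
  getfattr --only-values -n user.ceph._ /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10 > /tmp/oi.bin
  # decode it and inspect the recorded "size" field
  ceph-dencoder type object_info_t import /tmp/oi.bin decode dump_json | grep '"size"'
  # compare with the actual on-disk file size
  stat -c %s /var/lib/ceph/osd/mn-20/current/16.31_head/testobject__head_98824931__10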

Related issues (1 total: 0 open, 1 closed)

Is duplicate of Ceph - Feature #19657: An EIO from a single device should not be a client-visible failure (Resolved, David Zafman, 04/18/2017)

#1

Updated by Greg Farnum over 6 years ago

  • Project changed from Ceph to RADOS
  • Subject changed from CRC error while reading an object does not mark PG as inconsistent to CRC error does not mark PG as inconsistent or queue for repair
  • Category changed from OSD to Administration/Usability
  • Component(RADOS) OSD added
#2

Updated by David Zafman over 6 years ago

  • Is duplicate of Feature #19657: An EIO from a single device should not be a client-visible failure. added
#3

Updated by David Zafman over 6 years ago

This will be available in Luminous; see http://tracker.ceph.com/issues/19657

#4

Updated by Greg Farnum over 6 years ago

  • Status changed from New to Duplicate
