Bug #63876

Updated by Samuel Just 5 months ago

Adapted from https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NZE7HNDXSHAWISDHHZQXHKH2ZG4LQ4X7/ 

 I'm seeing different results when reading files from CephFS backed by an 
 erasure-coded pool, depending on which OSDs are running, including some 
 incorrect reads with all OSDs running. I'm running Ceph 17.2.6. 

 # More detail 

 In particular, I have a relatively large backup of some files, together 
 with SHA-256 hashes of those files (which were verified when the backup 
 was created, approximately 7 months ago). Verifying the hashes now 
 produces several mismatches, in both large and small files, though 
 somewhat tilted towards larger files. 
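
 The verification is essentially a standard checksum run against the 
 manifest that was created with the backup, roughly as follows (the mount 
 path and manifest name here are illustrative, not the actual ones): 

       cd /mnt/cephfs/backups/archive 
       sha256sum -c SHA256SUMS --quiet    # prints only the failing files 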

 Investigating which PGs stored the relevant files (using 
 cephfs-data-scan pg_files) didn't show the problem to be isolated to one 
 PG, but did show several PGs that contained OSD 15 as an active member. 
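
 The PG lookup was done roughly like this (the CephFS path and PG IDs are 
 illustrative): 

       cephfs-data-scan pg_files /backups 3.1f 3.2a 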

 Taking OSD 15 offline leads to *better* reads (more files with correct 
 SHA-256 hashes), but still not entirely correct ones. Further 
 investigation implicated OSD 34 as another potential culprit; taking it 
 offline likewise yields more correct files, but again not all of them. 
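
 Taking an OSD offline for these tests was done roughly as below (assuming 
 a systemd-managed OSD; noout keeps the cluster from rebalancing while the 
 daemon is stopped): 

       ceph osd set noout 
       systemctl stop ceph-osd@15 
       # ... re-read and hash the affected files ... 
       systemctl start ceph-osd@15 
       ceph osd unset noout 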

 Bringing the stopped OSDs (15 and 34) back online results in the earlier 
 (incorrect) hashes when reading files, as might be expected, but this 
 seems to demonstrate that the correct information (or at least more 
 correct information) is still on the drives. 

 The hashes I receive for a given corrupted file are consistent from read 
 to read (including on different hosts, to avoid caching as an issue), 
 but obviously sometimes change if I take an affected OSD offline. 
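
 Checking from more than one host to rule out client-side caching was 
 essentially (host names and path are illustrative): 

       ssh client-a sha256sum /mnt/cephfs/backups/some/file 
       ssh client-b sha256sum /mnt/cephfs/backups/some/file 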

 # Recent history 

 I have Ceph configured with a deep scrub interval of approximately 30 
 days, and deep scrubs have completed regularly with no issues identified. 
 However, within the past two weeks I added two additional drives to the 
 cluster, and rebalancing took about two weeks to complete. The placement 
 groups I noticed having issues have not been deep scrubbed since the 
 rebalance completed, so it is possible something was corrupted during the 
 rebalance. 
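
 The ~30-day interval is configured along these lines (2592000 seconds is 
 30 days): 

       ceph config set osd osd_deep_scrub_interval 2592000 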

 Neither OSD 15 nor 34 is a new drive, and as far as I have experienced 
 (and Ceph's health indications have shown), all of the existing OSDs 
 have behaved correctly up to this point. 

 # Configuration 

 I created an erasure coding profile for the pool in question using the 
 following command: 

      ceph osd erasure-code-profile set erasure_k4_m2 \ 
        plugin=jerasure \ 
        k=4 m=2 \ 
        technique=blaum_roth \ 
        crush-device-class=hdd 
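
 The resulting profile can be checked with: 

       ceph osd erasure-code-profile get erasure_k4_m2 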

 And the following CRUSH rule is used for the pool: 

        rule erasure_k4_m2_hdd_rule { 
          id 3 
          type erasure 
          min_size 4 
          max_size 6 
          step take default class hdd 
          step choose indep 3 type host 
          step chooseleaf indep 2 type osd 
          step emit 
        } 
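
 (For completeness, the rule text above is what appears in the decompiled 
 CRUSH map, e.g.: 

       ceph osd getcrushmap -o /tmp/crushmap 
       crushtool -d /tmp/crushmap -o /tmp/crushmap.txt 

 ) 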

 # Questions 

 1. Does this behavior ring a bell to anyone? Is there something obvious 
 I'm missing or should do? 

 2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully 
 not hurt: I've prioritized deep scrubbing of the PGs on OSDs 15 and 34, 
 as sketched below, and will likely follow up with the rest of the pool.) 

 3. Is there a way to force "full reads" or otherwise use all of the EC 
 chunks (potentially in tandem with on-disk checksums) to identify the 
 correct data, rather than just the data chunks from the primary OSDs? 
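
 For question 2, the prioritized deep scrubs were kicked off roughly like 
 this; once a deep scrub has read and checksummed all shards, any 
 mismatches it found can be inspected per object (PG IDs are 
 illustrative): 

       # list the PGs of the EC pool that include OSD 15 (likewise for 34) 
       ceph pg ls-by-osd 15 
       # request deep scrubs of the suspect PGs 
       ceph pg deep-scrub 3.1f 
       # after the scrub completes, inspect any reported inconsistencies 
       rados list-inconsistent-obj 3.1f --format=json-pretty 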
