Bug #63876

rados/cephfs: apparent file corruption on cephfs on an EC pool

Added by Samuel Just 5 months ago. Updated 4 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Adapted from https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NZE7HNDXSHAWISDHHZQXHKH2ZG4LQ4X7/

I'm seeing different results from reading files, depending on which OSDs
are running, including some incorrect reads with all OSDs running, in
CephFS from a pool with erasure coding. I'm running Ceph 17.2.6.

1. More detail

In particular, I have a relatively large backup of some files, combined
with SHA-256 hashes of the files (which were verified when the backup
was created, approximately 7 months ago). Verifying these hashes
currently gives several errors, both in large and small files, but
somewhat tilted towards larger files.

Investigating which PGs stored the relevant files (using
cephfs-data-scan pg_files) didn't show the problem to be isolated to one
PG, but did show several PGs that contained OSD 15 as an active member.

Taking OSD 15 offline leads to better reads (more files with correct
SHA-256 hashes), but still not completely correct reads. Further
investigation implicated OSD 34 as another potential issue; taking it
offline as well yields more correct files, but still not all of them.

Bringing the stopped OSDs (15 and 34) back online results in the earlier
(incorrect) hashes when reading files, as might be expected, but this
seems to demonstrate that the correct information (or at least more
correct information) is still on the drives.

The hashes I receive for a given corrupted file are consistent from read
to read (including on different hosts, to avoid caching as an issue),
but obviously sometimes change if I take an affected OSD offline.

2. Recent history

I have Ceph configured with a deep scrub interval of approximately 30
days, and deep scrubs have completed regularly with no issues identified.
However, within the past two weeks I added two additional drives to the
cluster, and rebalancing took about two weeks to complete: the placement
groups I noticed having issues had not been deep scrubbed since the
rebalance completed, so it is possible something got corrupted during
the rebalance.

Neither OSD 15 nor 34 is a new drive, and as far as I have experienced
(and Ceph's health indications have shown), all of the existing OSDs
have behaved correctly up to this point.

3. Configuration

I created an erasure coding profile for the pool in question using the
following command:

ceph osd erasure-code-profile set erasure_k4_m2 \
plugin=jerasure \
k=4 m=2 \
technique=blaum_roth \
crush-device-class=hdd
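
As a side note rather than part of the original report: the profile
actually stored by the cluster can be dumped for comparison with the
command above using

ceph osd erasure-code-profile get erasure_k4_m2

which should list the plugin, k, m, technique, and crush-device-class
values.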

And the following CRUSH rule is used for the pool:

rule erasure_k4_m2_hdd_rule {
    id 3
    type erasure
    min_size 4
    max_size 6
    step take default class hdd
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}
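
As a side note rather than part of the original report: the installed
rule can be compared against the text above with

ceph osd crush rule dump erasure_k4_m2_hdd_rule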

4. Questions

1. Does this behavior ring a bell to anyone? Is there something obvious
I'm missing or should do?

2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully
not hurt: I've prioritized deep scrubbing of the PGs on OSD 15 and 34,
and will likely follow up with the rest of the pool.)

3. Is there a way to force "full reads" or otherwise to use all of the
EC chunks (potentially in tandem with on-disk checksums) to identify the
correct data, rather than a combination of the data from the primary OSDs?
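
As a side note on question 2: a deep scrub of a specific placement group
can be requested manually with, for example,

ceph pg deep-scrub 13.19

where 13.19 stands in for an actual PG id from the affected pool.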


Files

bad_read_logs.zip (14.8 KB) Logs from an attempted read of object 1000268bd02.00000000 - aschmitz, 01/03/2024 04:41 AM
notes.txt (2.99 KB) Annotated shell commands run to gather data - aschmitz, 01/03/2024 04:44 AM
1000266ce21.00000000.osd15.gz (392 KB) Laura Flores, 01/04/2024 07:54 PM
1000266ce21.00000000.osd9.gz (336 KB) Laura Flores, 01/04/2024 07:54 PM
1000266ce21.00000000.osd16.gz (373 KB) Laura Flores, 01/04/2024 07:54 PM
1000266ce21.00000000.osd22.gz (392 KB) Laura Flores, 01/04/2024 07:54 PM
1000266ce21.00000000.osd24.gz (336 KB) Laura Flores, 01/04/2024 07:54 PM
1000266ce21.00000000.osd34.gz (392 KB) Laura Flores, 01/04/2024 07:54 PM

Related issues (1 open, 0 closed)

Related to RADOS - Bug #64419: Make the blaum_roth technique experimental for ec profiles (New) - Laura Flores

Actions #1

Updated by Samuel Just 5 months ago

  • Description updated (diff)
Actions #2

Updated by Samuel Just 5 months ago

(Copied from conversation in slack ceph-devel)

I'm following up on some apparent data corruption in an EC pool (I'll thread a link to my ceph-users message with more background), which appears to be presenting in an unexpected way: from what I can tell, individual shards were corrupted before being saved, so each per-OSD checksum matches the expected value, and deep scrubbing doesn't identify an issue, but requests for the objects from the OSD trigger a checksum mismatch warning. Manual testing by disabling a targeted OSD makes some objects readable, so I suspect enough of the data is stored to recover, but in this case the automated checks aren't catching (or fixing) it.
I'd like to pull the individual shards of a relevant object from all of the acting OSDs directly (rather than just k of them), so I can pass them to ceph-erasure-code-tool and hopefully recover the data. This is a bit tedious, and I'll eventually want to write more tooling around it and dig into the root cause of the issue, but for now I'm just trying to get the data out of the pool so I feel more comfortable I'm not going to lose it. (Of course, other ways of recovering the data are probably also welcome, but this seems the most expedient off the top of my head.)
All of that is to ask: what's a relatively quick way of retrieving the EC shards for an object from the OSDs? It doesn't seem there's a way to get rados (the CLI tool), or even librados to do it. I'm about to start trying to modify Objecter to change an Op for an OSD read into a shard read (ECSubRead? I haven't gotten that far.), but I figured I'd ask here and see if anyone else has thoughts. (Or can point me towards another place to ask.) I'd sort of rather avoid modifying the osd daemon itself to do this since the pool is in use, but if there's a better path that way - or by other direct access to the backing devices - that would be good to know too.

Actions #3

Updated by Samuel Just 5 months ago

First, identify the name of the object, the pg, and the acting set osds. Stop all of the acting set OSDs. ceph-objectstore-tool can then be used to extract the shard for the object from each osd store. If you can attach those pieces to a bug, it would be helpful in identifying what's going on.
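
A rough sketch of that workflow, using osd.15 and the object name reported later in this ticket; the data path and PG shard id are placeholders, and the exact object-spec syntax should be checked against ceph-objectstore-tool --help for the installed release:

# map the object to its PG and acting set
ceph osd map cephfs_data_bulk 1000268bd02.00000000

# with osd.15 stopped, on its host, export that object's shard
# (EC PG ids carry a shard suffix such as s0, s1, ...)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-15 \
    --pgid 13.19s0 1000268bd02.00000000 get-bytes > /tmp/1000268bd02.00000000.osd15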

Actions #4

Updated by Samuel Just 5 months ago

Also, can you expand on what you mean by "Oddly, the CephFS kernel module (at least in Ubuntu 22.04) appears to ignore the OSD error and return incorrect data for the parts of the file composed of affected objects, while both the rados CLI tool and the cephfs FUSE helper treat it as an I/O error."? Normal reads in cephfs simply return the invalid data, but the rados cli tool actually returns an error? What error? Is there also an error log? Can you enable debug_osd=20 debug_bluestore=20 debug_ms=1 on the OSDs in the acting set, reproduce that read error, and attach the logs from that (hopefully short) period of time?
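
As a side note, one way to raise those levels at runtime, taking osd.15 as an example (repeat for each OSD in the acting set and revert once the logs are captured):

ceph tell osd.15 config set debug_osd 20
ceph tell osd.15 config set debug_bluestore 20
ceph tell osd.15 config set debug_ms 1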

Actions #5

Updated by aschmitz 4 months ago

Okay, I've done as directed: I identified an object that is part of a CephFS file which was corrupted (1000268bd02.00000000), and cranked up the debug logging while attempting to read it with rados get. The acting set was [15,25,24,7,36,32], with osd.15 being primary.

The cluster was not quite silent during this time, so there may be a small number of additional entries in the log, but I've isolated the tenth of a second during which the read occurred, so it should be pretty concise. (I had also performed this read shortly before taking the logs, which I suspect led to some of the cache log entries. That doesn't seem to have changed the results, but apologies for any extra noise.)

The rados get -p cephfs_data_bulk 1000268bd02.00000000 - command, with all of the OSDs online, returns "error getting cephfs_data_bulk/1000268bd02.00000000: (5) Input/output error". The primary OSD logs "log_channel(cluster) log [ERR] : full-object read crc 0x98dc9095 != expected 0x487f5378 on 13:673a48ee:::1000268bd02.00000000:head".

Those logs, along with the commands that were run and their results, are attached. I also used a mount from the CephFS kernel module to copy the file out (with corrupted data as a result, but no errors on the client side) for examination. Additionally, as directed I took the individual OSDs offline and retrieved the shards of data from each. Finally, I noticed that with osd.15 offline I was able to read the object successfully using the same rados get command, so I saved that off as well.

Everything except the logs is too big to attach here (some files are just over the 1000 KB limit, others are 4-14 MB), so I've also uploaded everything to https://github.com/aschmitz/ceph-63876.

Perhaps an academic issue, but I was unable to use ceph-erasure-code-tool to reassemble the EC shards. If I'm missing something obvious there, any pointers would be appreciated.
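
For what it's worth, and as an assumption to be verified against ceph-erasure-code-tool --help on the installed release rather than a confirmed recipe: the decode subcommand generally expects the profile as comma-separated key=value pairs, a stripe unit, the chunk indexes being supplied, and a base filename, with the individual shards named <fname>.<index>, along the lines of

ceph-erasure-code-tool decode plugin=jerasure,technique=blaum_roth,k=4,m=2 4096 0,1,2,3,4,5 1000268bd02.00000000

where 4096 is a placeholder stripe unit and the shard files would be 1000268bd02.00000000.0 through .5.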

Actions #6

Updated by aschmitz 4 months ago

By using effectively the same "take one OSD down and read again" mechanism, I identified that all of the obvious corruption in the set of data I was examining appears to have taken place on three OSDs: roughly 99.9% split about evenly between two OSDs on one server, and a tiny portion on an OSD on a different server. I'm not sure if that means anything to anyone, but I'll mention it in case it does.

Additionally, the vast majority of the stored data was fine: the affected objects were maybe 5.5 GB out of ~38 TB for this project, and there were plenty of intact objects in each PG, so it's not as if a whole PG or OSD was affected. (The pool has 256 PGs.)

I was also able to recover all of the data I was examining by taking those individual drives offline, reading the "bad" objects, and then reassembling the affected files, so I don't think there has been any data loss in the end. (However, this pool is used for other things that I have yet to examine: I would not be surprised to find further corruption in other objects.)

For timeline matters, a few notes:

  • I copied in all of the files and then a few days later generated SHA-256 hashes of all of them, from the copies on this pool. At that time, the hashes matched the data I had copied. I am very confident on this.
  • Some time later - approximately a month or two later - I shared copies of the files with another person. Their copies appear to be corrupt in the same exact way my current reads are when reading via the CephFS kernel module, from comparing hashes of several files. (Their copies were served from a mount using the kernel module, so this is not a surprise, but it appears the corruption has not increased since it happened in the first place.) I am moderately confident on this.
  • The cluster has had drives added at least once, and possibly twice since the corrupt copies above were taken. I am very confident on this, and it would have caused data to move between drives. I also suspect I upgraded at least from Pacific to Quincy during that time, but I am not confident on when that occurred. (I know I am currently on Quincy: 17.2.6.)
  • The cluster is set to perform deep scrubs approximately every two weeks, and has not reported any health issues relating to them. I am confident on this.

I'm not actually sure how the blaum_roth EC technique works, but I suppose it is possible that the data got written incorrectly in the first place, given the successful deep scrubs (assuming they actually check the hashes of the shards on disk; they appear not to reassemble objects themselves). I'm also not sure how the data got corrupted in the first place: RAM or CPU malfunctions are a possible culprit, perhaps. Adding drives may also have moved some PGs between physical servers, so it is possible that only one server corrupted data.

I have a couple overlapping questions:

  1. How can I detect corruption like this? It appears I could loop over every object in the pool and attempt to read it (a rough form of that loop is sketched after this list), which will tell me whether at least the first four active shards are accurate (or I've run into a CRC collision), but it's not clear to me that there's a way to force reading or verification of the parity shards.
  2. How can I force repair of this kind of corruption? I can manually reassemble files and write them back out (which is what I've done so far, though I haven't yet removed the originals so they can be debugged), but is there a way to trigger a repair in Ceph? It appears even a forced deep scrub of a PG doesn't detect the issue.
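
A rough form of the loop from question 1, with the pool name taken from the rados command earlier in this ticket; it only exercises the shards the primary chooses to read, so it detects this class of corruption but does not verify the remaining shards:

rados -p cephfs_data_bulk ls | while read -r obj; do
    rados -p cephfs_data_bulk get "$obj" /dev/null || echo "bad read: $obj"
done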
Actions #8

Updated by Laura Flores 3 months ago

  • Related to Bug #64419: Make the blaum_roth technique experimental for ec profiles added