Primary OSD crash caused corrupted object and further crashes during backfill after successful write
We ran into an issue on our EC, object storage, Nautilus cluster just before Christmas, that although we 'resolved', I thought the behaviour was odd enough to warrant a bug report.
A filestore OSD (14.2.9) crashed with a very sick disk (thousands of pending sectors). One of the last things it was doing was a write for a RGW multipart upload. The RGW reported the upload as successful to the client. I've attached the log of the OSD (ceph-osd.1466.log.1.gz). The log has references to the specific object write action and I/O errors just before the crash, but the client did receive a success for this part of the multipart upload.
For reference, the PG was 11.214 (EC 8+3 pool, with a mixture of Filestore and Bluestore OSDs)
The object was 962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201/in5/exp_4-05-737/processed/spe/sqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22
When the down out interval was hit for the old primary, the backfill started onto the new primary. When the backfill reached the object in question, the acting primary crashed. This continued until systemd stopped restarting the OSD, and then the next acting primary restarted the backfill, and then crashed at the same point, eventually bringing the PG down. The crash looked something like this:
recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false) obc refcount=4 state=IDLE waiting_on_pushes= extent_requested=0,0)
get_hash_info: Getting attr on 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: not in cache 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: found on disk, size 524288
get_hash_info: Mismatch of total_chunk_size 0
2020-12-10 15:14:16.136 7fc0a1575700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fc0a1575700 time 2020-12-10 15:14:16.132060
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: 585: FAILED ceph_assert(op.hinfo)
(I have the backfill crash with debug_osd set to 20 if it could be used)
Setting nobackfill and restarting the crashed OSDs allowed the PG to remain up. Trying to read the object with a rados get caused the following message to appear in the logs of all OSDs with data shards in the set.
2020-12-10 23:22:31.840 7fc7161e3700 -1 log_channel(cluster) log [ERR] : Corruption
is missing hash_info
Which indicated to me that the object was corrupt across all OSDs in the PG.
After trying a scrub and a PG repair (which did not run, presumably because the PG was degraded), I deleted the object using a rados rm, which allowed the PG to continue backfilling past the point where it was crashing, and eventually came back to active+clean.
Thankfully, the user still had the file elsewhere, so we didn't lose data, but given the fact they had received no indication that the write had failed, it could have been much worse!
I reported the issue on the mailing list, which has some more info:
- the original crashing primary with an unhealthy disk managed to corrupt the object across all OSD in the set
- the corrupted object then caused otherwise healthy OSDs to crash while trying to backfill
- the S3 client was informed the write was successful, despite the object being corrupted at write.
I realise the issue is kind of convoluted, I don't have a way of reproducing it, and since I deleted the corrupt object to get the PG back into a safe state, there isn't much to poke at now, but I wanted to at least report it! I'm happy to try and provide more info if I can, just ask.
#1 Updated by Tom Byrne almost 3 years ago
We just ran into this again and had to remove the object to allow the PG to finish backfilling. The similarities between the reported issue were:
- the primary OSD was filestore
- the primary OSD had an unhealthy disk, and crashed and restarted when the corruption occurred
- the PG was in the EC RGW data pool (but a different PG)
Given this, we have started to pull out all Filestore OSDs with any signs of unhealthy disks, hopefully avoiding us running into this in the future.
#4 Updated by Mykola Golub over 2 years ago
We have a customer who experienced the same issue. In our case the hash info was corrupted only on two shards. I have no explanation about how the inconsistency was introduced but I pushed a PR  for review that allows the osd to handle it more gracefully during backfill, i.e. try to recover automatically from other shards or mark the object missing instead of crashing.
#5 Updated by Tom Byrne over 2 years ago
That's awesome, thanks. The behaviour you suggest sounds sensible to me.
Since it's been a while, I should probably update this issue - we haven't had a reoccurrence of the corruption and associated crashes during backfill since we started removing filestore OSDs as soon as the disks display any SMART errors. Given that, I'm hopeful that the root cause of the corruption for us was something to do with the 'crashing primary filestore OSD on a sad disk'. We're reasonably close to retiring the last generation of storage nodes with filestore OSDs here, so we don't have too much longer to worry about it.
#6 Updated by Mykola Golub over 2 years ago
Just for the record. In our customer case it was a mix of bluestore and filestore osds. The primary osd was the failstore, which started to fail. After it was replaced and backfilling started, the corruption of an object hinfo ("get_hash_info: Mismatch of total_chunk_size 0") was detected on two osds from the set -- one filestore and another bluestore. All other (filestore) osds had the valid hash info.
#7 Updated by Mykola Golub over 2 years ago
I think I found an issue in `ECBackend::get_hash_info` that might be responsible for introducing the inconsistency in our case. If `store->stat` returns with an error, we create an "empty" hash info, which may propagate from the primary osd during backfill to the backfilling non-primary osd, introducing the inconsistency which may cause issues later. I was able to reproduce the issue by emulating the error.
I added a commit to my PR  that fixes this.