Bug #48959

closed

Primary OSD crash caused corrupted object and further crashes during backfill after successful write

Added by Tom Byrne over 3 years ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: Mykola Golub
Category: -
Target version: -
% Done: 100%
Source:
Tags: backport_processed
Backport: pacific,octopus,nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component (RADOS):
Pull request ID: 43239
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We ran into an issue on our EC object storage Nautilus cluster just before Christmas. Although we 'resolved' it, I thought the behaviour was odd enough to warrant a bug report.

A filestore OSD (14.2.9) crashed with a very sick disk (thousands of pending sectors). One of the last things it was doing was a write for an RGW multipart upload, and the RGW reported the upload as successful to the client. I've attached the log of the OSD (ceph-osd.1466.log.1.gz); it has references to the specific object write and the I/O errors just before the crash, but the client did receive a success for this part of the multipart upload.

For reference, the PG was 11.214 (an EC 8+3 pool with a mixture of Filestore and Bluestore OSDs).
The object was 962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201/in5/exp_4-05-737/processed/spe/sqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22

When the down-out interval was hit for the old primary, the backfill started onto the new primary. When the backfill reached the object in question, the acting primary crashed. This continued until systemd stopped restarting the OSD; the next acting primary then restarted the backfill and crashed at the same point, eventually bringing the PG down. The crash looked something like this:

recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true, error:false) obc refcount=4 state=IDLE waiting_on_pushes= extent_requested=0,0)
get_hash_info: Getting attr on 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: not in cache 11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
get_hash_info: found on disk, size 524288
get_hash_info: Mismatch of total_chunk_size 0
2020-12-10 15:14:16.136 7fc0a1575700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fc0a1575700 time 2020-12-10 15:14:16.132060
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/osd/ECBackend.cc: 585: FAILED ceph_assert(op.hinfo)

(I have a log of the backfill crash with debug_osd set to 20, if that would be useful.)

Setting nobackfill and restarting the crashed OSDs allowed the PG to remain up. Trying to read the object with a rados get caused the following message to appear in the logs of all OSDs with data shards in the set.

2020-12-10 23:22:31.840 7fc7161e3700 -1 log_channel(cluster) log [ERR] : Corruption
detected: object
11:28447b4a:::962de230-ed6c-44f2-ab02-788c52ea6a82.3210530112.122__multipart_201%2fin5%2fexp_4-05-737%2fprocessed%2fspe%2fsqw_187570.nxspe.2~bgZPo_rC64ZXJWKyTfdn4dIApqLNDPp.22:head
is missing hash_info

This indicated to me that the object was corrupt across all OSDs in the PG.

After trying a scrub and a PG repair (which did not run, presumably because the PG was degraded), I deleted the object with a rados rm. This allowed the PG to backfill past the point where it had been crashing and eventually return to active+clean.

Thankfully, the user still had the file elsewhere, so we didn't lose data, but given that they had received no indication that the write had failed, it could have been much worse!

I reported the issue on the mailing list, which has some more info:

https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HIU6F5DVANYSZSTMS67UCL3STL5HWJ6Q/

My concerns with the sequence of events are that:
  • the original crashing primary with an unhealthy disk managed to corrupt the object across all OSDs in the set
  • the corrupted object then caused otherwise healthy OSDs to crash while trying to backfill
  • the S3 client was informed the write was successful, despite the object being corrupted at write time.

I realise the issue is kind of convoluted: I don't have a way of reproducing it, and since I deleted the corrupt object to get the PG back into a safe state, there isn't much to poke at now. But I wanted to at least report it! I'm happy to try to provide more info if I can; just ask.


Files

ceph-osd.1466.log.1.gz (221 KB) - Tom Byrne, 01/21/2021 06:56 PM

Related issues 3 (0 open, 3 closed)

Copied to RADOS - Backport #52936: pacific: Primary OSD crash caused corrupted object and further crashes during backfill after successful write (Resolved, Mykola Golub)
Copied to RADOS - Backport #52937: octopus: Primary OSD crash caused corrupted object and further crashes during backfill after successful write (Rejected, Mykola Golub)
Copied to RADOS - Backport #52938: nautilus: Primary OSD crash caused corrupted object and further crashes during backfill after successful write (Rejected, Mykola Golub)
#1

Updated by Tom Byrne about 3 years ago

We just ran into this again and had to remove the object to allow the PG to finish backfilling. The similarities with the originally reported issue were:

  • the primary OSD was filestore
  • the primary OSD had an unhealthy disk, and crashed and restarted when the corruption occurred
  • the PG was in the EC RGW data pool (but a different PG)

Given this, we have started to pull out all Filestore OSDs showing any signs of unhealthy disks, in the hope of avoiding this in the future.

#2

Updated by Sage Weil about 3 years ago

  • Project changed from Ceph to RADOS
#3

Updated by Mykola Golub over 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Mykola Golub
  • Pull request ID set to 43239
#4

Updated by Mykola Golub over 2 years ago

We have a customer who experienced the same issue. In our case the hash info was corrupted only on two shards. I have no explanation of how the inconsistency was introduced, but I pushed a PR [1] for review that allows the OSD to handle it more gracefully during backfill, i.e. try to recover automatically from other shards or mark the object missing instead of crashing (see the sketch below).

[1] https://github.com/ceph/ceph/pull/43239
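
For illustration, here is a minimal, self-contained sketch of the kind of decision described above (not the actual PR code): if a shard's hash info is missing or inconsistent during backfill, rebuild it from the shards that still agree, otherwise mark the object missing instead of asserting. The 8+3 layout matches this report; names such as ShardHinfo and recover_hinfo are hypothetical stand-ins.

// Hedged sketch of the "recover or mark missing" handling, not Ceph code.
#include <cstdint>
#include <iostream>
#include <map>

using ShardId = int;

struct ShardHinfo {
  uint64_t total_chunk_size = 0;  // 0 models the corrupted/"empty" hinfo
};

enum class Outcome { Recovered, MarkedMissing };

Outcome recover_hinfo(const std::map<ShardId, ShardHinfo>& shards, unsigned k) {
  // Count shards whose hash info looks sane, grouped by chunk size.
  std::map<uint64_t, unsigned> votes;
  for (const auto& [id, h] : shards)
    if (h.total_chunk_size != 0)
      ++votes[h.total_chunk_size];

  for (const auto& [size, n] : votes) {
    if (n >= k) {
      // Enough consistent shards: the bad shard(s) can be rebuilt from them.
      std::cout << "recover hinfo from " << n << " shards (chunk size "
                << size << ")\n";
      return Outcome::Recovered;
    }
  }
  // Not enough consistent copies: surface the object as missing/unfound
  // rather than tripping an assert and crashing the OSD.
  std::cout << "mark object missing\n";
  return Outcome::MarkedMissing;
}

int main() {
  // 8+3 pool as in this report; pretend two shards lost their hash info.
  std::map<ShardId, ShardHinfo> shards;
  for (int i = 0; i < 11; ++i) shards[i] = {524288};
  shards[3] = {0};
  shards[7] = {0};
  recover_hinfo(shards, /*k=*/8);
}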

#5

Updated by Tom Byrne over 2 years ago

That's awesome, thanks. The behaviour you suggest sounds sensible to me.

Since it's been a while, I should probably update this issue: we haven't had a recurrence of the corruption and associated crashes during backfill since we started removing filestore OSDs as soon as their disks display any SMART errors. Given that, I'm hopeful that the root cause of the corruption for us was something to do with the 'crashing primary filestore OSD on a sad disk'. We're reasonably close to retiring the last generation of storage nodes with filestore OSDs here, so we don't have too much longer to worry about it.

#6

Updated by Mykola Golub over 2 years ago

Just for the record: in our customer's case it was a mix of bluestore and filestore OSDs. The primary OSD was filestore, and it started to fail. After it was replaced and backfilling started, corruption of an object's hinfo ("get_hash_info: Mismatch of total_chunk_size 0") was detected on two OSDs from the set, one filestore and one bluestore. All other (filestore) OSDs had valid hash info.

#7

Updated by Mykola Golub over 2 years ago

I think I found an issue in `ECBackend::get_hash_info` [1] that might be responsible for introducing the inconsistency in our case. If `store->stat` returns an error, we create an "empty" hash info, which may then propagate from the primary OSD during backfill to the backfilling non-primary OSD, introducing an inconsistency that can cause problems later. I was able to reproduce the issue by emulating the error (see the sketch below).

I added a commit to my PR [2] that fixes this.

[1] https://github.com/ceph/ceph/blob/be23d3cfbfe245af9e2f32a9ce1a956eee8b8d1c/src/osd/ECBackend.cc#L1828
[2] https://github.com/ceph/ceph/pull/43239
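
For illustration, a minimal self-contained model of the failure mode described in this comment (not the actual Ceph code): a per-object hash-info cache that, on a stat() error, silently registers an empty record which a later backfill could then push to other shards. Names such as stat_object, and the simplified get_hash_info signature, are hypothetical stand-ins.

// Hedged model of "stat error produces and caches an empty hash info".
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <optional>
#include <string>

struct HashInfo {
  uint64_t total_chunk_size = 0;  // 0 == "empty"/uninitialized
};

static std::map<std::string, std::shared_ptr<HashInfo>> hinfo_cache;

// Stand-in for the object-store stat(); nullopt emulates an EIO-style error
// from a failing disk.
std::optional<uint64_t> stat_object(const std::string& oid, bool fail) {
  if (fail) return std::nullopt;
  return 524288;  // on-disk shard size
}

std::shared_ptr<HashInfo> get_hash_info(const std::string& oid, bool stat_fails) {
  if (auto it = hinfo_cache.find(oid); it != hinfo_cache.end())
    return it->second;

  HashInfo hinfo;  // starts out empty
  if (auto size = stat_object(oid, stat_fails)) {
    hinfo.total_chunk_size = *size;  // normally filled from stat/xattr data
  }
  // The problem being illustrated: on a stat error we fall through and cache
  // the empty hinfo anyway. During backfill the primary can then hand this
  // empty record to the shard being backfilled, producing the
  // "Mismatch of total_chunk_size 0" inconsistency seen in the logs.
  auto ref = std::make_shared<HashInfo>(hinfo);
  hinfo_cache[oid] = ref;
  return ref;
}

int main() {
  auto h = get_hash_info("object-A", /*stat_fails=*/true);
  std::cout << "cached total_chunk_size = " << h->total_chunk_size << "\n";  // prints 0
  // The safer behaviour the fix aims for would treat the stat error as
  // "hash info unavailable" instead of caching and propagating an empty record.
}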

#8

Updated by Mykola Golub over 2 years ago

  • Backport set to pacific,octopus,nautilus
#9

Updated by Kefu Chai over 2 years ago

  • Status changed from Fix Under Review to Pending Backport
#10

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52936: pacific: Primary OSD crash caused corrupted object and further crashes during backfill after successful write added
#11

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52937: octopus: Primary OSD crash caused corrupted object and further crashes during backfill after successful write added
#12

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52938: nautilus: Primary OSD crash caused corrupted object and further crashes during backfill after successful write added
#13

Updated by Backport Bot almost 2 years ago

  • Tags set to backport_processed
#14

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100