Bug #45381

open

unfound objects in erasure-coded CephFS

Added by Paul Emmerich about 4 years ago. Updated almost 4 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Encountered something weird with CephFS today that shouldn't happen.

Setup:

  • Ceph 14.2.8
  • 8 OSD servers, 8 SSDs in each
  • Running CephFS with an erasure-coded data pool (the default data pool is replicated); erasure coding is 5+3 (k=5, m=3)

The cluster encountered some serious network issues: random dead links and corrupted data delivered to user space, caused by a bad driver, firmware, or NIC (unclear which).

Anyway, corrupted network data passed to userspace shouldn't be a problem for anything important, as everything in Ceph is checksummed separately.
There's one small issue: the initial packets exchanged while the connection is being established aren't checksummed. Long story short, there's an unrelated bug that causes a Ceph client to crash when the authorizer_size field in the connection_reply packet is corrupted; there's an assert on it being < 4096 bytes.
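To illustrate the failure mode, here is a minimal Python sketch of bounds-checking a length field read from unchecksummed bytes. This is not Ceph's messenger code: the 4-byte little-endian layout, the offset, and the names read_authorizer_size and CorruptReplyError are assumptions made up for the example. Only the 4096-byte bound comes from the report.

```python
import struct

# Bound taken from the report: the field is asserted to be < 4096 bytes.
MAX_AUTHORIZER_SIZE = 4096


class CorruptReplyError(Exception):
    """Hypothetical error for a reply whose length field is implausible."""


def read_authorizer_size(reply: bytes, offset: int = 0) -> int:
    # Assumed layout: an unsigned 32-bit little-endian length at `offset`.
    (size,) = struct.unpack_from("<I", reply, offset)
    if size >= MAX_AUTHORIZER_SIZE:
        # Bit flips in an unchecksummed packet can land here.
        raise CorruptReplyError(f"authorizer_size {size} is out of range")
    return size


good_reply = struct.pack("<I", 128)
corrupt_reply = bytes([0xFF, 0xFF, 0xFF, 0x7F])  # simulated junk from the NIC

print(read_authorizer_size(good_reply))
try:
    read_authorizer_size(corrupt_reply)
except CorruptReplyError as exc:
    print("rejected:", exc)
```

In the sketch the corrupted value is reported as an error; in the behaviour described above, the same condition trips an assert and the process crashes.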

So something bad happened on the NIC, userspace got junk data, some OSDs crashed. Sometimes a server went offline because it lost both links.

So far a good stress test for Ceph's self-healing capabilities, and everything worked despite crashing OSDs and servers randomly going offline.
This is great so far, but then something happened that shouldn't have:

RADOS reported 5 unfound objects in one PG. Checking details on the missing objects showed:

  • the affected PG was in the erasure-coded CephFS pool
  • all objects still had 2 shards (out of 8 shards) available
  • the 2 OSDs that still had shards didn't crash (but others from the PG did)
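For context on why 2 of 8 shards is not enough: with a 5+3 profile, reading an object requires at least 5 of its 8 shards. A small Python sketch of that arithmetic follows. The function name can_reconstruct is made up for the example, and the sketch assumes the usual k+m reading of "5+3" (k=5 data shards, m=3 coding shards).

```python
def can_reconstruct(k: int, m: int, available_shards: int) -> bool:
    """Return True if an object in a k+m erasure-coded pool is still readable.

    Any k of the k+m shards are sufficient to rebuild the object.
    """
    if not 0 <= available_shards <= k + m:
        raise ValueError("available_shards must be between 0 and k+m")
    return available_shards >= k


K, M = 5, 3        # the 5+3 profile from the setup
AVAILABLE = 2      # shards still present for each unfound object

missing = K + M - AVAILABLE
print(f"missing shards: {missing}, tolerated losses: {M}")
print("reconstructable:", can_reconstruct(K, M, AVAILABLE))
```

With 6 shards missing against a tolerance of 3, the objects could not be rebuilt from the surviving shards alone.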

So I checked which files were affected; it turns out all of them had been deleted, so no data was lost.

Looks like RADOS somehow got confused about the deletion state of these objects.

It kind of reminds me of https://tracker.ceph.com/issues/44286, but the preconditions and environment triggering this seem very different.
