Bug #45381 (open): unfound objects in erasure-coded CephFS

Added by Paul Emmerich almost 4 years ago. Updated almost 4 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Encountered something weird with CephFS today that shouldn't happen.

Setup:

  • Ceph 14.2.8
  • 8 OSD servers, 8 SSDs in each
  • Running CephFS with an erasure-coded data pool (the default data pool is replicated); the erasure coding profile is 5+3 (see the sketch after this list)
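
For reference, a minimal sketch (Python wrapping the ceph CLI) of how a 5+3 erasure-coded data pool is typically created and attached to an existing CephFS. The profile, pool, and filesystem names (ec53, cephfs_data_ec, cephfs) and the PG counts are placeholders, not taken from this cluster:

    import subprocess

    def ceph(*args):
        """Run a ceph CLI command and return its stdout."""
        return subprocess.run(["ceph", *args], check=True,
                              capture_output=True, text=True).stdout

    # 5 data chunks + 3 coding chunks, one shard per host on the 8 OSD servers
    ceph("osd", "erasure-code-profile", "set", "ec53", "k=5", "m=3",
         "crush-failure-domain=host")
    ceph("osd", "pool", "create", "cephfs_data_ec", "64", "64", "erasure", "ec53")
    # EC pools need overwrites enabled before CephFS can use them
    ceph("osd", "pool", "set", "cephfs_data_ec", "allow_ec_overwrites", "true")
    # attach it as an additional data pool next to the replicated default
    ceph("fs", "add_data_pool", "cephfs", "cephfs_data_ec")

Files are then placed on the EC pool via the ceph.dir.layout.pool directory xattr, while the replicated default data pool keeps the backtrace metadata.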

The cluster encountered some serious network issues: random dead links and corrupted data delivered to userspace due to a bad driver, firmware, or NIC (unclear which).

Anyway, corrupted network data passed to userspace shouldn't be a problem for anything important, as everything in Ceph is checksummed separately.
There's one small issue, though: the initial packets aren't checksummed while the connection is being established. Long story short, there's an unrelated bug that causes a Ceph client to crash when the authorizer_size field in the connection_reply packet is corrupted; there's an assert requiring it to be < 4096 bytes.

So something bad happened on the NIC, userspace got junk data, and some OSDs crashed. Sometimes a server went offline because it lost both links.

So far this was a good stress test of Ceph's self-healing capabilities, and everything kept working despite crashing OSDs and servers randomly going offline.
Great, but then something happened that shouldn't have:

rados reported 5 unfound objects in one PG. Checking details on the missing objects (see the sketch after this list) showed:

  • affected PG was in the erasure coded cephfs pool
  • all objects still had 2 shards (out of 8 shards) available
  • the 2 OSDs that still had shards didn't crash (but others from the PG did)
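
A sketch of one way to pull these details out of the cluster (Python wrapping the ceph CLI); the PG id below is a placeholder, and the JSON field names are written from memory of the Nautilus-era output, so treat them as approximate:

    import json
    import subprocess

    def ceph_json(*args):
        """Run a ceph CLI command with JSON output and parse it."""
        out = subprocess.run(["ceph", *args, "--format=json"], check=True,
                             capture_output=True, text=True).stdout
        return json.loads(out)

    pgid = "20.1f"  # hypothetical id of the affected EC PG

    # objects the PG knows about but can't find on any currently-up OSD
    missing = ceph_json("pg", pgid, "list_missing")
    for obj in missing.get("objects", []):
        name = obj.get("oid", {}).get("oid")
        print(name, "known locations:", obj.get("locations", []))

    # peering details: which OSDs/shards are up/acting; the recovery_state
    # section lists peers that might still have the unfound objects
    query = ceph_json("pg", pgid, "query")
    print("up:", query.get("up"), "acting:", query.get("acting"))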

So I checked which files were affected; it turns out all of them had already been deleted, so no data was lost.
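
For context on how the object names map back to files: CephFS data objects are named with the file's inode number in hex plus the block index, so the inode can be recovered from the object name and searched for on a mounted filesystem. A rough sketch (the object name and mount point are made-up examples):

    import subprocess

    # hypothetical unfound object; CephFS names data objects "<inode hex>.<block index hex>"
    obj_name = "10000000123.00000000"
    inode = int(obj_name.split(".")[0], 16)

    # look the inode up on a mounted CephFS (mount point is a placeholder);
    # no match usually means the file has already been deleted
    result = subprocess.run(["find", "/mnt/cephfs", "-inum", str(inode)],
                            capture_output=True, text=True)
    print(result.stdout.strip() or f"inode {inode:#x} not found (file likely deleted)")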

Looks like RADOS somehow got confused about the deletion state of these objects.

It kind of reminds me of https://tracker.ceph.com/issues/44286, but the preconditions and environment triggering this seem very different.

Actions #1

Updated by Neha Ojha almost 4 years ago

  • Status changed from New to Need More Info

Is cache tiering involved here too? Do you have any osd logs from the same time?

Actions #2

Updated by Paul Emmerich almost 4 years ago

No, luckily this setup has no cache tiering. It's a completely standard setup with replicated cephfs_metadata and cephfs_data pools plus an additional erasure-coded data pool. Sorry for the confusion.

The network issue was resolved by swapping cables, and I unfortunately have no way to reproduce this...
