Bug #10017


OSD wrongly marks object as unfound if only the primary is corrupted for EC pool

Added by Guang Yang over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

50%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Recently we observed a PG stuck in recovering with one object marked as lost. The scrub log showed that only the primary chunk of the object had an inconsistency between its stored digest and its computed digest; all other chunks were good.

Looking at the implementation, I think the problem comes from the way PG repairs an object:

void PG::repair_object(
  const hobject_t& soid, ScrubMap::object *po,
  pg_shard_t bad_peer, pg_shard_t ok_peer)
{
  eversion_t v;
  bufferlist bv;
  bv.push_back(po->attrs[OI_ATTR]);
  object_info_t oi(bv);
  if (bad_peer != primary) {
    peer_missing[bad_peer].add(soid, oi.version, eversion_t());
  } else {
    // We should only be scrubbing if the PG is clean.
    assert(waiting_for_unreadable_object.empty());

    pg_log.missing_add(soid, oi.version, eversion_t());
    missing_loc.add_missing(soid, oi.version, eversion_t());
    missing_loc.add_location(soid, ok_peer);

    pg_log.set_last_requested(0);
  }
}

Here we can see that if the primary is corrupted, it calls:

  missing_loc.add_location(soid, ok_peer);

so only one shard (the authoritative shard) is added as a good location. This is fine for replication, but for EC, the recoverability check

  ECRecPred::operator()(const set<pg_shard_t> &_have)

needs enough good chunks to determine whether the object is recoverable. As a result, the check always fails for EC, because only one chunk (shard) was added.

Ceph version: 0.80.4
Platform: RHEL6


Related issues (3: 0 open, 3 closed)

Related to Ceph - Bug #8588: In the erasure-coded pool, primary OSD will crash at decoding if any data chunk's size is changed (Duplicate, 06/11/2014)

Related to Ceph - Bug #10018: OSD assertion failure if the hinfo_key xattr is not there (corrupted?) during scrubbing (Resolved, Loïc Dachary, 11/05/2014)

Has duplicate Ceph - Bug #10479: Object will be treated as "not exist" if primary shard is lost or failed to get attr when reading this object (Duplicate, 01/08/2015)

