Bug #20243

closed

Improve size scrub error handling and ignore system attrs in xattr checking

Added by David Zafman almost 7 years ago. Updated about 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
Scrub/Repair
Target version:
-
% Done:
0%
Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Something similar to the following was seen on a production system. If the object_info_t had matched on all shards, list-inconsistent-obj would have reported no errors.

shard  disk size     oi size
0          1588       1588
1          1588       1588
2          1588          0

{
    "epoch": 17,
    "inconsistents": [
        {
            "object": {
                "name": "foo",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 1
            },
            "errors": [
                "object_info_inconsistency",
                "attr_value_mismatch" 
            ],
            "union_shard_errors": [],
            "selected_object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 0,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                },
                {
                    "osd": 1,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                },
                {
                    "osd": 2,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 0 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                }
            ]
        }
    ]
}

Currently all we see is object_info_inconsistency and attr_value_mismatch, with no shard errors. Without snapshots there is no information from list-inconsistent-snapset, which includes some additional size checking.
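
To make the gap concrete, here is a minimal sketch that post-processes list-inconsistent-obj output and flags shards whose reported size disagrees with the size in their own object_info string. It assumes the size appears as " s <N> uv " inside that string, as in the report above; that layout is an observation from this one report, not a documented format. The helper names oi_size and disk_vs_oi_mismatches are hypothetical.

```python
import re

# Assumption: the object_info string carries the size as " s <N> uv ",
# as seen in the report above. This is not a documented, stable format.
OI_SIZE_RE = re.compile(r" s (\d+) uv ")


def oi_size(object_info):
    """Return the size recorded in an object_info string, or None if not found."""
    match = OI_SIZE_RE.search(object_info)
    return int(match.group(1)) if match else None


def disk_vs_oi_mismatches(report):
    """Return (object name, osd, disk size, oi size) for every shard
    whose reported size differs from the size in its own object_info."""
    found = []
    for inconsistent in report["inconsistents"]:
        name = inconsistent["object"]["name"]
        for shard in inconsistent["shards"]:
            recorded = oi_size(shard["object_info"])
            if recorded is not None and recorded != shard["size"]:
                found.append((name, shard["osd"], shard["size"], recorded))
    return found


# Trimmed copy of the report above: only the fields this sketch reads.
OI = ("0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest "
      "s {} uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])")
report = {
    "inconsistents": [
        {
            "object": {"name": "foo"},
            "shards": [
                {"osd": 0, "size": 1588, "object_info": OI.format(1588)},
                {"osd": 1, "size": 1588, "object_info": OI.format(1588)},
                {"osd": 2, "size": 1588, "object_info": OI.format(0)},
            ],
        }
    ]
}

print(disk_vs_oi_mismatches(report))  # [('foo', 2, 1588, 0)]
```

Only osd 2 is flagged, which matches the table at the top of the description.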

In be_select_auth_object we should check a shard's disk size against its oi_size. A disagreement should raise a new disk_size_shard error, which would make that shard less likely to be chosen as the authoritative one.
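
A rough sketch of the proposed behaviour, not the actual be_select_auth_object logic: each shard whose disk size differs from its own oi size gets a disk_size_shard error, and shards without that error are preferred when picking the authoritative copy. The function name select_auth_shard, the dict layout, and the "first clean shard wins" rule are all assumptions made for illustration.

```python
def select_auth_shard(shards):
    """Pick an authoritative shard, preferring shards whose disk size
    matches the size in their own object info.

    shards: list of dicts with keys 'osd', 'disk_size', 'oi_size'
            (a hypothetical layout used only for this sketch).
    Returns (auth_osd, errors_by_osd).
    """
    errors = {}
    for shard in shards:
        mismatch = shard["disk_size"] != shard["oi_size"]
        errors[shard["osd"]] = ["disk_size_shard"] if mismatch else []

    # Shards with a disk_size_shard error are only considered when
    # no clean shard exists at all.
    clean = [shard for shard in shards if not errors[shard["osd"]]]
    candidates = clean or shards
    return candidates[0]["osd"], errors


# Values from the table in the description, with the bad shard listed first
# to show that it is passed over.
shards = [
    {"osd": 2, "disk_size": 1588, "oi_size": 0},
    {"osd": 0, "disk_size": 1588, "oi_size": 1588},
    {"osd": 1, "disk_size": 1588, "oi_size": 1588},
]
auth_osd, errors = select_auth_shard(shards)
print(auth_osd, errors)
```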
We should ignore system xattrs when checking for attr_value_mismatch. We will ignore strange xattr keys and never report an attr_name_mismatch.
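
One way to picture the xattr change is a comparison that skips system keys before looking at values. The sketch below is illustrative only: the key names in SYSTEM_KEYS are placeholder assumptions rather than a list taken from the Ceph source, and user_attr_value_mismatch is a hypothetical helper.

```python
# Assumption: placeholder names standing in for system xattr keys.
# The real set of system keys is defined in the OSD code, not here.
SYSTEM_KEYS = frozenset({"_", "snapset", "hinfo_key"})


def user_attr_value_mismatch(auth_attrs, shard_attrs, system_keys=SYSTEM_KEYS):
    """Return True if any non-system key present on both sides has a
    different value. Keys found on only one side are ignored, so this
    check never reports a name mismatch."""
    for key, value in auth_attrs.items():
        if key in system_keys:
            continue  # system attrs are checked elsewhere
        if key in shard_attrs and shard_attrs[key] != value:
            return True
    return False


auth = {"_": b"object-info-A", "_user.color": b"blue"}
same_user_attrs = {"_": b"object-info-B", "_user.color": b"blue"}
diff_user_attrs = {"_": b"object-info-A", "_user.color": b"red"}

print(user_attr_value_mismatch(auth, same_user_attrs))  # False
print(user_attr_value_mismatch(auth, diff_user_attrs))  # True
```

With this filtering, a shard whose object info differs (as osd 2 does above) would no longer also be reported as attr_value_mismatch.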

Already present in the code:
We have the object error size_mismatch when different shards don't have the same disk size (maybe rename to disk_size_mismatch too?)
We have the shard error size_mismatch_oi, which like other _oi errors means the disk size doesn't match the authoritative size
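
The two existing checks can be sketched as follows, which also shows why neither fires for the report above: every shard has the same disk size, and that size equals the size in the selected object info. This is a simplified illustration under those stated definitions, not the OSD implementation; classify_sizes and its arguments are hypothetical.

```python
def classify_sizes(disk_sizes, auth_oi_size):
    """Apply the two existing size checks as described above.

    disk_sizes:   dict mapping osd id -> on-disk size of that shard
    auth_oi_size: size recorded in the selected (authoritative) object info
    Returns (object_errors, shard_errors_by_osd).
    """
    object_errors = []
    if len(set(disk_sizes.values())) > 1:
        # Shards disagree with each other about the disk size.
        object_errors.append("size_mismatch")

    shard_errors = {}
    for osd, size in disk_sizes.items():
        # A shard's disk size disagrees with the authoritative size.
        shard_errors[osd] = ["size_mismatch_oi"] if size != auth_oi_size else []
    return object_errors, shard_errors


# The case from this report: all disk sizes are 1588 and the selected
# object info also says 1588, so neither existing check reports anything.
print(classify_sizes({0: 1588, 1: 1588, 2: 1588}, 1588))

# A contrasting, made-up case where one shard is short on disk.
print(classify_sizes({0: 1588, 1: 1588, 2: 100}, 1588))
```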


Related issues 2 (0 open, 2 closed)

Related to Ceph - Feature #18836: list-inconsistent-obj should show which osd is the primary - Resolved - David Zafman - 02/06/2017
Copied to RADOS - Backport #21051: luminous: Improve size scrub error handling and ignore system attrs in xattr checking - Resolved - Abhishek Lekshmanan

Actions #1

Updated by David Zafman almost 7 years ago

  • Description updated (diff)
Actions #2

Updated by David Zafman almost 7 years ago

  • Description updated (diff)
Actions #3

Updated by David Zafman almost 7 years ago

  • Description updated (diff)
Actions #4

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category set to Scrub/Repair
  • Component(RADOS) OSD added
Actions #5

Updated by David Zafman almost 7 years ago

  • Status changed from New to Fix Under Review
Actions #6

Updated by David Zafman over 6 years ago

  • Backport set to luminous
Actions #7

Updated by David Zafman over 6 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by David Zafman over 6 years ago

  • Related to Feature #18836: list-inconsistent-obj should show which osd is the primary added
Actions #9

Updated by Nathan Cutler over 6 years ago

  • Copied to Backport #21051: luminous: Improve size scrub error handling and ignore system attrs in xattr checking added
Actions #10

Updated by David Zafman over 6 years ago

If we wanted to backport to Jewel it would be helpful to include this pull request first.

https://github.com/ceph/ceph/pull/15559

Actions #11

Updated by David Zafman about 6 years ago

  • Status changed from Pending Backport to Resolved