Bug #20243

Improve size scrub error handling and ignore system attrs in xattr checking

Added by David Zafman over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Scrub/Repair
Target version:
-
Start date:
06/09/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:

Description

Something similar to this was seen on a production system. If all the object_info_t values matched, there would be no errors from list-inconsistent-obj.

shard  disk size     oi size
0          1588       1588
1          1588       1588
2          1588          0

{
    "epoch": 17,
    "inconsistents": [
        {
            "object": {
                "name": "foo",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 1
            },
            "errors": [
                "object_info_inconsistency",
                "attr_value_mismatch" 
            ],
            "union_shard_errors": [],
            "selected_object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 0,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                },
                {
                    "osd": 1,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 1588 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                },
                {
                    "osd": 2,
                    "errors": [],
                    "size": 1588,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xa9a36536",
                    "object_info": "0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest s 0 uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])" 
                }
            ]
        }
    ]
}

Currently all we see is object_info_inconsistency and attr_value_mismatch, with no shard errors. Without snapshots there is no info from list-inconsistent-snapset, which includes some additional size checking.
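Until scrub reports this directly, the mismatch can be spotted by hand from the list-inconsistent-obj output. The following is a minimal sketch, assuming object_info is a string with the size in an "s <n> uv" field as in the sample above; the helper names oi_size and size_table are hypothetical, and SAMPLE is a trimmed copy of the report shown in this description.

```python
import re

# Assumption: the object_info string carries the size as "s <n> uv",
# as in the sample report in this description.
OI_SIZE_RE = re.compile(r"\bs (\d+) uv\b")

OI = ("0:602f83fe:::foo:head(12'1 client.4111.0:1 dirty|data_digest|omap_digest "
      "s {size} uv 1 dd a9a36536 od ffffffff alloc_hint [0 0 0])")

# Trimmed copy of the list-inconsistent-obj report above.
SAMPLE = {
    "epoch": 17,
    "inconsistents": [
        {
            "object": {"name": "foo", "snap": "head", "version": 1},
            "errors": ["object_info_inconsistency", "attr_value_mismatch"],
            "union_shard_errors": [],
            "shards": [
                {"osd": 0, "errors": [], "size": 1588, "object_info": OI.format(size=1588)},
                {"osd": 1, "errors": [], "size": 1588, "object_info": OI.format(size=1588)},
                {"osd": 2, "errors": [], "size": 1588, "object_info": OI.format(size=0)},
            ],
        }
    ],
}


def oi_size(object_info):
    """Return the size recorded in an object_info string, or None if absent."""
    match = OI_SIZE_RE.search(object_info)
    return int(match.group(1)) if match else None


def size_table(report):
    """Return (osd, disk size, oi size) for every shard in the report."""
    rows = []
    for obj in report.get("inconsistents", []):
        for shard in obj.get("shards", []):
            rows.append((shard["osd"], shard["size"],
                         oi_size(shard.get("object_info", ""))))
    return rows


if __name__ == "__main__":
    print("shard  disk size  oi size")
    for osd, disk, oi in size_table(SAMPLE):
        flag = "" if disk == oi else "  <-- disk size != oi size"
        print(f"{osd:<6} {disk:>9} {oi!s:>8}{flag}")
```

Run against the sample, this reproduces the shard table at the top of the description and flags shard 2, even though that shard's "errors" list is empty.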

In be_select_auth_object we should check a shard's disk size against its oi_size. A mismatch should be a new disk_size_shard error. This would make that shard less likely to be the authoritative one.
We should ignore system xattrs when checking for attr_value_mismatch. We will ignore strange xattr keys and never report an attr_name_mismatch.

Already present in the code:
We have the object error size_mismatch when different shards don't have the same disk size (maybe rename to disk_size_mismatch too?)
We have the shard error size_mismatch_oi which, like other _oi errors, means the disk size doesn't match the authoritative size
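To make the difference between the three size checks concrete, here is a rough Python sketch. It is not the OSD code: classify_sizes and its input layout are hypothetical, and disk_size_shard is only the name proposed in this description.

```python
def classify_sizes(shards, auth_oi_size):
    """Apply the three size checks discussed in this issue.

    shards: list of dicts with 'osd', 'disk_size' and 'oi_size' (hypothetical layout).
    auth_oi_size: size recorded in the selected (authoritative) object info.
    Returns (object_errors, shard_errors).
    """
    object_errors = set()
    shard_errors = {}

    # Existing object error: shards disagree with each other on disk size.
    if len({s["disk_size"] for s in shards}) > 1:
        object_errors.add("size_mismatch")

    for s in shards:
        errs = []
        # Existing shard error: disk size differs from the authoritative size.
        if s["disk_size"] != auth_oi_size:
            errs.append("size_mismatch_oi")
        # Proposed shard error: disk size differs from this shard's own oi size.
        if s["disk_size"] != s["oi_size"]:
            errs.append("disk_size_shard")
        shard_errors[s["osd"]] = errs

    return object_errors, shard_errors


if __name__ == "__main__":
    # Values taken from the shard table in the description.
    shards = [
        {"osd": 0, "disk_size": 1588, "oi_size": 1588},
        {"osd": 1, "disk_size": 1588, "oi_size": 1588},
        {"osd": 2, "disk_size": 1588, "oi_size": 0},
    ]
    print(classify_sizes(shards, auth_oi_size=1588))
```

With the values from the description, neither existing check fires, because every shard has the same disk size and that size matches the authoritative one. Only the proposed per-shard check flags osd 2.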


Related issues

Related to Ceph - Feature #18836: list-inconsistent-obj should show which osd is the primary Resolved 02/06/2017
Copied to RADOS - Backport #21051: luminous: Improve size scrub error handling and ignore system attrs in xattr checking Resolved

History

#1 Updated by David Zafman over 1 year ago

  • Description updated (diff)

#2 Updated by David Zafman over 1 year ago

  • Description updated (diff)

#3 Updated by David Zafman over 1 year ago

  • Description updated (diff)

#4 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Category set to Scrub/Repair
  • Component(RADOS) OSD added

#5 Updated by David Zafman over 1 year ago

  • Status changed from New to Need Review

#6 Updated by David Zafman over 1 year ago

  • Backport set to luminous

#7 Updated by David Zafman over 1 year ago

  • Status changed from Need Review to Pending Backport

#8 Updated by David Zafman over 1 year ago

  • Related to Feature #18836: list-inconsistent-obj should show which osd is the primary added

#9 Updated by Nathan Cutler about 1 year ago

  • Copied to Backport #21051: luminous: Improve size scrub error handling and ignore system attrs in xattr checking added

#10 Updated by David Zafman about 1 year ago

If we wanted to backport to Jewel it would be helpful to include this pull request first.

https://github.com/ceph/ceph/pull/15559

#11 Updated by David Zafman 10 months ago

  • Status changed from Pending Backport to Resolved
