Bug #25108

object errors found in be_select_auth_object() aren't logged the same

Added by David Zafman 5 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 07/25/2018
Due date:
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

Object errors found in be_select_auth_object() aren't logged the same way as errors found in be_compare_scrub_objects(). The errorstream returned from be_compare_scrubmaps() doesn't receive the errors from be_select_auth_object(); they are only logged with dout at the end.
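To make the reported asymmetry concrete, here is a minimal sketch of the pattern described above. It is not the Ceph implementation: the function names are simplified stand-ins, the boolean parameters and the debug_log stream (standing in for dout) are assumptions made for illustration, and the message strings are invented.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the OSD debug log that dout writes to.
static std::ostringstream debug_log;

// Stand-in for be_select_auth_object(): a problem found here is written
// only to the debug log, never to the caller's errorstream.
bool select_auth_object(bool object_info_inconsistent) {
  if (object_info_inconsistent) {
    debug_log << "object info inconsistent\n";
    return false;
  }
  return true;
}

// Stand-in for be_compare_scrub_objects(): a problem found here is
// appended to the errorstream supplied by the caller.
bool compare_scrub_objects(bool digest_mismatch, std::ostream& errorstream) {
  if (digest_mismatch) {
    errorstream << "data_digest mismatch\n";
    return false;
  }
  return true;
}

// Stand-in for be_compare_scrubmaps(): returns the text that would reach
// the cluster log. Only the compare stage contributes to it.
std::string compare_scrubmaps(bool object_info_inconsistent,
                              bool digest_mismatch) {
  std::ostringstream errorstream;
  select_auth_object(object_info_inconsistent);
  compare_scrub_objects(digest_mismatch, errorstream);
  return errorstream.str();
}
```

In this sketch, compare_scrubmaps(true, false) returns an empty string even though an error was detected, because the selection stage never writes to errorstream. That is the behavior this report describes.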


Related issues

Copied to RADOS - Backport #32106: luminous: object errors found in be_select_auth_object() aren't logged the same Resolved
Copied to RADOS - Backport #32108: mimic: object errors found in be_select_auth_object() aren't logged the same Resolved

History

#1 Updated by David Zafman 5 months ago

  • Subject changed from object errors found in be_select_auth_object() aren't logged that same to object errors found in be_select_auth_object() aren't logged the same

#2 Updated by David Zafman 5 months ago

  • Description updated (diff)

#3 Updated by David Zafman 5 months ago

I ran a subtest of osd-scrub-repair based on pull request https://github.com/ceph/ceph/pull/23217. I also added a grep log_channel $dir/osd.*.log to the test script.

$ ../qa/run-standalone.sh "osd-scrub-repair.sh TEST_corrupt_scrub_replicated" 2>&1 | tee osr.log
$ grep log_channel.*ROBJ18 osr.log
2018-07-25 20:24:36.736 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 shard 0: soid 3:33aca486:::ROBJ18:head data_digest 0xbd89c912 != data_digest 0x2ddbf8f5 from auth oi 3:33aca486:::ROBJ18:head(58'56 osd.1.0:55 dirty|omap|data_digest|omap_digest s 7 uv 54 dd 2ddbf8f5 od ddc3680f alloc_hint [0 0 255])
2018-07-25 20:24:36.736 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 shard 1: soid 3:33aca486:::ROBJ18:head data_digest 0xbd89c912 != data_digest 0x2ddbf8f5 from auth oi 3:33aca486:::ROBJ18:head(58'56 osd.1.0:55 dirty|omap|data_digest|omap_digest s 7 uv 54 dd 2ddbf8f5 od ddc3680f alloc_hint [0 0 255])
2018-07-25 20:24:36.736 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 soid 3:33aca486:::ROBJ18:head: failed to pick suitable auth object
2018-07-25 20:24:43.743 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 shard 0: soid 3:33aca486:::ROBJ18:head data_digest 0xbd89c912 != data_digest 0x2ddbf8f5 from auth oi 3:33aca486:::ROBJ18:head(58'56 osd.1.0:55 dirty|omap|data_digest|omap_digest s 7 uv 54 dd 2ddbf8f5 od ddc3680f alloc_hint [0 0 255])
2018-07-25 20:24:43.743 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 : soid 3:33aca486:::ROBJ18:head repairing object info data_digest
2018-07-25 20:24:43.743 7fab59f21700 -1 log_channel(cluster) log [ERR] : 3.0 : soid 3:33aca486:::ROBJ18:head data_digest 0xbd89c912 != data_digest 0x2ddbf8f5 from auth oi 3:33aca486:::ROBJ18:head(58'56 osd.1.0:55 dirty|omap|data_digest|omap_digest s 7 uv 54 dd 2ddbf8f5 od ddc3680f alloc_hint [0 0 255])

The first 3 log lines, ending in "failed to pick suitable auth object", are from the deep-scrub, which doesn't output anything about the object_info_inconsistency in the cluster log. The second 3 lines are from the repair. Since we cleared the shard error, it outputs shard 1's error without the shard id, and again nothing about the object_info_inconsistency.

#4 Updated by David Zafman 4 months ago

  • Status changed from New to In Progress
  • Backport set to mimic, luminous

#5 Updated by David Zafman 4 months ago

Kefu:
My concern is that we don't reset object_error before moving to another ScrubMap. So once we identify an error in a certain shard of the current object, we will always go to this branch if there is no shard error in succeeding shards. But the reason we go to this branch is not necessarily that object_error.errors was contributed by the current shard.

Zafman:
Good catch, but it will require a follow-on fix. I now realize that we had a bug when I introduced some inconsistency checking in be_select_auth_object(). This change actually improves repair of those inconsistencies at the expense of repairing all shards instead of just the ones that don't match the authoritative shard. This is a result of not resetting object_error.errors.

I had already filed a tracker http://tracker.ceph.com/issues/25108 to cover a logging issue with errors found in be_select_auth_object(). But really I need to move all error checking out of be_select_auth_object() and into be_compare_scrub_objects(). This will fix the logging issue. Then we can return to } else if (found) { which is only needed for logging. At that point the set<pg_shard_t> object_errors can go away if auth_list can't be empty. I'm not totally sure about that yet.
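The effect of not resetting object_error.errors between shards can be sketched as follows. This models only the accumulation pattern raised in the review comment, not the real scrub code: the Shard struct, the flag_shards function, and the reset_per_shard switch are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-shard result: a bitmask of errors found for this shard.
struct Shard {
  int id;
  std::uint32_t errors;
};

// Returns the ids of shards that would be treated as needing repair.
// With reset_per_shard == false, the error mask carries over from one
// shard to the next, mirroring the behavior discussed above.
std::vector<int> flag_shards(const std::vector<Shard>& shards,
                             bool reset_per_shard) {
  assert(!shards.empty());  // a scrubbed object has at least one shard
  std::vector<int> flagged;
  std::uint32_t object_errors = 0;
  for (const Shard& s : shards) {
    if (reset_per_shard) {
      object_errors = 0;  // start clean for every shard
    }
    object_errors |= s.errors;
    if (object_errors != 0) {
      flagged.push_back(s.id);  // reached the "has errors" branch
    }
  }
  return flagged;
}
```

With three shards where only shard 0 has an error, the non-resetting variant flags shards 0, 1, and 2, while the resetting variant flags only shard 0. This corresponds to the trade-off noted above: all shards get repaired instead of only the ones that don't match the authoritative shard.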

#6 Updated by Kefu Chai 4 months ago

  • Status changed from In Progress to Need Review

#7 Updated by David Zafman 4 months ago

  • Status changed from Need Review to Testing

#8 Updated by David Zafman 4 months ago

  • Status changed from Testing to Pending Backport

#9 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #32106: luminous: object errors found in be_select_auth_object() aren't logged the same added

#10 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #32108: mimic: object errors found in be_select_auth_object() aren't logged the same added

#11 Updated by Nathan Cutler 3 months ago

  • Status changed from Pending Backport to Resolved
