Bug #13592

closed

ceph-helpers: TEST_auto_repair_erasure_coded intermittent failures

Added by Loïc Dachary over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

./test/osd/osd-scrub-repair.sh:182: TEST_auto_repair_erasure_coded:  objectstore_tool testdir/osd-scrub-repair 0 SOMETHING list-attrs
../qa/workunits/ceph-helpers.sh:792: objectstore_tool:  local dir=testdir/osd-scrub-repair
../qa/workunits/ceph-helpers.sh:793: objectstore_tool:  shift
../qa/workunits/ceph-helpers.sh:794: objectstore_tool:  local id=0
../qa/workunits/ceph-helpers.sh:795: objectstore_tool:  shift
../qa/workunits/ceph-helpers.sh:796: objectstore_tool:  local osd_data=testdir/osd-scrub-repair/0
../qa/workunits/ceph-helpers.sh:798: objectstore_tool:  kill_daemons testdir/osd-scrub-repair TERM osd.0
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons:  local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons:  return 0
../qa/workunits/ceph-helpers.sh:800: objectstore_tool:  ceph-objectstore-tool --data-path testdir/osd-scrub-repair/0 --journal-path testdir/osd-scrub-repair/0/journal SOMETHING list-attrs
No object id 'SOMETHING' found
../qa/workunits/ceph-helpers.sh:802: objectstore_tool:  return 1

Files

log.txt (416 KB) log.txt Loïc Dachary, 10/25/2015 11:11 PM
#8

Updated by Sage Weil over 8 years ago

  • Assignee set to Loïc Dachary
#9

Updated by Loïc Dachary over 8 years ago

  • Status changed from New to Need More Info
#13

Updated by Xinze Chi over 8 years ago

I think the reason is that the scrub is not being scheduled by the OSD:
sched_scrub is only called on about 33% of ticks (OSD::scrub_random_backoff()), and after that it may take additional time before the scrub of this particular PG is scheduled (the OSD loops through all PGs).
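The 1-in-3 gate described above can be sketched in shell (a hypothetical stand-in for the C++ OSD::scrub_random_backoff(), not the actual implementation), to show why a fixed 20 s wait can miss the scheduling window:

```shell
#!/bin/bash
# Hypothetical sketch of a 1-in-3 random backoff, mimicking the behaviour
# attributed to OSD::scrub_random_backoff(): on each tick, scrub
# scheduling only proceeds about one time in three.
scrub_random_backoff() {
    # RANDOM is 0..32767, so this succeeds roughly 1/3 of the time.
    (( RANDOM % 3 == 0 ))
}

# Count how often scheduling would proceed over many simulated ticks.
ticks=30000
scheduled=0
for ((i = 0; i < ticks; i++)); do
    if scrub_random_backoff; then
        scheduled=$((scheduled + 1))
    fi
done
echo "scheduled $scheduled of $ticks ticks"
```

Over many ticks the pass rate converges to ~33%, but any individual short window (such as 20 s of ticks) can see far fewer passes, which is consistent with the intermittent failures.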

#14

Updated by Xinze Chi over 8 years ago

I think we could schedule the scrub manually (using the ceph pg scrub command).

#15

Updated by Xinze Chi over 8 years ago

  1. Remove the object from one shard physically:
     objectstore_tool $dir $(get_not_primary $poolname SOMETHING) SOMETHING remove || return 1
  2. Give some time for auto repair:
     sleep 20

20 s may not be enough to schedule the scrub. Maybe we could wait until the scrub stamp changes instead (with a timeout strategy allowing more than 20 s)?
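A minimal sketch of that wait-until-the-stamp-changes loop, with a timeout. The helper names here are hypothetical, and the stamp source is stubbed with a file; a real test would query the cluster for the PG's last_scrub_stamp:

```shell
#!/bin/bash
# Hypothetical sketch: poll the scrub stamp until it changes, with a
# timeout, instead of a fixed "sleep 20".
stamp_file=$(mktemp)
echo "stamp-1" > "$stamp_file"

get_last_scrub_stamp() {
    # Stub: a real helper would ask the cluster for the PG's
    # last_scrub_stamp rather than read a file.
    cat "$stamp_file"
}

wait_for_scrub() {
    local initial=$1 timeout=$2 waited=0
    # Loop until the stamp differs from its initial value, or give up
    # after $timeout seconds.
    while [ "$(get_last_scrub_stamp)" = "$initial" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Simulate the scrub completing (stamp changing) after 2 seconds.
( sleep 2; echo "stamp-2" > "$stamp_file" ) &

initial=$(get_last_scrub_stamp)
if wait_for_scrub "$initial" 30; then
    result="scrub observed"
else
    result="timed out"
fi
echo "$result"
wait
rm -f "$stamp_file"
```

The loop returns as soon as the stamp changes and only fails after the full timeout, so the test no longer depends on the scrub landing inside an arbitrary 20 s window.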

#16

Updated by Loïc Dachary over 8 years ago

  • Status changed from Need More Info to Fix Under Review
#18

Updated by Loïc Dachary over 8 years ago

  • Status changed from Fix Under Review to Resolved