Bug #21876


Command failed qa/standalone/scrub/osd-recovery-scrub.sh

Added by huang jun over 6 years ago. Updated over 6 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
David Zafman
Category:
OSD
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In this test, PG 1.10 was not scrubbed before the 300s timeout expired.
PG 1.10 was registered for scrub while the PG was in the backfilling state,
and it was never scheduled, even after backfill finished.
When backfill completed, the PG restarted its peering procedure, which unregistered the PG from the scrub queue
and reset scrubber.must_scrub to false in on_change().
The PG was then re-registered in the scrub queue, but at that point
the next scrub time was set to hours later because scrubber.must_scrub was false,
which finally fails the test.
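
To make the timing concrete, here is a minimal bash sketch of the scheduling effect being described. This is an illustration, not Ceph's actual C++ scrub-job code; the variable names are invented, and the one-day figure is osd_scrub_min_interval's default.

    # Minimal sketch (assumed simplification of the OSD's scrub scheduling):
    # with must_scrub=true the PG is eligible immediately; once on_change()
    # resets must_scrub to false, the re-registered PG is queued relative to
    # its last scrub stamp plus osd_scrub_min_interval (default one day),
    # so the test's 300s timeout expires first.
    last_scrub_stamp=$(date +%s)
    osd_scrub_min_interval=$((24 * 60 * 60))
    must_scrub=false

    if $must_scrub; then
        sched_time=$last_scrub_stamp                              # scrub ASAP
    else
        sched_time=$((last_scrub_stamp + osd_scrub_min_interval)) # hours away
    fi
    echo "next scrub not before: $(date -d @"$sched_time")"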

Actions #1

Updated by huang jun over 6 years ago

@dzafman can we keep scrubber.must_scrub set across the interval change, and only reset it to false in scrub_finish()?

Actions #2

Updated by Sage Weil over 6 years ago

If I'm following, I'm not sure that leaving must_scrub set until the end of the scrub will help. I think the problem is that scrub_clear_state() was called when backfill finished and peering reset. That wouldn't change... and probably shouldn't, since we don't try to preserve the must_scrub flag across peering intervals. I'm not sure if we should?

Actions #3

Updated by huang jun over 6 years ago

So the PG may be scheduled hours later and fail the test case;
would it be better to modify the qa case?

Actions #4

Updated by David Zafman over 6 years ago

  • Assignee set to David Zafman
Actions #5

Updated by David Zafman over 6 years ago

Conversation on IRC:

davidz
I don't understand your description at all. Scrubbing isn't supposed to happen in the test. We are trying to scrub and expecting, for example, error messages like "not scheduling scrubs due to active recovery".
Please include the test output that fails in this tracker.

7:02
huangjun
63 ceph osd pool set $poolname size 4
68 run_in_background pids pg_scrub $poolid.$(echo "{ obase=16; $pg }" | bc | tr '[:upper:]' '[:lower:]')
that will do a scrub?
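
For reference, the bc pipeline on the quoted line 68 just converts the decimal pg number to lower-case hex to form the pgid; e.g. pool 1, pg 16 gives the PG 1.10 from the description. A quick way to check:

    pg=16
    echo "obase=16; $pg" | bc | tr '[:upper:]' '[:lower:]'   # -> 10
    printf '%d.%x\n' 1 "$pg"                                 # -> 1.10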

7:04
davidz
Yes, but we are testing that scrub is not allowed during recovery. Line 63 creates a recovery requirement; line 68 tries to scrub; later we check for the error messages. (No scrubbing should happen.) That can't be what is causing a test failure.

7:07
huangjun
https://pastebin.mozilla.org/9073805

7:08
davidz
Hmm… I'm running the test now… weird. Now I don't know why it is using the pg_scrub function, which waits for the scrub stamp to advance; maybe it shouldn't advance.

7:09
huangjun
do you run the qa test script locally, or your own scripts?

7:10
davidz
When I use run-standalone.sh to run it, it passes even though it uses the pg_scrub function, which waits for the scrub. Not sure why that works, actually.
If we change pg_scrub to "ceph pg scrub …." instead, we might need to add a short delay before checking for the error messages.
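
A rough sketch of that alternative (the variable names, the log path, and the five-second delay are assumptions, not the helper's actual code):

    ceph pg scrub "$poolid.$pghex"                 # fire the scrub request without waiting
    sleep 5                                        # assumed settling delay before checking
    grep "not scheduling scrubs" "$dir"/osd.*.log  # look for the expected error message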

7:12
huangjun
should we add 'pg_scrub' here?
what's the point of checking here?
i think we should wait for all PGs to be active+clean, and then do the recovery scrub test?

7:15
davidz
The current test creates a pool and uses wait_for_clean to make sure all PGs are active+clean (though on slow machines the sleep 1 in create_pool may need to be increased).

7:17
huangjun
uhh, i mean after we set the pool size to 4, we should wait for the PGs to become active+clean, but not do pg_scrub
actually, the scrub will take hours during backfill

7:19
davidz
That defeats the purpose of the test. Backfill doesn't necessarily take a long time. We only create 4 objects, so it will do recovery, and with 4 objects it is very fast.
As a matter of fact, on a very fast machine the test could fail to find the error messages, if recovery happens so fast that all PGs are active+clean before the scrub starts; the test would then fail with "Missing log message ….". On a very fast machine you should increase the OBJECTS value.
Test like this:
../qa/run-standalone.sh osd-recovery-scrub.sh 2>&1 | tee ors.log
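
For illustration, the OBJECTS knob being discussed controls how many objects the test writes before forcing recovery; more objects stretch out recovery and keep the scrub-during-recovery window open on fast machines. A sketch of the idea (the loop and payload are illustrative, not the script verbatim):

    OBJECTS=16                                        # raise from the default on fast machines
    for i in $(seq 1 $OBJECTS); do
        rados -p "$poolname" put "obj$i" /etc/hosts   # small illustrative payload
    done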

7:22
huangjun
yes, backfill is fast, but the scrub has the problem described in the description: must_scrub is reset to false after on_change, so the scrub will be scheduled hours later

7:22
davidz
If the test fails, attach the log to the tracker.
In my test run the scrub happens in less than a minute: 99700: ../qa/standalone/ceph-helpers.sh:1616: wait_for_scrub: test '2017-11-29 19:17:35.858171' '>' '2017-11-29 19:16:52.324631'
I don't know how you relate what this test is trying to do to the internal value "must_scrub".

7:25
huangjun
ok, i will run the test locally

Actions #6

Updated by David Zafman over 6 years ago

The pastebin is basically wait_for_scrub timing out.

2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:19533: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1599: wait_for_scrub: (( i < 300 ))
2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:19533: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1605: wait_for_scrub: return 1
2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1715: run_in_background: return 1

None of the PGs time out for me. Here we see one of the PGs scrub in less than a minute.

99700: ../qa/standalone/ceph-helpers.sh:1616: wait_for_scrub: test '2017-11-29 19:17:35.858171' '>' '2017-11-29 19:16:52.324631'
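
For reference, wait_for_scrub in ceph-helpers.sh boils down to a polling loop like the following. This is a paraphrased sketch under stated assumptions, not the helper verbatim; the jq path assumes the JSON layout of "ceph pg <pgid> query".

    wait_for_scrub_sketch() {
        local pgid=$1 old_stamp=$2
        for ((i = 0; i < 300; i++)); do    # the 300 one-second tries seen in the log
            local stamp
            stamp=$(ceph pg "$pgid" query | jq -r '.info.stats.last_scrub_stamp')
            test "$stamp" '>' "$old_stamp" && return 0   # stamp advanced: scrubbed
            sleep 1
        done
        return 1                           # timed out, as in the failure above
    }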

Actions #7

Updated by huang jun over 6 years ago

can you try again with the OBJECTS number set to 16?

Actions #8

Updated by David Zafman over 6 years ago

It passes with OBJECTS=16

Actions #9

Updated by David Zafman over 6 years ago

  • Status changed from New to Can't reproduce

Reopen if there is a way to reproduce this reliably.
