Bug #21876: Command failed qa/standalone/scrub/osd-recovery-scrub.sh - Ceph - Ceph

Bug #21876

closed

Command failed qa/standalone/scrub/osd-recovery-scrub.sh

Added by huang jun over 6 years ago. Updated over 6 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: David Zafman
Category: OSD
Target version:
% Done:
Source:
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions: v12.2.1
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In this test, pg 1.10 was not scrubbed before the 300 s timeout expired.
PG 1.10 was registered for scrub while it was in the backfilling state,
but it was not scheduled even after backfill finished.
When the PG restarts peering, on_change() unregisters it from the scrub queue
and sets scrubber.must_scrub to false.
The PG is then registered in the scrub queue again, but at that point
the next scrub time is set hours later because scrubber.must_scrub=false,
and that finally fails the test.
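
The scheduling behaviour described above can be illustrated with a toy model. This is only a sketch of the reporter's hypothesis, not the actual Ceph code: the function name next_scrub_time, its arguments, and the one-day interval are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Toy model of the hypothesis in the description; NOT the real Ceph logic.
# All names and values here are illustrative assumptions.

# next_scrub_time MUST_SCRUB NOW LAST_SCRUB_STAMP MIN_INTERVAL
# Prints the time (in seconds) at which the PG becomes eligible for scrub.
next_scrub_time() {
    local must_scrub=$1 now=$2 last_scrub_stamp=$3 min_interval=$4
    if [ "$must_scrub" = "true" ]; then
        # A requested scrub is eligible immediately.
        echo "$now"
    else
        # A periodic scrub waits a full interval after the previous scrub.
        echo $(( last_scrub_stamp + min_interval ))
    fi
}

now=1000
last_scrub_stamp=990   # the pool was just created, so the stamp is recent
min_interval=86400     # assumed interval of one day, in seconds

# While the requested scrub is still pending:
echo "must_scrub=true:  wait $(( $(next_scrub_time true "$now" "$last_scrub_stamp" "$min_interval") - now ))s"
# After the flag has been cleared and the PG re-registers:
echo "must_scrub=false: wait $(( $(next_scrub_time false "$now" "$last_scrub_stamp" "$min_interval") - now ))s"
```

In this model, clearing the flag turns an immediate scrub into one that waits almost a full interval, which would exceed the 300 s wait in the test.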

History

Updated by huang jun over 6 years ago

@dzafman can we avoid resetting scrubber.must_scrub there and set it to false in scrub_finish instead?

Updated by Sage Weil over 6 years ago

If I'm following, I'm not sure that leaving must_scrub set until the end of the scrub will help. I think the problem is that scrub_clear_state() was called when backfill finished and peering reset. That wouldn't change... and probably shouldn't, since we don't try to preserve the must_scrub flag across peering intervals. I'm not sure if we should?

Updated by huang jun over 6 years ago

So the PG may be scheduled hours later and fail the test case.
Would it be better to modify the QA test case?

Updated by David Zafman over 6 years ago

Assignee set to David Zafman

Updated by David Zafman over 6 years ago

Conversation on IRC:

davidz
I don't understand your description at all. Scrubbing isn't supposed to happen in the test. We are trying to scrub and expecting, for example, error messages like "not scheduling scrubs due to active recovery".
Please include the test output that fails in this tracker.

7:02
huangjun
7:02
63 ceph osd pool set $poolname size 4
68 run_in_background pids pg_scrub $poolid.$(echo "{ obase=16; $pg }" | bc | tr '[:upper:]' '[:lower:]')
Won't that do a scrub?

7:04
davidz
Yes, but we are testing not allowing scrub during recovery. 63 -> Create a recovery requirement 68 -> Try to scrub … Later check for error messages. (No scrubbing should happen). That can't be what is causing a test failure.
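
For reference, line 68 of the script builds the PG id by converting the decimal PG index to lowercase hexadecimal. A minimal sketch of that naming, assuming printf '%x' as a stand-in for the bc and tr pipeline (the helper name pg_name is hypothetical and not part of the test script):

```shell
#!/usr/bin/env bash
# Sketch only: pg_name is a hypothetical helper, not part of ceph-helpers.sh.
# It uses printf '%x' in place of the bc/tr pipeline quoted from line 68.

# pg_name POOLID PG_INDEX
# Prints the PG id as "<poolid>.<index in lowercase hex>".
pg_name() {
    local poolid=$1 pg=$2
    printf '%s.%x\n' "$poolid" "$pg"
}

pg_name 1 16
```

Under this naming, the pg 1.10 from the description is PG index 16 in pool 1.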

7:07
huangjun
https://pastebin.mozilla.org/9073805

7:08
davidz
Hmm… I'm running the test now… Weird. I don't know why it is using the pg_scrub function, which waits for the scrub stamp to advance, when maybe the stamp shouldn't advance.

7:09
huangjun
Do you run the QA test script locally, or your own scripts?

7:10
davidz
When I use run-standalone.sh to run it, it passes even though it uses the pg_scrub function, which waits for the scrub. Not sure why that works, actually.
If we change the pg_scrub to "ceph pg scrub …." instead, we might need to add a short delay before checking for the error messages.

7:12
huangjun
Should we use 'pg_scrub' here?
What is the point of the check here?
I think we should wait for all PGs to be active+clean, and then do the recovery scrub test.

7:15
davidz
The current test creates a pool and uses wait_for_clean to make sure all PGs are active+clean. That should be enough, unless the sleep 1 in create_pool needs to be increased on slow machines.

7:17
huangjun
Uh, I mean that after we set the pool size to 4, we should wait for the PGs to be active+clean rather than call pg_scrub.
Actually, a scrub requested during backfill can end up taking hours.

7:19
davidz
That defeats the purpose of the test. Backfill doesn't necessarily take a long time. We only create 4 objects. It will do recovery and with 4 objects it is very fast.
As a matter of fact on a very fast machine, the test could fail to find the error messages, if the recovery happens so fast that all PGs are active+clean before the scrub starts. On a very fast machine, you should increase the OBJECTS value. But in that case the test will fail with "Missing log message …."
Test like this:
../qa/run-standalone.sh osd-recovery-scrub.sh 2>&1 | tee ors.log
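
The check for the expected message can be sketched roughly as below. This is an illustration, not the test's actual code: the helper name count_log_message and the sample log lines are made up, and only the message text comes from this conversation.

```shell
#!/usr/bin/env bash
# Sketch only: count_log_message and the sample log content are hypothetical.

# count_log_message LOGFILE MESSAGE
# Prints how many lines of LOGFILE contain MESSAGE as a fixed string.
count_log_message() {
    local logfile=$1 message=$2
    # grep -c prints 0 and exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -c -F -- "$message" "$logfile" || true
}

# Demonstration on a temporary file with invented log lines.
log=$(mktemp)
printf '%s\n' \
    'osd.0 not scheduling scrubs due to active recovery' \
    'osd.1 some unrelated line' > "$log"

if [ "$(count_log_message "$log" 'not scheduling scrubs due to active recovery')" -gt 0 ]; then
    echo "found expected message"
else
    echo "expected message missing"
fi
rm -f "$log"
```

If recovery finishes before the scrub request arrives, a check of this kind finds no matching lines, which is the fast-machine failure mode described above.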

7:22
huangjun
Yes, backfill is fast, but the scrub has the problem described in the description: must_scrub is reset to false after on_change, so the scrub is scheduled hours later.

7:22
davidz
If the test fails, attach the log to the tracker.
In my test run scrub happens in less than a minute: 99700: ../qa/standalone/ceph-helpers.sh:1616: wait_for_scrub: test '2017-11-29 19:17:35.858171' '>' '2017-11-29 19:16:52.324631'
I don't know how you relate what this test is trying to do to the internal value "must_scrub".

7:25
huangjun
ok, i will run the test locally

Updated by David Zafman over 6 years ago

The pastebin is basically wait_for_scrub timing out.

2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:19533: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1599: wait_for_scrub: (( i < 300 ))
2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:19533: /home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1605: wait_for_scrub: return 1
2017-11-30T00:52:38.916 INFO:tasks.workunit.client.0.mira037.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1715: run_in_background: return 1

None of the PGs time out for me. Here we see one of the PGs scrub in less than a minute.

99700: ../qa/standalone/ceph-helpers.sh:1616: wait_for_scrub: test '2017-11-29 19:17:35.858171' '>' '2017-11-29 19:16:52.324631'
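
The log lines above suggest a polling loop that compares scrub stamps as strings and gives up after 300 iterations. A minimal sketch of that pattern follows, assuming a caller-supplied command that prints the current stamp. The names wait_for_stamp_advance and fake_stamp are hypothetical, and this is not the actual wait_for_scrub implementation.

```shell
#!/usr/bin/env bash
# Sketch only: not the real wait_for_scrub from ceph-helpers.sh.
# wait_for_stamp_advance and fake_stamp are hypothetical names.

# wait_for_stamp_advance OLD_STAMP GET_STAMP_CMD [TRIES] [DELAY]
# Returns 0 once GET_STAMP_CMD prints a stamp newer than OLD_STAMP,
# or 1 after TRIES unsuccessful polls.
wait_for_stamp_advance() {
    local old_stamp=$1 get_stamp=$2 tries=${3:-300} delay=${4:-1}
    local i=0
    while [ "$i" -lt "$tries" ]; do
        # Stamps in "YYYY-MM-DD HH:MM:SS.ffffff" form sort lexically,
        # so a string comparison detects a newer stamp.
        if test "$("$get_stamp")" '>' "$old_stamp"; then
            return 0
        fi
        sleep "$delay"
        i=$(( i + 1 ))
    done
    return 1
}

# Stand-in for querying the PG's last scrub stamp.
fake_stamp() { echo '2017-11-29 19:17:35.858171'; }

# The stamp is newer than the old one, so the wait succeeds.
wait_for_stamp_advance '2017-11-29 19:16:52.324631' fake_stamp 3 0 && echo "scrubbed"
# The stamp never advances, so the wait gives up.
wait_for_stamp_advance '2017-11-29 19:17:35.858171' fake_stamp 3 0 || echo "timed out"
```

With the default of 300 tries and a one-second delay, a stamp that never advances produces the "return 1" seen in the failing log.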

Updated by huang jun over 6 years ago

Can you try again with the OBJECTS number set to 16?

Updated by David Zafman over 6 years ago

It passes with OBJECTS=16

Updated by David Zafman over 6 years ago

Status changed from New to Can't reproduce

Reopen if there is a way to reproduce this reliably.
