Conversation on IRC:
davidz
I don't understand your description at all. Scrubbing isn't supposed to happen in the test. We are trying to scrub and expecting, for example, the error message "not scheduling scrubs due to active recovery".
Please include the test output that fails in this tracker.
7:02
huangjun
7:02
63 ceph osd pool set $poolname size 4
68 run_in_background pids pg_scrub $poolid.$(echo "{ obase=16; $pg }" | bc | tr '[:upper:]' '[:lower:]')
that will do scrub?
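For readers following along: the `bc | tr` pipeline on line 68 only converts the decimal PG number to lowercase hex, so that the argument to pg_scrub is a PG id such as 1.a. A minimal sketch of the same conversion, assuming a hypothetical helper name `pg_id` and using `printf '%x'` in place of `bc` (the test itself uses `bc`):

```shell
# Hypothetical helper (not part of ceph-helpers.sh): build a PG id from a
# pool id and a decimal PG number, using lowercase hex for the PG part.
pg_id() {
    local poolid="$1"
    local pg="$2"
    # printf '%x' prints lowercase hex, like the bc | tr pipeline on line 68.
    printf '%s.%x\n' "$poolid" "$pg"
}

pg_id 1 10    # prints 1.a
pg_id 1 26    # prints 1.1a
```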
7:04
davidz
Yes, but we are testing that scrub is not allowed during recovery. Line 63 creates a recovery requirement. Line 68 tries to scrub. Later we check for the error messages. (No scrubbing should happen.) That can't be what is causing a test failure.
7:07
huangjun
https://pastebin.mozilla.org/9073805
7:08
davidz
Hmm… I'm running the test now… weird. Now I don't know why it is using the pg_scrub function, which waits for the scrub stamp to advance, but maybe it shouldn't advance.
7:09
huangjun
do you run the qa test script locally, or your own scripts?
7:10
davidz
When I use run-standalone.sh to run it, it passes even though it uses the pg_scrub function, which waits for the scrub. Not sure why that works, actually.
If we change the pg_scrub to "ceph pg scrub …." instead, we might need to add a short delay before checking for the error messages.
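One way to get the "short delay" without a fixed sleep is to poll the log until the message shows up or a timeout expires. This is only a sketch under assumptions: `wait_for_log_message`, its arguments, and the log file path are hypothetical and do not exist in ceph-helpers.sh.

```shell
# Hypothetical helper: poll a log file once per second until a fixed
# string appears, or give up after a timeout (default 10 seconds).
wait_for_log_message() {
    local logfile="$1"
    local message="$2"
    local timeout="${3:-10}"
    local elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -F matches the message as a literal string, not a regex.
        if grep -q -F -- "$message" "$logfile" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "Missing log message '$message' in $logfile" >&2
    return 1
}

# Usage (the log path is an assumption):
# wait_for_log_message "$logfile" "not scheduling scrubs due to active recovery" 5
```

Polling returns as soon as the message is logged, so a slow machine gets the full timeout while a fast one does not wait longer than needed.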
7:12
huangjun
should we add 'pg_scrub' here?
what's the point of checking here?
i think we should wait for all pgs to be active+clean, and then do the recovery scrub test?
7:15
davidz
The current test creates a pool and uses wait_for_clean to make sure all PGs are active+clean. Unless on slow machines the sleep 1 in create_pool needs to be increased.
7:17
huangjun
uhh, i mean after we set the pool size to 4, we should wait for the pgs to be active+clean, but not do pg_scrub
actually, the scrub will spend hours during backfill
7:19
davidz
That defeats the purpose of the test. Backfill doesn't necessarily take a long time. We only create 4 objects. It will do recovery and with 4 objects it is very fast.
As a matter of fact, on a very fast machine the test could fail to find the error messages, if the recovery happens so fast that all PGs are active+clean before the scrub starts. In that case the test will fail with "Missing log message ….", and on such a machine you should increase the OBJECTS value.
Test like this:
../qa/run-standalone.sh osd-recovery-scrub.sh 2>&1 | tee ors.log
7:22
huangjun
yes, backfill is fast, but the scrub has the problem described in the description: must_scrub is reset to false after on_change, so the scrub will be scheduled hours later
7:22
davidz
If the test fails, attach the log to the tracker.
In my test run scrub happens in less than a minute: 99700: ../qa/standalone/ceph-helpers.sh:1616: wait_for_scrub: test '2017-11-29 19:17:35.858171' '>' '2017-11-29 19:16:52.324631'
I don't know how you relate what this test is trying to do to the internal value "must_scrub".
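The trace line quoted above shows wait_for_scrub comparing the two scrub stamps with `test ... '>' ...`, which is a string comparison. That works here because the stamps are fixed-width and ordered from year down to microseconds, so lexical order matches time order. A minimal sketch, assuming a hypothetical wrapper name `stamp_advanced`:

```shell
# Hypothetical wrapper: succeed if the new stamp sorts after the old one.
# This is a string comparison; it relies on the fixed-width
# 'YYYY-MM-DD HH:MM:SS.ffffff' format seen in the trace.
stamp_advanced() {
    local new="$1"
    local old="$2"
    test "$new" '>' "$old"
}

# Values taken from the trace line in the log.
if stamp_advanced '2017-11-29 19:17:35.858171' '2017-11-29 19:16:52.324631'; then
    echo "scrub stamp advanced"
fi
```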
7:25
huangjun
ok, i will run the test locally