Bug #56034
qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3()
0%
Description
/a/yuriw-2022-06-13_16:36:31-rados-wip-yuri7-testing-2022-06-13-0706-distro-default-smithi/6876516
Also historically seen in:
/a/yuriw-2022-03-18_00:42:20-rados-wip-yuri6-testing-2022-03-17-1547-distro-default-smithi/6743671
Failure Reason:
Command failed (workunit test osd/divergent-priors.sh) on smithi121 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=8f8ee34a603512f9baef4f6eff0e1fc5deb2cda4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh'
Logs of Interest:
During the beginning of test TEST_divergent_3(), after the pool 'test' is created and the 'wait_for_clean' stage, the 'pg dump' output shows a bunch of PGs in unknown state with 'up_primary' field is set to -1. The current test code just picks the first PG from the pg dump output which turns out to be a PG in 'unknown' state and with 'up_primary' set to -1. The test subsequently fails because of this invalid osd id (-1). The logic in the test perhaps needs to pick the correct PG in the active+clean state i.e. pg 1.0 in this case. See logs below:
2022-06-13T17:15:18.882 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1667: wait_for_clean: get_num_pgs 2022-06-13T17:15:18.883 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs: ceph --format json status 2022-06-13T17:15:18.883 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs: jq .pgmap.num_pgs 2022-06-13T17:15:19.392 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1667: wait_for_clean: test 1 == 0 2022-06-13T17:15:19.393 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1671: wait_for_clean: true 2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1675: wait_for_clean: get_num_active_clean 2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1382: get_num_active_clean: local expression 2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1383: get_num_active_clean: expression+='select(contains("active") and contains("clean")) | ' 2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1384: get_num_active_clean: expression+='select(contains("stale") | not)' 2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1385: get_num_active_clean: ceph --format json pg dump pgs 2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1386: get_num_active_clean: jq '.pg_stats | [.[] | .state | select(contains("active") and contains("clean")) | select(contains("stale") | not)] | length' 2022-06-13T17:15:19.820 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1675: wait_for_clean: cur_active_clean=1 2022-06-13T17:15:19.822 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean: get_num_pgs 2022-06-13T17:15:19.823 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs: ceph --format json status 2022-06-13T17:15:19.823 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs: jq .pgmap.num_pgs 2022-06-13T17:15:20.348 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean: test 1 = 1 2022-06-13T17:15:20.349 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean: break 2022-06-13T17:15:20.349 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1691: wait_for_clean: return 0 2022-06-13T17:15:20.351 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3: ceph pg dump pgs --format=json 2022-06-13T17:15:20.351 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3: jq '.pg_stats[0].up_primary' 2022-06-13T17:15:20.738 INFO:tasks.workunit.client.0.smithi121.stderr:dumped pgs 2022-06-13T17:15:20.747 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3: local divergent=-1 2022-06-13T17:15:20.748 INFO:tasks.workunit.client.0.smithi121.stdout:primary and soon to be divergent is -1 2022-06-13T17:15:20.748 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:658: TEST_divergent_3: echo 'primary and soon to be divergent is -1' 2022-06-13T17:15:20.749 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:659: TEST_divergent_3: ceph pg dump pgs 2022-06-13T17:15:21.142 INFO:tasks.workunit.client.0.smithi121.stdout:PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED 2022-06-13T17:15:21.142 INFO:tasks.workunit.client.0.smithi121.stdout:1.1f 0 0 0 0 0 0 0 0 0 0 unknown 2022-06-13T17:15:19.879154+0000 0'0 0:0 [] -1 [] -1 0'0 2022-06-13T17:15:19.879154+0000 0'0 2022-06-13T17:15:19.879154+0000 0 0 -- 0 0 ... ... 2022-06-13T17:15:21.149 INFO:tasks.workunit.client.0.smithi121.stdout:1.0 0 0 0 0 0 0 0 0 0 0 active+clean 2022-06-13T17:15:08.240142+0000 0'0 28:12 [1,0,2] 1 [1,0,2] 1 0'0 2022-06-13T17:15:07.094983+0000 0'0 2022-06-13T17:15:07.094983+0000 0 0 periodic scrub scheduled @ 2022-06-15T01:14:09.779610+0000 0 0 2022-06-13T17:15:21.149 INFO:tasks.workunit.client.0.smithi121.stdout:1.d 0 0 0 0 0 0 0 0 0 0 unknown 2022-06-13T17:15:19.879154+0000 0'0 0:0 [] -1 [] -1 0'0 2022-06-13T17:15:19.879154+0000 0'0 2022-06-13T17:15:19.879154+0000 0 0 -- 0 0 ...
Other Observations:
During the creation of pools in the tests, the logs indicate that the pool already exists:
2022-06-13T17:05:55.189 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:75: TEST_divergent: create_pool test 1 1 2022-06-13T17:05:55.190 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:541: create_pool: ceph osd pool create test 1 1 2022-06-13T17:05:55.858 INFO:tasks.workunit.client.0.smithi121.stderr:pool 'test' already exists ...
But at the end of the test when the same pool is attempted to be deleted as part of the clean-up, the test reports that the pool doesn't exist!
2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:221: TEST_divergent: echo success 2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:223: TEST_divergent: delete_pool test 2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:546: delete_pool: local poolname=test 2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:547: delete_pool: ceph osd pool delete test test --yes-i-really-really-mean-it 2022-06-13T17:08:10.829 INFO:tasks.workunit.client.0.smithi121.stderr:pool 'test' does not exist ...
Related issues
History
#1 Updated by Laura Flores over 1 year ago
Another detail to note is that this particular test has the pg autoscaler enabled, as opposed to TEST_divergent_2(), which is the same test with the autoscaler disabled. (see https://github.com/ceph/ceph/blob/390e2a0856d3f0870a8845d8d919d9e50ba4ff48/qa/standalone/osd/divergent-priors.sh#L625-L626)
#2 Updated by Kamoltat (Junior) Sirivadhna over 1 year ago
/a/yuriw-2022-06-29_13:30:16-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6905537
#3 Updated by Radoslaw Zarzynski over 1 year ago
This looks like a test failure, so nor terribly high priority.
#4 Updated by Kamoltat (Junior) Sirivadhna 12 months ago
/a/ksirivad-2022-12-15_06:28:05-rados-wip-ksirivad-testing-main-distro-default-smithi/7118004/
#5 Updated by Laura Flores 8 months ago
- Tags set to test-failure
/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934
#6 Updated by Radoslaw Zarzynski 8 months ago
The direct root cause of the failure is that we tried to start an OSD with ID set to -1
:
rzarzynski@teuthology:/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934$ find ./ | grep osd ./remote/smithi072/log/osd.0.log ./remote/smithi072/log/osd.1.log ./remote/smithi072/log/osd.2.log ./remote/smithi072/log/osd.-1.log
rzarzynski@teuthology:/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934$ cat ./remote/smithi072/log/osd.-1.log 2023-03-27T07:22:24.606+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:24.608+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:24.608+0000 7f7c6ff76640 0 ceph version 18.0.0-3182-g5e717292 (5e717292106ca2d310770101bfebb345837be8e1) reef (dev), process ceph-osd, pid 133908 2023-03-27T07:22:24.609+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:24.609+0000 7f7c6ff76640 -1 must specify '-i #' where # is the osd number 2023-03-27T07:22:34.166+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:34.168+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:34.168+0000 7f5654d05640 0 ceph version 18.0.0-3182-g5e717292 (5e717292106ca2d310770101bfebb345837be8e1) reef (dev), process ceph-osd, pid 134194 2023-03-27T07:22:34.169+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled. 2023-03-27T07:22:34.169+0000 7f5654d05640 -1 must specify '-i #' where # is the osd number
2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:757: TEST_divergent_3: activate_osd td/divergent-priors -1 2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:845: activate_osd: local dir=td/divergent-priors 2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:846: activate_osd: shift 2023-03-27T07:22:24.574 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:847: activate_osd: local id=-1
# this is the same as case _2 above, except we enable pg autoscaling in order # to reproduce https://tracker.ceph.com/issues/41816 function TEST_divergent_3() { # ... # determine primary local divergent="$(ceph pg dump pgs --format=json | jq '.pg_stats[0].up_primary')" # ... activate_osd $dir $divergent
#7 Updated by Neha Ojha 8 months ago
This is where we got -1 from pg_stats
2023-03-27T07:21:12.611 INFO:tasks.workunit.client.0.smithi072.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3: jq '.pg_stats[0].up_primary' 2023-03-27T07:21:13.040 INFO:tasks.workunit.client.0.smithi072.stderr:dumped pgs 2023-03-27T07:21:13.050 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3: local divergent=-1 2023-03-27T07:21:13.051 INFO:tasks.workunit.client.0.smithi072.stdout:primary and soon to be divergent is -1
#9 Updated by Neha Ojha 8 months ago
Running the same command again on gibba gives a valid out, and now the PG is active+clean. So this probably has to do with pg state, so when the PG isn't active+clean we might end up with -1.
[root@gibba001 ~]# ceph pg dump pgs --format=json | jq '.pg_stats[0].up_primary' dumped pgs 48
#10 Updated by Laura Flores 6 months ago
/a/yuriw-2023-05-30_21:40:46-rados-wip-yuri10-testing-2023-05-30-1244-distro-default-smithi/7291436
lflores@teuthology:/a/yuriw-2023-05-30_21:40:46-rados-wip-yuri10-testing-2023-05-30-1244-distro-default-smithi/7291436$ find ./ | grep osd
./remote/smithi018/log/osd.0.log
./remote/smithi018/log/osd.1.log
./remote/smithi018/log/osd.2.log
./remote/smithi018/log/osd.-1.log
#11 Updated by Nitzan Mordechai 6 months ago
- Status changed from New to In Progress
- Assignee set to Nitzan Mordechai
since TEST_divergent_3 have pg_autoscale_mode on not all the pgs are clean+active when we pick the divergent osd which lead to:
2023-05-31T13:43:37.909 INFO:tasks.workunit.client.0.smithi018.stdout:primary and soon to be divergent is -1
we need to pick osd between 0-2
#12 Updated by Nitzan Mordechai 6 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 51909
#13 Updated by Matan Breizman 4 months ago
- Backport set to reef
/a/yuriw-2023-07-18_14:02:17-rados-wip-yuri6-testing-2023-07-17-0838-reef-distro-default-smithi/7342962
51909 may be needed in R as well?
#14 Updated by Aishwarya Mathuria 4 months ago
/a/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355532
#15 Updated by Radoslaw Zarzynski 4 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from reef to reef,quincy
#16 Updated by Backport Bot 4 months ago
- Copied to Backport #62251: reef: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3() added
#17 Updated by Backport Bot 4 months ago
- Copied to Backport #62252: quincy: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3() added
#18 Updated by Backport Bot 4 months ago
- Tags set to backport_processed