Project

General

Profile

Actions

Bug #56034

closed

qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3()

Added by Sridhar Seshasayee almost 2 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
Normal
Category:
Tests
Target version:
% Done:

100%

Source:
Tags:
backport_processed
Backport:
reef,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-06-13_16:36:31-rados-wip-yuri7-testing-2022-06-13-0706-distro-default-smithi/6876516

Also historically seen in:
/a/yuriw-2022-03-18_00:42:20-rados-wip-yuri6-testing-2022-03-17-1547-distro-default-smithi/6743671

Failure Reason:

Command failed (workunit test osd/divergent-priors.sh) on smithi121 with status 1: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=8f8ee34a603512f9baef4f6eff0e1fc5deb2cda4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh'


Logs of Interest:

During the beginning of test TEST_divergent_3(), after the pool 'test' is created and the 'wait_for_clean' stage, the 'pg dump' output shows a bunch of PGs in unknown state with 'up_primary' field is set to -1. The current test code just picks the first PG from the pg dump output which turns out to be a PG in 'unknown' state and with 'up_primary' set to -1. The test subsequently fails because of this invalid osd id (-1). The logic in the test perhaps needs to pick the correct PG in the active+clean state i.e. pg 1.0 in this case. See logs below:

2022-06-13T17:15:18.882 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1667: wait_for_clean:  get_num_pgs
2022-06-13T17:15:18.883 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs:  ceph --format json status
2022-06-13T17:15:18.883 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs:  jq .pgmap.num_pgs
2022-06-13T17:15:19.392 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1667: wait_for_clean:  test 1 == 0
2022-06-13T17:15:19.393 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1671: wait_for_clean:  true
2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1675: wait_for_clean:  get_num_active_clean
2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1382: get_num_active_clean:  local expression
2022-06-13T17:15:19.394 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1383: get_num_active_clean:  expression+='select(contains("active") and contains("clean")) | '
2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1384: get_num_active_clean:  expression+='select(contains("stale") | not)'
2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1385: get_num_active_clean:  ceph --format json pg dump pgs
2022-06-13T17:15:19.395 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1386: get_num_active_clean:  jq '.pg_stats | [.[] | .state | select(contains("active") and contains("clean")) | select(contains("stale") | not)] | length'
2022-06-13T17:15:19.820 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1675: wait_for_clean:  cur_active_clean=1
2022-06-13T17:15:19.822 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean:  get_num_pgs
2022-06-13T17:15:19.823 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs:  ceph --format json status
2022-06-13T17:15:19.823 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1443: get_num_pgs:  jq .pgmap.num_pgs
2022-06-13T17:15:20.348 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean:  test 1 = 1
2022-06-13T17:15:20.349 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1676: wait_for_clean:  break
2022-06-13T17:15:20.349 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1691: wait_for_clean:  return 0
2022-06-13T17:15:20.351 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3:  ceph pg dump pgs --format=json
2022-06-13T17:15:20.351 INFO:tasks.workunit.client.0.smithi121.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3:  jq '.pg_stats[0].up_primary'
2022-06-13T17:15:20.738 INFO:tasks.workunit.client.0.smithi121.stderr:dumped pgs
2022-06-13T17:15:20.747 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3:  local divergent=-1
2022-06-13T17:15:20.748 INFO:tasks.workunit.client.0.smithi121.stdout:primary and soon to be divergent is -1
2022-06-13T17:15:20.748 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:658: TEST_divergent_3:  echo 'primary and soon to be divergent is -1'
2022-06-13T17:15:20.749 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:659: TEST_divergent_3:  ceph pg dump pgs
2022-06-13T17:15:21.142 INFO:tasks.workunit.client.0.smithi121.stdout:PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE         STATE_STAMP                      VERSION  REPORTED  UP       UP_PRIMARY  ACTING   ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP                      LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN  LAST_SCRUB_DURATION  SCRUB_SCHEDULING                                            OBJECTS_SCRUBBED  OBJECTS_TRIMMED
2022-06-13T17:15:21.142 INFO:tasks.workunit.client.0.smithi121.stdout:1.1f           0                   0         0          0        0      0            0           0    0         0       unknown  2022-06-13T17:15:19.879154+0000      0'0       0:0       []          -1       []              -1         0'0  2022-06-13T17:15:19.879154+0000              0'0  2022-06-13T17:15:19.879154+0000              0                    0  --                                                                         0                0

...
...

2022-06-13T17:15:21.149 INFO:tasks.workunit.client.0.smithi121.stdout:1.0            0                   0         0          0        0      0            0           0    0         0  active+clean  2022-06-13T17:15:08.240142+0000      0'0     28:12  [1,0,2]           1  [1,0,2]               1         0'0  2022-06-13T17:15:07.094983+0000              0'0  2022-06-13T17:15:07.094983+0000              0                    0  periodic scrub scheduled @ 2022-06-15T01:14:09.779610+0000                 0                0
2022-06-13T17:15:21.149 INFO:tasks.workunit.client.0.smithi121.stdout:1.d            0                   0         0          0        0      0            0           0    0         0       unknown  2022-06-13T17:15:19.879154+0000      0'0       0:0       []          -1       []              -1         0'0  2022-06-13T17:15:19.879154+0000              0'0  2022-06-13T17:15:19.879154+0000              0                    0  --                                                                         0                0

...

Other Observations:

During the creation of pools in the tests, the logs indicate that the pool already exists:

2022-06-13T17:05:55.189 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:75: TEST_divergent:  create_pool test 1 1
2022-06-13T17:05:55.190 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:541: create_pool:  ceph osd pool create test 1 1
2022-06-13T17:05:55.858 INFO:tasks.workunit.client.0.smithi121.stderr:pool 'test' already exists
...

But at the end of the test when the same pool is attempted to be deleted as part of the clean-up, the test reports that the pool doesn't exist!

2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:221: TEST_divergent:  echo success
2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:223: TEST_divergent:  delete_pool test
2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:546: delete_pool:  local poolname=test
2022-06-13T17:08:10.273 INFO:tasks.workunit.client.0.smithi121.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:547: delete_pool:  ceph osd pool delete test test --yes-i-really-really-mean-it
2022-06-13T17:08:10.829 INFO:tasks.workunit.client.0.smithi121.stderr:pool 'test' does not exist
...

Related issues 2 (0 open2 closed)

Copied to RADOS - Backport #62251: reef: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3()ResolvedNitzan MordechaiActions
Copied to RADOS - Backport #62252: quincy: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3()ResolvedNitzan MordechaiActions
Actions #1

Updated by Laura Flores almost 2 years ago

Another detail to note is that this particular test has the pg autoscaler enabled, as opposed to TEST_divergent_2(), which is the same test with the autoscaler disabled. (see https://github.com/ceph/ceph/blob/390e2a0856d3f0870a8845d8d919d9e50ba4ff48/qa/standalone/osd/divergent-priors.sh#L625-L626)

Actions #2

Updated by Kamoltat (Junior) Sirivadhna almost 2 years ago

/a/yuriw-2022-06-29_13:30:16-rados-wip-yuri3-testing-2022-06-28-1737-distro-default-smithi/6905537

Actions #3

Updated by Radoslaw Zarzynski almost 2 years ago

This looks like a test failure, so nor terribly high priority.

Actions #4

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

/a/ksirivad-2022-12-15_06:28:05-rados-wip-ksirivad-testing-main-distro-default-smithi/7118004/

Actions #5

Updated by Laura Flores about 1 year ago

  • Translation missing: en.field_tag_list set to test-failure

/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934

Actions #6

Updated by Radoslaw Zarzynski about 1 year ago

The direct root cause of the failure is that we tried to start an OSD with ID set to -1:

rzarzynski@teuthology:/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934$ find ./ | grep osd
./remote/smithi072/log/osd.0.log
./remote/smithi072/log/osd.1.log
./remote/smithi072/log/osd.2.log
./remote/smithi072/log/osd.-1.log
rzarzynski@teuthology:/a/lflores-2023-03-27_02:17:31-rados-wip-aclamk-bs-elastic-shared-blob-save-25.03.2023-a-distro-default-smithi/7220934$ cat  ./remote/smithi072/log/osd.-1.log
2023-03-27T07:22:24.606+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:24.608+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:24.608+0000 7f7c6ff76640  0 ceph version 18.0.0-3182-g5e717292 (5e717292106ca2d310770101bfebb345837be8e1) reef (dev), process ceph-osd, pid 133908
2023-03-27T07:22:24.609+0000 7f7c6ff76640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:24.609+0000 7f7c6ff76640 -1 must specify '-i #' where # is the osd number
2023-03-27T07:22:34.166+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:34.168+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:34.168+0000 7f5654d05640  0 ceph version 18.0.0-3182-g5e717292 (5e717292106ca2d310770101bfebb345837be8e1) reef (dev), process ceph-osd, pid 134194
2023-03-27T07:22:34.169+0000 7f5654d05640 -1 WARNING: all dangerous and experimental features are enabled.
2023-03-27T07:22:34.169+0000 7f5654d05640 -1 must specify '-i #' where # is the osd number
2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:757: TEST_divergent_3:  activate_osd td/divergent-priors -1
2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:845: activate_osd:  local dir=td/divergent-priors
2023-03-27T07:22:24.573 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:846: activate_osd:  shift
2023-03-27T07:22:24.574 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:847: activate_osd:  local id=-1
# this is the same as case _2 above, except we enable pg autoscaling in order
# to reproduce https://tracker.ceph.com/issues/41816
function TEST_divergent_3() {
    # ...
    # determine primary
    local divergent="$(ceph pg dump pgs --format=json | jq '.pg_stats[0].up_primary')" 

    # ...
    activate_osd $dir $divergent
Actions #7

Updated by Neha Ojha about 1 year ago

This is where we got -1 from pg_stats

2023-03-27T07:21:12.611 INFO:tasks.workunit.client.0.smithi072.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3:  jq '.pg_stats[0].up_primary'
2023-03-27T07:21:13.040 INFO:tasks.workunit.client.0.smithi072.stderr:dumped pgs
2023-03-27T07:21:13.050 INFO:tasks.workunit.client.0.smithi072.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/osd/divergent-priors.sh:657: TEST_divergent_3:  local divergent=-1
2023-03-27T07:21:13.051 INFO:tasks.workunit.client.0.smithi072.stdout:primary and soon to be divergent is -1
Actions #8

Updated by Neha Ojha about 1 year ago

gibba also returns -1 for the same command

[root@gibba001 ~]# ceph pg dump pgs --format=json | jq '.pg_stats[0].up_primary'
dumped pgs
-1

but

"up": [
    59,
    20,
    24
  ],
  "acting": [
    59,
    20,
    24
  ],
...

"up_primary": 59,
  "acting_primary": 59,
Actions #9

Updated by Neha Ojha about 1 year ago

Running the same command again on gibba gives a valid out, and now the PG is active+clean. So this probably has to do with pg state, so when the PG isn't active+clean we might end up with -1.

[root@gibba001 ~]# ceph pg dump pgs --format=json |  jq '.pg_stats[0].up_primary'
dumped pgs
48
Actions #10

Updated by Laura Flores 11 months ago

/a/yuriw-2023-05-30_21:40:46-rados-wip-yuri10-testing-2023-05-30-1244-distro-default-smithi/7291436

lflores@teuthology:/a/yuriw-2023-05-30_21:40:46-rados-wip-yuri10-testing-2023-05-30-1244-distro-default-smithi/7291436$ find ./ | grep osd
./remote/smithi018/log/osd.0.log
./remote/smithi018/log/osd.1.log
./remote/smithi018/log/osd.2.log
./remote/smithi018/log/osd.-1.log
Actions #11

Updated by Nitzan Mordechai 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Nitzan Mordechai

since TEST_divergent_3 have pg_autoscale_mode on not all the pgs are clean+active when we pick the divergent osd which lead to:

2023-05-31T13:43:37.909 INFO:tasks.workunit.client.0.smithi018.stdout:primary and soon to be divergent is -1

we need to pick osd between 0-2

Actions #12

Updated by Nitzan Mordechai 11 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 51909
Actions #13

Updated by Matan Breizman 9 months ago

  • Backport set to reef

/a/yuriw-2023-07-18_14:02:17-rados-wip-yuri6-testing-2023-07-17-0838-reef-distro-default-smithi/7342962

51909 may be needed in R as well?

Actions #14

Updated by Aishwarya Mathuria 9 months ago

/a/yuriw-2023-07-28_14:25:29-rados-wip-yuri7-testing-2023-07-27-1336-quincy-distro-default-smithi/7355532

Actions #15

Updated by Radoslaw Zarzynski 9 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from reef to reef,quincy
Actions #16

Updated by Backport Bot 9 months ago

  • Copied to Backport #62251: reef: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3() added
Actions #17

Updated by Backport Bot 9 months ago

  • Copied to Backport #62252: quincy: qa/standalone/osd/divergent-priors.sh fails in test TEST_divergent_3() added
Actions #18

Updated by Backport Bot 9 months ago

  • Tags set to backport_processed
Actions #19

Updated by Konstantin Shalygin 3 months ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v19.0.0
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF