Bug #20465

Race seen between pool creation and wait_for_clean(): seen in test-erasure-eio.sh

Added by David Zafman over 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
06/29/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I suspect that if pool creation takes long enough that the initial PGs do not yet exist in the "creating" state, then calling wait_for_clean() returns immediately without actually waiting for the PGs to be created.

The error "Invalid poolpool-jerasure" means that injectdataerr could not find the pool named "pool-jerasure". The whole point of calling wait_for_clean() is to guarantee that the pool exists and that all of its PGs are ready.
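The failure mode can be sketched in a few lines of shell. Both counts that wait_for_clean() compares come from the same (possibly stale) pgmap, so if the new pool's PG has not yet appeared even in "creating" state, the totals only reflect the pre-existing pools and the comparison passes immediately. This is an illustrative sketch, not the actual ceph-helpers.sh code; the `pgmap_states` variable stands in for the states returned by `ceph pg dump`.

```shell
# Hypothetical pgmap snapshot: four PGs from pre-existing pools, all
# already active+clean. The new pool's PG is not listed yet at all.
pgmap_states="active+clean active+clean active+clean active+clean"

# Stand-in for get_num_pgs (counts every PG in the pgmap).
num_pgs() { set -- $pgmap_states; echo $#; }

# Stand-in for get_num_active_clean (counts PGs that are active+clean).
num_active_clean() {
    local n=0 s
    for s in $pgmap_states; do
        case $s in *active*clean*) n=$((n + 1));; esac
    done
    echo $n
}

# wait_for_clean's break condition: 4 == 4 holds even though
# pool-jerasure's PG does not exist yet, so the wait returns early.
test "$(num_active_clean)" = "$(num_pgs)" && echo "returns early"
```

Running this prints "returns early", mirroring the `test 4 = 4` / `break` sequence in the trace below.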

/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:261: TEST_rados_get_subread_eio_shard_1:  local poolname=pool-jerasure
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:262: TEST_rados_get_subread_eio_shard_1:  create_erasure_coded_pool pool-jerasure
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:56: create_erasure_coded_pool:  local poolname=pool-jerasure
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:58: create_erasure_coded_pool:  ceph osd erasure-code-profile set myprofile plugin=jerasure k=2 m=1 ruleset-failure-domain=osd
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:62: create_erasure_coded_pool:  ceph osd pool create pool-jerasure 1 1 erasure myprofile
pool 'pool-jerasure' created
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:64: create_erasure_coded_pool:  wait_for_clean
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1251: wait_for_clean:  local num_active_clean=-1
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1252: wait_for_clean:  local cur_active_clean
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1253: wait_for_clean:  delays=($(get_timeout_delays $TIMEOUT .1))
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1253: wait_for_clean:  get_timeout_delays 300 .1
///home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1207: get_timeout_delays:  shopt -q -o xtrace
///home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1207: get_timeout_delays:  echo true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1207: get_timeout_delays:  local trace=true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1208: get_timeout_delays:  true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1208: get_timeout_delays:  shopt -u -o xtrace
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1253: wait_for_clean:  local -a delays
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1254: wait_for_clean:  local -i loop=0
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1256: wait_for_clean:  get_num_pgs
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1121: get_num_pgs:  ceph --format json status
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1121: get_num_pgs:  jq .pgmap.num_pgs
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1256: wait_for_clean:  test 4 == 0
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1260: wait_for_clean:  true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1264: wait_for_clean:  get_num_active_clean
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1091: get_num_active_clean:  local expression
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1092: get_num_active_clean:  expression+='select(contains("active") and contains("clean")) | '
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1093: get_num_active_clean:  expression+='select(contains("stale") | not)'
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1094: get_num_active_clean:  ceph --format json pg dump pgs
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1095: get_num_active_clean:  jq '[.[] | .state | select(contains("active") and contains("clean")) | select(contains("stale") | not)] | length'
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1264: wait_for_clean:  cur_active_clean=4
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1265: wait_for_clean:  get_num_pgs
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1121: get_num_pgs:  ceph --format json status
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1121: get_num_pgs:  jq .pgmap.num_pgs
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1265: wait_for_clean:  test 4 = 4
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1265: wait_for_clean:  break
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:1278: wait_for_clean:  return 0
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:264: TEST_rados_get_subread_eio_shard_1:  local shard_id=1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:265: TEST_rados_get_subread_eio_shard_1:  rados_get_data_eio td/test-erasure-eio 1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:161: rados_get_data_eio:  local dir=td/test-erasure-eio
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:162: rados_get_data_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:163: rados_get_data_eio:  local shard_id=1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:164: rados_get_data_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:165: rados_get_data_eio:  local recovery=
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:166: rados_get_data_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:170: rados_get_data_eio:  local poolname=pool-jerasure
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:171: rados_get_data_eio:  local objname=obj-eio-32508-1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:172: rados_get_data_eio:  inject_eio obj-eio-32508-1 td/test-erasure-eio 1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:145: inject_eio:  local objname=obj-eio-32508-1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:146: inject_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:147: inject_eio:  local dir=td/test-erasure-eio
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:148: inject_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:149: inject_eio:  local shard_id=1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:150: inject_eio:  shift
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:152: inject_eio:  local poolname=pool-jerasure
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:153: inject_eio:  initial_osds=($(get_osds $poolname $objname))
//home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:153: inject_eio:  get_osds pool-jerasure obj-eio-32508-1
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:765: get_osds:  local poolname=pool-jerasure
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:766: get_osds:  local objectname=obj-eio-32508-1
///home/dzafman/ceph/qa/workunits/ceph-helpers.sh:769: get_osds:  ceph --format json osd map pool-jerasure obj-eio-32508-1
///home/dzafman/ceph/qa/workunits/ceph-helpers.sh:769: get_osds:  jq '.acting | .[]'
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:769: get_osds:  local 'osds=3
1
2'
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:771: get_osds:  echo 3 1 2
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:153: inject_eio:  local -a initial_osds
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:154: inject_eio:  local osd_id=1
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:155: inject_eio:  set_config osd 1 filestore_debug_inject_read_err true
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:902: set_config:  local daemon=osd
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:903: set_config:  local id=1
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:904: set_config:  local config=filestore_debug_inject_read_err
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:905: set_config:  local value=true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:909: set_config:  env CEPH_ARGS= ceph --format json daemon td/test-erasure-eio/ceph-osd.1.asok config set filestore_debug_inject_read_err true
//home/dzafman/ceph/qa/workunits/ceph-helpers.sh:909: set_config:  jq 'has("success")'
/home/dzafman/ceph/qa/workunits/ceph-helpers.sh:909: set_config:  test true == true
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:156: inject_eio:  CEPH_ARGS=
/home/dzafman/ceph/src/test/erasure-code/test-erasure-eio.sh:156: inject_eio:  ceph --admin-daemon td/test-erasure-eio/ceph-osd.1.asok injectdataerr pool-jerasure obj-eio-32508-1 1
Invalid poolpool-jerasure

Related issues

Related to Ceph - Bug #20921: mon SEGV PerfCounters::tinc() rados/standalone/scrub.yaml shutdown after TEST_scrub_snaps Resolved 08/05/2017
Duplicates RADOS - Bug #20784: rados/standalone/erasure-code.yaml failure Duplicate 07/26/2017
Copied to Ceph - Backport #20979: Luminous: Race seen between pool creation and wait_for_clean(): seen in test-erasure-eio.sh Resolved

History

#1 Updated by Sage Weil over 1 year ago

  • Priority changed from Normal to High

I think the wait function just needs to wait if there are unknown PGs, too.
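A stricter condition along the lines Sage suggests might look like the sketch below: besides requiring active+clean == total, it requires that the total matches the number of PGs the caller expects and that no PG is still unknown or creating. The function name, the expected-count parameter, and the state-matching patterns are illustrative assumptions, not the actual fix that landed in ceph-helpers.sh.

```shell
# Hypothetical helper: succeed only when exactly $expected PGs exist
# and every one of them is active+clean (none unknown/creating).
pgs_are_clean() {
    local expected=$1; shift
    local states="$*" s
    set -- $states
    test $# -eq "$expected" || return 1     # all expected PGs must exist
    for s in $states; do
        case $s in
            *unknown*|*creating*) return 1;;  # still settling: keep waiting
            *active*clean*) ;;                # this PG is ready
            *) return 1;;                     # any other state: keep waiting
        esac
    done
    return 0
}
```

With this shape, `pgs_are_clean 5 active+clean active+clean active+clean active+clean creating` fails (one PG still creating), and so does `pgs_are_clean 5 active+clean active+clean active+clean active+clean` (the fifth PG is missing entirely), closing the race seen in the trace.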

#2 Updated by David Zafman over 1 year ago

  • Duplicates Bug #20784: rados/standalone/erasure-code.yaml failure added

#3 Updated by David Zafman over 1 year ago

  • Status changed from New to Duplicate

#4 Updated by David Zafman over 1 year ago

  • Status changed from Duplicate to In Progress
  • Priority changed from High to Urgent

#5 Updated by David Zafman over 1 year ago

  • Status changed from In Progress to Pending Backport
  • Backport set to luminous

https://github.com/ceph/ceph/pull/16709

This pull request includes the fix for #20921. We should backport the entire branch (17 commits).

#6 Updated by David Zafman over 1 year ago

  • Related to Bug #20921: mon SEGV PerfCounters::tinc() rados/standalone/scrub.yaml shutdown after TEST_scrub_snaps added

#7 Updated by David Zafman over 1 year ago

  • Copied to Backport #20979: Luminous: Race seen between pool creation and wait_for_clean(): seen in test-erasure-eio.sh added

#8 Updated by Kefu Chai about 1 year ago

  • Status changed from Pending Backport to Resolved
