
Bug #57883

test-erasure-code.sh: TEST_rados_put_get_jerasure fails on "rados_put_get: grep '\<5\>'"

Added by Laura Flores over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2022-10-13_17:24:48-rados-main-distro-default-smithi/7065580

2022-10-14T10:49:22.699 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-code.sh:91: rados_put_get:  get_osds ecpool SOMETHING
2022-10-14T10:49:22.700 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1039: get_osds:  local poolname=ecpool
2022-10-14T10:49:22.700 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1040: get_osds:  local objectname=SOMETHING
2022-10-14T10:49:22.701 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-code.sh:91: rados_put_get:  grep '\<5\>'
2022-10-14T10:49:22.701 INFO:tasks.workunit.client.0.smithi187.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1043: get_osds:  ceph --format json osd map ecpool SOMETHING
2022-10-14T10:49:22.702 INFO:tasks.workunit.client.0.smithi187.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1043: get_osds:  jq '.acting | .[]'
2022-10-14T10:49:23.456 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1043: get_osds:  local 'osds=6
2022-10-14T10:49:23.457 INFO:tasks.workunit.client.0.smithi187.stderr:0
2022-10-14T10:49:23.457 INFO:tasks.workunit.client.0.smithi187.stderr:7
2022-10-14T10:49:23.457 INFO:tasks.workunit.client.0.smithi187.stderr:5'
2022-10-14T10:49:23.458 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1045: get_osds:  echo 6 0 7 5
2022-10-14T10:49:23.458 INFO:tasks.workunit.client.0.smithi187.stdout:6 0 7 5
2022-10-14T10:49:23.460 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-code.sh:91: rados_put_get:  return 1
2022-10-14T10:49:23.461 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-code.sh:210: TEST_rados_put_get_jerasure:  return 1

This test marks an OSD out and checks that it is no longer in the acting set. In this case, the OSD that was marked out was still in the acting set.
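
For reference, the check that failed is this part of rados_put_get() (the function is quoted in full in comment #3):

    local -a initial_osds=($(get_osds $poolname $objname))
    local last=$((${#initial_osds[@]} - 1))
    ceph osd out ${initial_osds[$last]} || return 1
    ! get_osds $poolname $objname | grep '\<'${initial_osds[$last]}'\>' || return 1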

osd.5 was marked out here in the mon.a log:

2022-10-14T10:49:22.045+0000 7fad73819700  5 --2- v2:127.0.0.1:7101/0 >> 127.0.0.1:0/3515117041 conn(0x560849b97800 0x5608490d4c00 crc :-1 s=READ_MESSAGE_COMPLETE pgs=1 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_message received message m=0x5608498f5680 seq=6 from=client.? type=50 mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1
2022-10-14T10:49:22.045+0000 7fad73819700 20 -- v2:127.0.0.1:7101/0 queue 0x5608498f5680 prio 127
2022-10-14T10:49:22.045+0000 7fad73819700 20 --2- v2:127.0.0.1:7101/0 >> 127.0.0.1:0/3515117041 conn(0x560849b97800 0x5608490d4c00 crc :-1 s=READY pgs=1 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).read_frame
2022-10-14T10:49:22.045+0000 7fad73819700 20 -- v2:127.0.0.1:7101/0 >> 127.0.0.1:0/3515117041 conn(0x560849b97800 msgr2=0x5608490d4c00 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read start len=32
2022-10-14T10:49:22.045+0000 7fad7581d700  1 -- v2:127.0.0.1:7101/0 <== client.? 127.0.0.1:0/3515117041 6 ==== mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1 ==== 77+0+0 (crc 0 0 0) 0x5608498f5680 con 0x560849b97800
2022-10-14T10:49:22.045+0000 7fad7581d700 20 mon.a@0(leader) e1 _ms_dispatch existing session 0x56084a17fb00 for client.?
2022-10-14T10:49:22.045+0000 7fad7581d700 20 mon.a@0(leader) e1  entity_name client.admin global_id 5021 (none) caps allow *
2022-10-14T10:49:22.045+0000 7fad7581d700  0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1
2022-10-14T10:49:22.045+0000 7fad7581d700 20 is_capable service=osd command=osd out read write addr 127.0.0.1:0/3515117041 on cap allow *
2022-10-14T10:49:22.045+0000 7fad7581d700 20  allow so far , doing grant allow *
2022-10-14T10:49:22.045+0000 7fad7581d700 20  allow all
2022-10-14T10:49:22.045+0000 7fad7581d700 10 mon.a@0(leader) e1 _allowed_command capable
2022-10-14T10:49:22.045+0000 7fad7581d700  0 log_channel(audit) log [INF] : from='client.? 127.0.0.1:0/3515117041' entity='client.admin' cmd=[{"prefix": "osd out", "ids": ["5"]}]: dispatch
2022-10-14T10:49:22.045+0000 7fad7581d700  1 -- v2:127.0.0.1:7101/0 --> v2:127.0.0.1:7101/0 -- log(1 entries from seq 1295 at 2022-10-14T10:49:22.050967+0000) v1 -- 0x560849a548c0 con 0x560847ca5800
2022-10-14T10:49:22.045+0000 7fad7581d700 20 -- v2:127.0.0.1:7101/0 >> v2:127.0.0.1:7101/0 conn(0x560847ca5800 msgr2=0x560847f7c800 unknown :-1 s=STATE_NONE l=0).send_message log(1 entries from seq 1295 at 2022-10-14T10:49:22.050967+0000) v1 local
2022-10-14T10:49:22.045+0000 7fad7501c700 20 -- v2:127.0.0.1:7101/0 queue 0x560849a548c0 prio 196
2022-10-14T10:49:22.045+0000 7fad7581d700 10 mon.a@0(leader).paxosservice(osdmap 1..126) dispatch 0x5608498f5680 mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1 from client.? 127.0.0.1:0/3515117041 con 0x560849b97800
2022-10-14T10:49:22.045+0000 7fad7581d700  5 mon.a@0(leader).paxos(paxos active c 754..1417) is_readable = 1 - now=2022-10-14T10:49:22.051040+0000 lease_expire=1970-01-01T00:00:00.000000+0000 has v0 lc 1417
2022-10-14T10:49:22.045+0000 7fad7581d700 10 mon.a@0(leader).osd e126 preprocess_query mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1 from client.? 127.0.0.1:0/3515117041
2022-10-14T10:49:22.045+0000 7fad7581d700  7 mon.a@0(leader).osd e126 prepare_update mon_command({"prefix": "osd out", "ids": ["5"]} v 0) v1 from client.? 127.0.0.1:0/3515117041
2022-10-14T10:49:22.045+0000 7fad7581d700 -1 mon.a@0(leader).osd e126 definitely_dead 0
2022-10-14T10:49:22.045+0000 7fad7581d700  0 log_channel(cluster) log [INF] : Client client.admin marked osd.5 out, while it was still marked up

When we marked osd.5 out in the script, we got:

2022-10-14T10:49:21.756 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/erasure-code/test-erasure-code.sh:90: rados_put_get:  ceph osd out 5
2022-10-14T10:49:22.685 INFO:tasks.workunit.client.0.smithi187.stderr:osd.5 is already out.

The acting set (which we later use to check whether the marked-out OSD is still present) was established here:

2022-10-14T10:49:23.458 INFO:tasks.workunit.client.0.smithi187.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:1045: get_osds:  echo 6 0 7 5
2022-10-14T10:49:23.458 INFO:tasks.workunit.client.0.smithi187.stdout:6 0 7 5


Related issues

Related to Ceph - Bug #15211: tests: test-erasure-code.sh failing sometimes during 'make check' on master (Resolved, 03/18/2016)

History

#1 Updated by Laura Flores over 1 year ago

  • Related to Bug #15211: tests: test-erasure-code.sh failing sometimes during 'make check' on master added

#2 Updated by Laura Flores over 1 year ago

  • Tags set to test-failure

#3 Updated by Laura Flores over 1 year ago

The failed function:

qa/standalone/erasure-code/test-erasure-code.sh

function rados_put_get() {
    local dir=$1
    local poolname=$2
    local objname=${3:-SOMETHING}

    for marker in AAA BBB CCCC DDDD ; do
        printf "%*s" 1024 $marker
    done > $dir/ORIGINAL

    #
    # get and put an object, compare they are equal
    #
    rados --pool $poolname put $objname $dir/ORIGINAL || return 1
    rados --pool $poolname get $objname $dir/COPY || return 1
    diff $dir/ORIGINAL $dir/COPY || return 1
    rm $dir/COPY

    #
    # take out an OSD used to store the object and
    # check the object can still be retrieved, which implies
    # recovery
    #
    local -a initial_osds=($(get_osds $poolname $objname))
    local last=$((${#initial_osds[@]} - 1))
    ceph osd out ${initial_osds[$last]} || return 1
    ! get_osds $poolname $objname | grep '\<'${initial_osds[$last]}'\>' || return 1
    rados --pool $poolname get $objname $dir/COPY || return 1
    diff $dir/ORIGINAL $dir/COPY || return 1
    ceph osd in ${initial_osds[$last]} || return 1

    rm $dir/ORIGINAL
}
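
To illustrate the failure mode (a hypothetical shell snippet, using the acting set reported in the log above): osd.5 was still listed in the acting set, so the word-boundary grep matches and the negated pipeline returns 1.

    # acting set reported by get_osds in the failing run
    echo "6 0 7 5" | grep '\<5\>'       # matches the standalone "5", exit status 0
    # hence the test's negated check fails:
    ! echo "6 0 7 5" | grep '\<5\>'     # exit status 1 -> rados_put_get returns 1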

#4 Updated by Laura Flores over 1 year ago

  • Assignee set to Laura Flores

#6 Updated by Radoslaw Zarzynski over 1 year ago

It looks like we asked for osd.5 to be marked out, got a confirmation that the command was handled by the mon, and then get_osds said 5 was still in the acting set.

function rados_put_get() {
    # ...
    ceph osd out ${initial_osds[$last]} || return 1
    ! get_osds $poolname $objname | grep '\<'${initial_osds[$last]}'\>' || return 1
    # ...
}

function get_osds() {
    local poolname=$1
    local objectname=$2

    local osds=$(ceph --format json osd map $poolname $objectname 2>/dev/null | \
        jq '.acting | .[]')
    # get rid of the trailing space
    echo $osds
}

The problem here might be that get_osds was using an older osdmap. Actually, I'm not sure the osd out command waits for (synchronizes with) propagation of the new map. If not, this would be a test issue; a loop ensuring the new osdmap is used would be needed.
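
A minimal sketch of the kind of retry loop suggested here, reusing the test's existing get_osds helper; the wait_for_osd_out_of_acting name, the 60-iteration bound, and the 1-second sleep are made up for illustration and are not necessarily what the linked fix does:

function wait_for_osd_out_of_acting() {
    local poolname=$1
    local objname=$2
    local osd=$3
    # retry until the marked-out OSD disappears from the acting set,
    # giving the updated osdmap time to propagate (assumed bound: ~60s)
    for ((i = 0; i < 60; i++)) ; do
        if ! get_osds $poolname $objname | grep -q '\<'$osd'\>' ; then
            return 0
        fi
        sleep 1
    done
    return 1
}

# in rados_put_get(), the single check could then become:
#   wait_for_osd_out_of_acting $poolname $objname ${initial_osds[$last]} || return 1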

#7 Updated by Laura Flores over 1 year ago

  • Status changed from New to In Progress

#8 Updated by Laura Flores over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 48619

#9 Updated by Laura Flores over 1 year ago

  • Status changed from Fix Under Review to Resolved
