Bug #13986: TEST_scrub_snaps: timeout > 300 - Ceph - Ceph

Actions

Copy link

Bug #13986

closed

TEST_scrub_snaps: timeout > 300

Added by Loïc Dachary over 8 years ago. Updated over 8 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Loïc Dachary

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/9864/consoleFull

.../qa/workunits/ceph-helpers.sh:874: get_num_active_clean:  xmlstarlet sel -t -m '//pg_stat/state[(contains(.,'\''active'\'') and contains(.,'\''clean'\'') and not(contains(.,'\''stale'\'')))]' -v . -n
.../qa/workunits/ceph-helpers.sh:873: get_num_active_clean:  ceph --format xml pg dump pgs
../qa/workunits/ceph-helpers.sh:956: is_clean:  test 5 = 5
../qa/workunits/ceph-helpers.sh:1003: wait_for_clean:  return 0
./test/osd/osd-scrub-snaps.sh:153: TEST_scrub_snaps:  local pgid=2.0
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps:  pg_scrub 2.0
../qa/workunits/ceph-helpers.sh:1059: pg_scrub:  local pgid=2.0
.../qa/workunits/ceph-helpers.sh:1060: pg_scrub:  get_last_scrub_stamp 2.0
.../qa/workunits/ceph-helpers.sh:928: get_last_scrub_stamp:  local pgid=2.0
.../qa/workunits/ceph-helpers.sh:929: get_last_scrub_stamp:  ceph --format xml pg dump pgs
.../qa/workunits/ceph-helpers.sh:930: get_last_scrub_stamp:  xmlstarlet sel -t -m '//pg_stat[pgid='\''2.0'\'']/last_scrub_stamp' -v .
../qa/workunits/ceph-helpers.sh:1060: pg_scrub:  local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1061: pg_scrub:  ceph pg scrub 2.0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 2.0 on osd.0 to scrub
../qa/workunits/ceph-helpers.sh:1062: pg_scrub:  wait_for_scrub 2.0 '2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1142: wait_for_scrub:  local pgid=2.0
../qa/workunits/ceph-helpers.sh:1143: wait_for_scrub:  local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i=0 ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i < 300 ))
.../qa/workunits/ceph-helpers.sh:1146: wait_for_scrub:  get_last_scrub_stamp 2.0
...
../qa/workunits/ceph-helpers.sh:1149: wait_for_scrub:  sleep 1
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i++ ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i < 300 ))
../qa/workunits/ceph-helpers.sh:1151: wait_for_scrub:  return 1
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps:  return 1
./test/osd/osd-scrub-snaps.sh:30: run:  return 1
../qa/workunits/ceph-helpers.sh:1242: main:  code=1
../qa/workunits/ceph-helpers.sh:1244: main:  teardown testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:117: teardown:  local dir=testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:118: teardown:  kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons:  local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons:  return 0
.../qa/workunits/ceph-helpers.sh:119: teardown:  stat -f -c %T .
../qa/workunits/ceph-helpers.sh:119: teardown:  '[' ext2/ext3 == btrfs ']'
../qa/workunits/ceph-helpers.sh:122: teardown:  rm -fr testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:1245: main:  return 1

Files

Download all files

test.out.gz (115 KB) test.out.gz		Loïc Dachary, 12/08/2015 04:56 PM
test2.out.gz (114 KB) test2.out.gz		Loïc Dachary, 12/09/2015 01:07 AM

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04&&x86_64/9932/consoleFull

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/9966/console

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

https://github.com/ceph/ceph/pull/6857 to collect more info about the possible race

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

Running

while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done

on rex003

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

Status changed from Need More Info to Duplicate

#13988

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

Is duplicate of Bug #13988: new OSD re-using old OSD id fails to boot added

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

Is duplicate of deleted (Bug #13988: new OSD re-using old OSD id fails to boot)

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

File test.out.gz test.out.gz added

Actions

Copy link

Updated by Loïc Dachary over 8 years ago

Status changed from Duplicate to Fix Under Review
Assignee set to Loïc Dachary

Actions

Copy link

#10

Updated by Loïc Dachary over 8 years ago

https://github.com/ceph/ceph/pull/6860

Actions

Copy link

#11

Updated by Loïc Dachary over 8 years ago

The relevant lines of the osd logs are:

2015-12-08 14:45:53.349784 7f3a6c534700 20 osd.0 19 scrub_load_below_threshold loadavg 32.23 >= max 0.5 and ( >= daily_loadavg 13.3244 or >= 15m avg 13.39) = no
2015-12-08 14:45:53.349806 7f3a6c534700 20 osd.0 19 sched_scrub load_is_low=0
2015-12-08 14:45:53.349809 7f3a6c534700 10 osd.0 19 sched_scrub 2.0 schedued at 2015-12-09 19:55:20.323080 > 2015-12-08 14:45:53.349745

Actions

Copy link

#12

Updated by Loïc Dachary over 8 years ago

File test2.out.gz test2.out.gz added

Actions

Copy link

#13

Updated by Loïc Dachary over 8 years ago

crash with monitor logs and --osd-scrub-load-threshold=2000 http://dachary.org/loic/crash.gz

Actions

Copy link

#14

Updated by Loïc Dachary over 8 years ago

Related to Bug #14027: tests: set --osd-scrub-load-threshold=2000 added

Actions

Copy link

#15

Updated by Loïc Dachary over 8 years ago

Status changed from Fix Under Review to In Progress

Actions

Copy link

#16

Updated by Loïc Dachary over 8 years ago

To reproduce the problem:

in the background run

while : ; do echo -n . ; ./run-make-check.sh > run.out 2>&1 ; done

on master, after a ./run-make-check.sh

cd src ; while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done

Using the commit from https://github.com/ceph/ceph/pull/6860 to display the monitor and osd log, test.out will have all the necessary diagnostic information.

Actions

Copy link

#17

Updated by Loïc Dachary over 8 years ago

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/10041/console

Actions

Copy link

#18

Updated by Loïc Dachary over 8 years ago

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/10048/console

it has a new kind of error log. The kill_daemon function returned on error, meaning it was unable to kill the daemons before the timeout (~3 minutes). I've never seen this but it's not impossible if the machine is heavily overloaded.

./qa/workunits/ceph-helpers.sh:1147: wait_for_scrub:  test '2015-12-10 11:22:18.817416' '!=' '2015-12-10 11:23:06.796504'
../qa/workunits/ceph-helpers.sh:1148: wait_for_scrub:  return 0
./test/osd/osd-scrub-snaps.sh:158: TEST_scrub_snaps:  grep log_channel testdir/osd-scrub-snaps/osd.0.log
2015-12-10 11:23:06.791343 7f7b7e136700  0 log_channel(cluster) log [INF] : 2.0 scrub starts
2015-12-10 11:23:06.794271 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/2acecc8b/obj10/1 is missing in clone_overlap
2015-12-10 11:23:06.794328 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 no '_' attr
2015-12-10 11:23:06.794336 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 is an unexpected clone
2015-12-10 11:23:06.794377 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/4 on disk size (4608) does not match object info size (512) adjusted for ondisk to (512)
2015-12-10 11:23:06.794443 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/2
2015-12-10 11:23:06.794449 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/1
2015-12-10 11:23:06.794454 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/666934a3/obj5/head 2 missing clone(s)
2015-12-10 11:23:06.794481 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/d3a9faf5/obj12/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794637 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/8df7eaa5/obj8/head snaps.seq not set
2015-12-10 11:23:06.794717 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794741 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/1 is an unexpected clone
2015-12-10 11:23:06.794750 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/61f68bb1/obj3/head on disk size (3840) does not match object info size (768) adjusted for ondisk to (768)
2015-12-10 11:23:06.794900 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/83425cc4/obj6/1 is an unexpected clone
2015-12-10 11:23:06.794928 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/3f1ee208/obj2/snapdir no 'snapset' attr
2015-12-10 11:23:06.794948 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/7 clone ignored due to missing snapset
2015-12-10 11:23:06.794967 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/4 clone ignored due to missing snapset
2015-12-10 11:23:06.795017 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/a8759770/obj4/snapdir expected clone 2/a8759770/obj4/7
2015-12-10 11:23:06.795023 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/a8759770/obj4/snapdir 1 missing clone(s)
2015-12-10 11:23:06.796212 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/6cf8deff/obj1/1 is an unexpected clone
2015-12-10 11:23:06.796283 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/e478ac7f/obj9/1 is missing in clone_size
2015-12-10 11:23:06.796337 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/29547577/obj11/1 is an unexpected clone
2015-12-10 11:23:06.796412 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/94122507/obj14/1 size 1032 != clone_size 1033
2015-12-10 11:23:06.796482 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub stat mismatch, got 21/31 objects, 10/19 clones, 21/31 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7681/13064 bytes, 0/0 hit_set_archive bytes.
2015-12-10 11:23:06.796500 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub 19 errors
../test/osd/osd-scrub-snaps.sh:34: TEST_scrub_snaps:  seq 1 7
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap1
removed pool test snap snap1
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap2
removed pool test snap snap2
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap3
removed pool test snap snap3
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap4
removed pool test snap snap4
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap5
removed pool test snap snap5
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap6
removed pool test snap snap6
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap7
removed pool test snap snap7
./test/osd/osd-scrub-snaps.sh:165: TEST_scrub_snaps:  ERRORS=0
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  find testdir/osd-scrub-snaps
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  grep '[^/]*\.pid'
./test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  pidfile='testdir/osd-scrub-snaps/mon.a.pid
testdir/osd-scrub-snaps/osd.0.pid'
../test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps:  cat testdir/osd-scrub-snaps/mon.a.pid testdir/osd-scrub-snaps/osd.0.pid
./test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps:  pid='32405
16134'
./test/osd/osd-scrub-snaps.sh:169: TEST_scrub_snaps:  kill -0 32405 16134
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps:  kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons:  local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons:  return 1
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps:  return 1

Actions

Copy link

#19