Bug #13986
closed
TEST_scrub_snaps: timeout > 300
Added by Loïc Dachary over 8 years ago.
Updated over 8 years ago.
Description
http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/9864/consoleFull
.../qa/workunits/ceph-helpers.sh:874: get_num_active_clean: xmlstarlet sel -t -m '//pg_stat/state[(contains(.,'\''active'\'') and contains(.,'\''clean'\'') and not(contains(.,'\''stale'\'')))]' -v . -n
.../qa/workunits/ceph-helpers.sh:873: get_num_active_clean: ceph --format xml pg dump pgs
../qa/workunits/ceph-helpers.sh:956: is_clean: test 5 = 5
../qa/workunits/ceph-helpers.sh:1003: wait_for_clean: return 0
./test/osd/osd-scrub-snaps.sh:153: TEST_scrub_snaps: local pgid=2.0
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps: pg_scrub 2.0
../qa/workunits/ceph-helpers.sh:1059: pg_scrub: local pgid=2.0
.../qa/workunits/ceph-helpers.sh:1060: pg_scrub: get_last_scrub_stamp 2.0
.../qa/workunits/ceph-helpers.sh:928: get_last_scrub_stamp: local pgid=2.0
.../qa/workunits/ceph-helpers.sh:929: get_last_scrub_stamp: ceph --format xml pg dump pgs
.../qa/workunits/ceph-helpers.sh:930: get_last_scrub_stamp: xmlstarlet sel -t -m '//pg_stat[pgid='\''2.0'\'']/last_scrub_stamp' -v .
../qa/workunits/ceph-helpers.sh:1060: pg_scrub: local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1061: pg_scrub: ceph pg scrub 2.0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 2.0 on osd.0 to scrub
../qa/workunits/ceph-helpers.sh:1062: pg_scrub: wait_for_scrub 2.0 '2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1142: wait_for_scrub: local pgid=2.0
../qa/workunits/ceph-helpers.sh:1143: wait_for_scrub: local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i=0 ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i < 300 ))
.../qa/workunits/ceph-helpers.sh:1146: wait_for_scrub: get_last_scrub_stamp 2.0
...
../qa/workunits/ceph-helpers.sh:1149: wait_for_scrub: sleep 1
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i++ ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i < 300 ))
../qa/workunits/ceph-helpers.sh:1151: wait_for_scrub: return 1
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps: return 1
./test/osd/osd-scrub-snaps.sh:30: run: return 1
../qa/workunits/ceph-helpers.sh:1242: main: code=1
../qa/workunits/ceph-helpers.sh:1244: main: teardown testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:117: teardown: local dir=testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:118: teardown: kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons: local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons: return 0
.../qa/workunits/ceph-helpers.sh:119: teardown: stat -f -c %T .
../qa/workunits/ceph-helpers.sh:119: teardown: '[' ext2/ext3 == btrfs ']'
../qa/workunits/ceph-helpers.sh:122: teardown: rm -fr testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:1245: main: return 1
Files
Running
while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done
on rex003
- Status changed from Need More Info to Duplicate
- Is duplicate of Bug #13988: new OSD re-using old OSD id fails to boot added
- Is duplicate of deleted (Bug #13988: new OSD re-using old OSD id fails to boot)
- Status changed from Duplicate to Fix Under Review
- Assignee set to Loïc Dachary
The relevant lines of the osd logs are:
2015-12-08 14:45:53.349784 7f3a6c534700 20 osd.0 19 scrub_load_below_threshold loadavg 32.23 >= max 0.5 and ( >= daily_loadavg 13.3244 or >= 15m avg 13.39) = no
2015-12-08 14:45:53.349806 7f3a6c534700 20 osd.0 19 sched_scrub load_is_low=0
2015-12-08 14:45:53.349809 7f3a6c534700 10 osd.0 19 sched_scrub 2.0 schedued at 2015-12-09 19:55:20.323080 > 2015-12-08 14:45:53.349745
- Related to Bug #14027: tests: set --osd-scrub-load-threshold=2000 added
- Status changed from Fix Under Review to In Progress
http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/10048/console
it has a new kind of error log. The kill_daemon function returned on error, meaning it was unable to kill the daemons before the timeout (~3 minutes). I've never seen this but it's not impossible if the machine is heavily overloaded.
./qa/workunits/ceph-helpers.sh:1147: wait_for_scrub: test '2015-12-10 11:22:18.817416' '!=' '2015-12-10 11:23:06.796504'
../qa/workunits/ceph-helpers.sh:1148: wait_for_scrub: return 0
./test/osd/osd-scrub-snaps.sh:158: TEST_scrub_snaps: grep log_channel testdir/osd-scrub-snaps/osd.0.log
2015-12-10 11:23:06.791343 7f7b7e136700 0 log_channel(cluster) log [INF] : 2.0 scrub starts
2015-12-10 11:23:06.794271 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/2acecc8b/obj10/1 is missing in clone_overlap
2015-12-10 11:23:06.794328 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 no '_' attr
2015-12-10 11:23:06.794336 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 is an unexpected clone
2015-12-10 11:23:06.794377 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/4 on disk size (4608) does not match object info size (512) adjusted for ondisk to (512)
2015-12-10 11:23:06.794443 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/2
2015-12-10 11:23:06.794449 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/1
2015-12-10 11:23:06.794454 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/666934a3/obj5/head 2 missing clone(s)
2015-12-10 11:23:06.794481 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/d3a9faf5/obj12/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794637 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/8df7eaa5/obj8/head snaps.seq not set
2015-12-10 11:23:06.794717 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794741 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/1 is an unexpected clone
2015-12-10 11:23:06.794750 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/61f68bb1/obj3/head on disk size (3840) does not match object info size (768) adjusted for ondisk to (768)
2015-12-10 11:23:06.794900 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/83425cc4/obj6/1 is an unexpected clone
2015-12-10 11:23:06.794928 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/3f1ee208/obj2/snapdir no 'snapset' attr
2015-12-10 11:23:06.794948 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/7 clone ignored due to missing snapset
2015-12-10 11:23:06.794967 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/4 clone ignored due to missing snapset
2015-12-10 11:23:06.795017 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/a8759770/obj4/snapdir expected clone 2/a8759770/obj4/7
2015-12-10 11:23:06.795023 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/a8759770/obj4/snapdir 1 missing clone(s)
2015-12-10 11:23:06.796212 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/6cf8deff/obj1/1 is an unexpected clone
2015-12-10 11:23:06.796283 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/e478ac7f/obj9/1 is missing in clone_size
2015-12-10 11:23:06.796337 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/29547577/obj11/1 is an unexpected clone
2015-12-10 11:23:06.796412 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/94122507/obj14/1 size 1032 != clone_size 1033
2015-12-10 11:23:06.796482 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub stat mismatch, got 21/31 objects, 10/19 clones, 21/31 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7681/13064 bytes, 0/0 hit_set_archive bytes.
2015-12-10 11:23:06.796500 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub 19 errors
../test/osd/osd-scrub-snaps.sh:34: TEST_scrub_snaps: seq 1 7
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap1
removed pool test snap snap1
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap2
removed pool test snap snap2
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap3
removed pool test snap snap3
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap4
removed pool test snap snap4
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap5
removed pool test snap snap5
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap6
removed pool test snap snap6
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap7
removed pool test snap snap7
./test/osd/osd-scrub-snaps.sh:165: TEST_scrub_snaps: ERRORS=0
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: find testdir/osd-scrub-snaps
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: grep '[^/]*\.pid'
./test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: pidfile='testdir/osd-scrub-snaps/mon.a.pid
testdir/osd-scrub-snaps/osd.0.pid'
../test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps: cat testdir/osd-scrub-snaps/mon.a.pid testdir/osd-scrub-snaps/osd.0.pid
./test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps: pid='32405
16134'
./test/osd/osd-scrub-snaps.sh:169: TEST_scrub_snaps: kill -0 32405 16134
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps: kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons: local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons: return 1
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps: return 1
@david do you have an opinion on the error log above ? It's the first time it happens.
- Status changed from In Progress to Resolved
Also available in: Atom
PDF