Bug #13986 (Closed)
TEST_scrub_snaps: timeout > 300
Description
http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/9864/consoleFull
.../qa/workunits/ceph-helpers.sh:874: get_num_active_clean: xmlstarlet sel -t -m '//pg_stat/state[(contains(.,'\''active'\'') and contains(.,'\''clean'\'') and not(contains(.,'\''stale'\'')))]' -v . -n
.../qa/workunits/ceph-helpers.sh:873: get_num_active_clean: ceph --format xml pg dump pgs
../qa/workunits/ceph-helpers.sh:956: is_clean: test 5 = 5
../qa/workunits/ceph-helpers.sh:1003: wait_for_clean: return 0
./test/osd/osd-scrub-snaps.sh:153: TEST_scrub_snaps: local pgid=2.0
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps: pg_scrub 2.0
../qa/workunits/ceph-helpers.sh:1059: pg_scrub: local pgid=2.0
.../qa/workunits/ceph-helpers.sh:1060: pg_scrub: get_last_scrub_stamp 2.0
.../qa/workunits/ceph-helpers.sh:928: get_last_scrub_stamp: local pgid=2.0
.../qa/workunits/ceph-helpers.sh:929: get_last_scrub_stamp: ceph --format xml pg dump pgs
.../qa/workunits/ceph-helpers.sh:930: get_last_scrub_stamp: xmlstarlet sel -t -m '//pg_stat[pgid='\''2.0'\'']/last_scrub_stamp' -v .
../qa/workunits/ceph-helpers.sh:1060: pg_scrub: local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1061: pg_scrub: ceph pg scrub 2.0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 2.0 on osd.0 to scrub
../qa/workunits/ceph-helpers.sh:1062: pg_scrub: wait_for_scrub 2.0 '2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1142: wait_for_scrub: local pgid=2.0
../qa/workunits/ceph-helpers.sh:1143: wait_for_scrub: local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i=0 ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i < 300 ))
.../qa/workunits/ceph-helpers.sh:1146: wait_for_scrub: get_last_scrub_stamp 2.0
...
../qa/workunits/ceph-helpers.sh:1149: wait_for_scrub: sleep 1
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i++ ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub: (( i < 300 ))
../qa/workunits/ceph-helpers.sh:1151: wait_for_scrub: return 1
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps: return 1
./test/osd/osd-scrub-snaps.sh:30: run: return 1
../qa/workunits/ceph-helpers.sh:1242: main: code=1
../qa/workunits/ceph-helpers.sh:1244: main: teardown testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:117: teardown: local dir=testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:118: teardown: kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons: local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons: return 0
.../qa/workunits/ceph-helpers.sh:119: teardown: stat -f -c %T .
../qa/workunits/ceph-helpers.sh:119: teardown: '[' ext2/ext3 == btrfs ']'
../qa/workunits/ceph-helpers.sh:122: teardown: rm -fr testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:1245: main: return 1
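For reference, the polling logic visible in this trace boils down to the following (a minimal sketch reconstructed from the xtrace above; the real helpers live in qa/workunits/ceph-helpers.sh):

# Sketch of the ceph-helpers.sh polling helpers, reconstructed
# from the xtrace above; not the verbatim source.
get_last_scrub_stamp() {
    local pgid=$1
    ceph --format xml pg dump pgs | \
        xmlstarlet sel -t -m "//pg_stat[pgid='$pgid']/last_scrub_stamp" -v .
}

wait_for_scrub() {
    local pgid=$1
    local last_scrub="$2"
    for ((i=0; i < 300; i++)); do
        # done as soon as the scrub stamp moves past the one recorded
        # before "ceph pg scrub" was issued
        test "$(get_last_scrub_stamp $pgid)" != "$last_scrub" && return 0
        sleep 1
    done
    return 1  # timeout > 300: the failure reported by this bug
}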
Updated by Loïc Dachary over 8 years ago
Opened https://github.com/ceph/ceph/pull/6857 to collect more info about the possible race.
Updated by Loïc Dachary over 8 years ago
Running the following on rex003:
while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done
Updated by Loïc Dachary over 8 years ago
- Status changed from Need More Info to Duplicate
Updated by Loïc Dachary over 8 years ago
- Is duplicate of Bug #13988: new OSD re-using old OSD id fails to boot added
Updated by Loïc Dachary over 8 years ago
- Is duplicate of deleted (Bug #13988: new OSD re-using old OSD id fails to boot)
Updated by Loïc Dachary over 8 years ago
- Status changed from Duplicate to Fix Under Review
- Assignee set to Loïc Dachary
Updated by Loïc Dachary over 8 years ago
The relevant lines of the osd logs are:
2015-12-08 14:45:53.349784 7f3a6c534700 20 osd.0 19 scrub_load_below_threshold loadavg 32.23 >= max 0.5 and ( >= daily_loadavg 13.3244 or >= 15m avg 13.39) = no
2015-12-08 14:45:53.349806 7f3a6c534700 20 osd.0 19 sched_scrub load_is_low=0
2015-12-08 14:45:53.349809 7f3a6c534700 10 osd.0 19 sched_scrub 2.0 schedued at 2015-12-09 19:55:20.323080 > 2015-12-08 14:45:53.349745
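Decoded, the first line says the load only counts as low when the 1-minute loadavg is below the threshold (max 0.5), or below both the daily loadavg and the 15-minute loadavg. A minimal sketch of that decision, paraphrased from the log message (not the actual OSD C++ code; the helper name is illustrative):

# Illustrative re-statement of the condition in the log line above.
# Arguments: 1m loadavg, 15m loadavg, daily loadavg, threshold.
scrub_load_is_low() {
    local load1=$1 load15=$2 daily=$3 max=${4:-0.5}
    # low if under the configured threshold ...
    (( $(echo "$load1 < $max" | bc -l) )) && return 0
    # ... or under both the daily and the 15-minute averages
    (( $(echo "$load1 < $daily && $load1 < $load15" | bc -l) )) && return 0
    return 1
}
# With the values from the log: not low, so the scrub keeps its
# scheduled time (2015-12-09) and wait_for_scrub times out after 300s.
scrub_load_is_low 32.23 13.39 13.3244 0.5 || echo "load_is_low=0"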
Updated by Loïc Dachary over 8 years ago
Crash with monitor logs and --osd-scrub-load-threshold=2000: http://dachary.org/loic/crash.gz
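For the record, raising the threshold so the load check never delays scrubbing during the test would look something like this (hypothetical invocation, assuming run_osd forwards extra options to ceph-osd as the other ceph-helpers.sh based tests do):

run_osd $dir 0 --osd-scrub-load-threshold=2000 || return 1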
Updated by Loïc Dachary over 8 years ago
- Related to Bug #14027: tests: set --osd-scrub-load-threshold=2000 added
Updated by Loïc Dachary over 8 years ago
- Status changed from Fix Under Review to In Progress
Updated by Loïc Dachary over 8 years ago
To reproduce the problem:
- in the background, run
while : ; do echo -n . ; ./run-make-check.sh > run.out 2>&1 ; done
- on master, after a ./run-make-check.sh, run
cd src ; while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done
With the commit from https://github.com/ceph/ceph/pull/6860, which displays the monitor and osd logs, test.out will contain all the necessary diagnostic information.
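Put together, a convenience wrapper around the two loops above (illustrative, not part of the tree):

#!/bin/bash
# Generate load in the background so the scrub scheduling path is exercised.
( while : ; do ./run-make-check.sh > run.out 2>&1 ; done ) &
load_pid=$!
# Loop the test until the first failure; test.out then holds the diagnostics.
cd src
while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done
kill $load_pid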
Updated by Loïc Dachary over 8 years ago
http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/10048/console
It shows a new kind of error: the kill_daemons function returned an error, meaning it was unable to kill the daemons before the timeout (~3 minutes). I've never seen this before, but it is not impossible if the machine is heavily overloaded.
./qa/workunits/ceph-helpers.sh:1147: wait_for_scrub: test '2015-12-10 11:22:18.817416' '!=' '2015-12-10 11:23:06.796504'
../qa/workunits/ceph-helpers.sh:1148: wait_for_scrub: return 0
./test/osd/osd-scrub-snaps.sh:158: TEST_scrub_snaps: grep log_channel testdir/osd-scrub-snaps/osd.0.log
2015-12-10 11:23:06.791343 7f7b7e136700 0 log_channel(cluster) log [INF] : 2.0 scrub starts
2015-12-10 11:23:06.794271 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/2acecc8b/obj10/1 is missing in clone_overlap
2015-12-10 11:23:06.794328 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 no '_' attr
2015-12-10 11:23:06.794336 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 is an unexpected clone
2015-12-10 11:23:06.794377 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/4 on disk size (4608) does not match object info size (512) adjusted for ondisk to (512)
2015-12-10 11:23:06.794443 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/2
2015-12-10 11:23:06.794449 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/1
2015-12-10 11:23:06.794454 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/666934a3/obj5/head 2 missing clone(s)
2015-12-10 11:23:06.794481 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/d3a9faf5/obj12/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794637 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/8df7eaa5/obj8/head snaps.seq not set
2015-12-10 11:23:06.794717 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794741 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/1 is an unexpected clone
2015-12-10 11:23:06.794750 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/61f68bb1/obj3/head on disk size (3840) does not match object info size (768) adjusted for ondisk to (768)
2015-12-10 11:23:06.794900 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/83425cc4/obj6/1 is an unexpected clone
2015-12-10 11:23:06.794928 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/3f1ee208/obj2/snapdir no 'snapset' attr
2015-12-10 11:23:06.794948 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/7 clone ignored due to missing snapset
2015-12-10 11:23:06.794967 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/4 clone ignored due to missing snapset
2015-12-10 11:23:06.795017 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/a8759770/obj4/snapdir expected clone 2/a8759770/obj4/7
2015-12-10 11:23:06.795023 7f7b7e136700 0 log_channel(cluster) log [INF] : scrub 2.0 2/a8759770/obj4/snapdir 1 missing clone(s)
2015-12-10 11:23:06.796212 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/6cf8deff/obj1/1 is an unexpected clone
2015-12-10 11:23:06.796283 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/e478ac7f/obj9/1 is missing in clone_size
2015-12-10 11:23:06.796337 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/29547577/obj11/1 is an unexpected clone
2015-12-10 11:23:06.796412 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/94122507/obj14/1 size 1032 != clone_size 1033
2015-12-10 11:23:06.796482 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub stat mismatch, got 21/31 objects, 10/19 clones, 21/31 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7681/13064 bytes, 0/0 hit_set_archive bytes.
2015-12-10 11:23:06.796500 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub 19 errors
../test/osd/osd-scrub-snaps.sh:34: TEST_scrub_snaps: seq 1 7
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap1
removed pool test snap snap1
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap2
removed pool test snap snap2
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap3
removed pool test snap snap3
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap4
removed pool test snap snap4
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap5
removed pool test snap snap5
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap6
removed pool test snap snap6
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps: for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps: rados -p test rmsnap snap7
removed pool test snap snap7
./test/osd/osd-scrub-snaps.sh:165: TEST_scrub_snaps: ERRORS=0
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: find testdir/osd-scrub-snaps
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: grep '[^/]*\.pid'
./test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps: pidfile='testdir/osd-scrub-snaps/mon.a.pid testdir/osd-scrub-snaps/osd.0.pid'
../test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps: cat testdir/osd-scrub-snaps/mon.a.pid testdir/osd-scrub-snaps/osd.0.pid
./test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps: pid='32405 16134'
./test/osd/osd-scrub-snaps.sh:169: TEST_scrub_snaps: kill -0 32405 16134
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps: kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons: echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons: local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: true
../qa/workunits/ceph-helpers.sh:193: kill_daemons: shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons: return 1
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps: return 1
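For context, kill_daemons iterates over the pid files and escalates signals with growing delays, returning non-zero only when a daemon survives all of them, as at line 219 in the trace above. A minimal sketch of that loop (the delays here are illustrative; the real implementation is in qa/workunits/ceph-helpers.sh):

# Sketch of the per-daemon kill loop; $1 is a pidfile.
kill_daemon() {
    local pid=$(cat "$1")
    # Escalating delays; the exact sequence in ceph-helpers.sh differs,
    # these values are illustrative and sum to roughly 3 minutes.
    for delay in 1 1 2 3 5 5 10 10 20 60 60 ; do
        kill -TERM $pid 2> /dev/null
        kill -0 $pid 2> /dev/null || return 0  # daemon is gone
        sleep $delay
    done
    kill -KILL $pid 2> /dev/null
    sleep 1
    kill -0 $pid 2> /dev/null || return 0
    return 1  # still alive after the timeout: the case seen in the trace
}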
Updated by Loïc Dachary over 8 years ago
@david, do you have an opinion on the error log above? It's the first time this has happened.
Updated by Loïc Dachary over 8 years ago
See https://github.com/ceph/ceph/pull/6884 for an explanation of false negatives.
Updated by Loïc Dachary over 8 years ago
- Status changed from In Progress to Resolved