Project

General

Profile

Bug #13986

TEST_scrub_snaps: timeout > 300

Added by Loic Dachary almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
12/05/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/9864/consoleFull

.../qa/workunits/ceph-helpers.sh:874: get_num_active_clean:  xmlstarlet sel -t -m '//pg_stat/state[(contains(.,'\''active'\'') and contains(.,'\''clean'\'') and not(contains(.,'\''stale'\'')))]' -v . -n
.../qa/workunits/ceph-helpers.sh:873: get_num_active_clean:  ceph --format xml pg dump pgs
../qa/workunits/ceph-helpers.sh:956: is_clean:  test 5 = 5
../qa/workunits/ceph-helpers.sh:1003: wait_for_clean:  return 0
./test/osd/osd-scrub-snaps.sh:153: TEST_scrub_snaps:  local pgid=2.0
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps:  pg_scrub 2.0
../qa/workunits/ceph-helpers.sh:1059: pg_scrub:  local pgid=2.0
.../qa/workunits/ceph-helpers.sh:1060: pg_scrub:  get_last_scrub_stamp 2.0
.../qa/workunits/ceph-helpers.sh:928: get_last_scrub_stamp:  local pgid=2.0
.../qa/workunits/ceph-helpers.sh:929: get_last_scrub_stamp:  ceph --format xml pg dump pgs
.../qa/workunits/ceph-helpers.sh:930: get_last_scrub_stamp:  xmlstarlet sel -t -m '//pg_stat[pgid='\''2.0'\'']/last_scrub_stamp' -v .
../qa/workunits/ceph-helpers.sh:1060: pg_scrub:  local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1061: pg_scrub:  ceph pg scrub 2.0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 2.0 on osd.0 to scrub
../qa/workunits/ceph-helpers.sh:1062: pg_scrub:  wait_for_scrub 2.0 '2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1142: wait_for_scrub:  local pgid=2.0
../qa/workunits/ceph-helpers.sh:1143: wait_for_scrub:  local 'last_scrub=2015-12-05 03:06:17.448168'
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i=0 ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i < 300 ))
.../qa/workunits/ceph-helpers.sh:1146: wait_for_scrub:  get_last_scrub_stamp 2.0
...
../qa/workunits/ceph-helpers.sh:1149: wait_for_scrub:  sleep 1
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i++ ))
../qa/workunits/ceph-helpers.sh:1145: wait_for_scrub:  (( i < 300 ))
../qa/workunits/ceph-helpers.sh:1151: wait_for_scrub:  return 1
./test/osd/osd-scrub-snaps.sh:154: TEST_scrub_snaps:  return 1
./test/osd/osd-scrub-snaps.sh:30: run:  return 1
../qa/workunits/ceph-helpers.sh:1242: main:  code=1
../qa/workunits/ceph-helpers.sh:1244: main:  teardown testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:117: teardown:  local dir=testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:118: teardown:  kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons:  local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons:  return 0
.../qa/workunits/ceph-helpers.sh:119: teardown:  stat -f -c %T .
../qa/workunits/ceph-helpers.sh:119: teardown:  '[' ext2/ext3 == btrfs ']'
../qa/workunits/ceph-helpers.sh:122: teardown:  rm -fr testdir/osd-scrub-snaps
../qa/workunits/ceph-helpers.sh:1245: main:  return 1

test.out.gz (115 KB) Loic Dachary, 12/08/2015 04:56 PM

test2.out.gz (114 KB) Loic Dachary, 12/09/2015 01:07 AM


Related issues

Related to Ceph - Bug #14027: tests: set --osd-scrub-load-threshold=2000 Resolved 12/09/2015

Associated revisions

Revision 2c7c7129 (diff)
Added by Loic Dachary almost 4 years ago

tests: osd-scrub-snaps.sh to display full osd logs on error

This generates a large output but is the only way to get more
information about race conditions that turn out to show up frequently.

http://tracker.ceph.com/issues/13986 Refs: #13986

Signed-off-by: Loic Dachary <>

History

#3 Updated by Loic Dachary almost 4 years ago

https://github.com/ceph/ceph/pull/6857 to collect more info about the possible race

#4 Updated by Loic Dachary almost 4 years ago

Running

while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done

on rex003

#5 Updated by Loic Dachary almost 4 years ago

  • Status changed from Need More Info to Duplicate

#6 Updated by Loic Dachary almost 4 years ago

  • Duplicates Bug #13988: new OSD re-using old OSD id fails to boot added

#7 Updated by Loic Dachary almost 4 years ago

  • Duplicates deleted (Bug #13988: new OSD re-using old OSD id fails to boot)

#8 Updated by Loic Dachary almost 4 years ago

#9 Updated by Loic Dachary almost 4 years ago

  • Status changed from Duplicate to Need Review
  • Assignee set to Loic Dachary

#11 Updated by Loic Dachary almost 4 years ago

The relevant lines of the osd logs are:

2015-12-08 14:45:53.349784 7f3a6c534700 20 osd.0 19 scrub_load_below_threshold loadavg 32.23 >= max 0.5 and ( >= daily_loadavg 13.3244 or >= 15m avg 13.39) = no
2015-12-08 14:45:53.349806 7f3a6c534700 20 osd.0 19 sched_scrub load_is_low=0
2015-12-08 14:45:53.349809 7f3a6c534700 10 osd.0 19 sched_scrub 2.0 schedued at 2015-12-09 19:55:20.323080 > 2015-12-08 14:45:53.349745

#12 Updated by Loic Dachary almost 4 years ago

#13 Updated by Loic Dachary almost 4 years ago

crash with monitor logs and --osd-scrub-load-threshold=2000 http://dachary.org/loic/crash.gz

#14 Updated by Loic Dachary almost 4 years ago

  • Related to Bug #14027: tests: set --osd-scrub-load-threshold=2000 added

#15 Updated by Loic Dachary almost 4 years ago

  • Status changed from Need Review to In Progress

#16 Updated by Loic Dachary almost 4 years ago

To reproduce the problem:

  • in the background run
    while : ; do echo -n . ; ./run-make-check.sh > run.out 2>&1 ; done
    
  • on master, after a ./run-make-check.sh
    cd src ; while : ; do echo -n . ; test/osd/osd-scrub-snaps.sh > test.out 2>&1 || break ; done
    

Using the commit from https://github.com/ceph/ceph/pull/6860 to display the monitor and osd log, test.out will have all the necessary diagnostic information.

#18 Updated by Loic Dachary almost 4 years ago

http://jenkins.ceph.dachary.org/job/ceph/LABELS=centos-7&&x86_64/10048/console

it has a new kind of error log. The kill_daemon function returned on error, meaning it was unable to kill the daemons before the timeout (~3 minutes). I've never seen this but it's not impossible if the machine is heavily overloaded.

./qa/workunits/ceph-helpers.sh:1147: wait_for_scrub:  test '2015-12-10 11:22:18.817416' '!=' '2015-12-10 11:23:06.796504'
../qa/workunits/ceph-helpers.sh:1148: wait_for_scrub:  return 0
./test/osd/osd-scrub-snaps.sh:158: TEST_scrub_snaps:  grep log_channel testdir/osd-scrub-snaps/osd.0.log
2015-12-10 11:23:06.791343 7f7b7e136700  0 log_channel(cluster) log [INF] : 2.0 scrub starts
2015-12-10 11:23:06.794271 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/2acecc8b/obj10/1 is missing in clone_overlap
2015-12-10 11:23:06.794328 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 no '_' attr
2015-12-10 11:23:06.794336 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/7 is an unexpected clone
2015-12-10 11:23:06.794377 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/4 on disk size (4608) does not match object info size (512) adjusted for ondisk to (512)
2015-12-10 11:23:06.794443 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/2
2015-12-10 11:23:06.794449 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/666934a3/obj5/head expected clone 2/666934a3/obj5/1
2015-12-10 11:23:06.794454 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/666934a3/obj5/head 2 missing clone(s)
2015-12-10 11:23:06.794481 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/d3a9faf5/obj12/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794637 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/8df7eaa5/obj8/head snaps.seq not set
2015-12-10 11:23:06.794717 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/head snapset.head_exists=false, but head exists
2015-12-10 11:23:06.794741 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/5c889059/obj7/1 is an unexpected clone
2015-12-10 11:23:06.794750 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/61f68bb1/obj3/head on disk size (3840) does not match object info size (768) adjusted for ondisk to (768)
2015-12-10 11:23:06.794900 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/83425cc4/obj6/1 is an unexpected clone
2015-12-10 11:23:06.794928 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/3f1ee208/obj2/snapdir no 'snapset' attr
2015-12-10 11:23:06.794948 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/7 clone ignored due to missing snapset
2015-12-10 11:23:06.794967 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/3f1ee208/obj2/4 clone ignored due to missing snapset
2015-12-10 11:23:06.795017 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/a8759770/obj4/snapdir expected clone 2/a8759770/obj4/7
2015-12-10 11:23:06.795023 7f7b7e136700  0 log_channel(cluster) log [INF] : scrub 2.0 2/a8759770/obj4/snapdir 1 missing clone(s)
2015-12-10 11:23:06.796212 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/6cf8deff/obj1/1 is an unexpected clone
2015-12-10 11:23:06.796283 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/e478ac7f/obj9/1 is missing in clone_size
2015-12-10 11:23:06.796337 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/29547577/obj11/1 is an unexpected clone
2015-12-10 11:23:06.796412 7f7b7e136700 -1 log_channel(cluster) log [ERR] : scrub 2.0 2/94122507/obj14/1 size 1032 != clone_size 1033
2015-12-10 11:23:06.796482 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub stat mismatch, got 21/31 objects, 10/19 clones, 21/31 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 7681/13064 bytes, 0/0 hit_set_archive bytes.
2015-12-10 11:23:06.796500 7f7b7e136700 -1 log_channel(cluster) log [ERR] : 2.0 scrub 19 errors
../test/osd/osd-scrub-snaps.sh:34: TEST_scrub_snaps:  seq 1 7
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap1
removed pool test snap snap1
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap2
removed pool test snap snap2
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap3
removed pool test snap snap3
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap4
removed pool test snap snap4
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap5
removed pool test snap snap5
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap6
removed pool test snap snap6
./test/osd/osd-scrub-snaps.sh:160: TEST_scrub_snaps:  for i in '`seq 1 7`'
./test/osd/osd-scrub-snaps.sh:162: TEST_scrub_snaps:  rados -p test rmsnap snap7
removed pool test snap snap7
./test/osd/osd-scrub-snaps.sh:165: TEST_scrub_snaps:  ERRORS=0
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  find testdir/osd-scrub-snaps
../test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  grep '[^/]*\.pid'
./test/osd/osd-scrub-snaps.sh:167: TEST_scrub_snaps:  pidfile='testdir/osd-scrub-snaps/mon.a.pid
testdir/osd-scrub-snaps/osd.0.pid'
../test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps:  cat testdir/osd-scrub-snaps/mon.a.pid testdir/osd-scrub-snaps/osd.0.pid
./test/osd/osd-scrub-snaps.sh:168: TEST_scrub_snaps:  pid='32405
16134'
./test/osd/osd-scrub-snaps.sh:169: TEST_scrub_snaps:  kill -0 32405 16134
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps:  kill_daemons testdir/osd-scrub-snaps
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  shopt -q -o xtrace
.../qa/workunits/ceph-helpers.sh:192: kill_daemons:  echo true
../qa/workunits/ceph-helpers.sh:192: kill_daemons:  local trace=true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  true
../qa/workunits/ceph-helpers.sh:193: kill_daemons:  shopt -u -o xtrace
../qa/workunits/ceph-helpers.sh:219: kill_daemons:  return 1
./test/osd/osd-scrub-snaps.sh:176: TEST_scrub_snaps:  return 1

#19 Updated by Loic Dachary almost 4 years ago

@david do you have an opinion on the error log above ? It's the first time it happens.

#20 Updated by Loic Dachary almost 4 years ago

See https://github.com/ceph/ceph/pull/6884 for an explanation of false negatives.

#22 Updated by Loic Dachary almost 4 years ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF