Bug #11398


tests: osd/osd-scrub-repair.sh objectstore-tool races

Added by Loïc Dachary about 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

objectstore_tool races against osd shutdown in osd-scrub-repair.sh

objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/2 --journal-path testdir/osd-scrub-repair/2/journal SOMETHING list-attrs
OSD has the store locked
objectstore_tool: 684: return 1
corrupt_and_repair_two: 153: return 1
TEST_corrupt_and_repair_erasure_coded: 127: return 1

http://jenkins.ceph.dachary.org/job/ceph/OS=centos-7/1379/
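
For reference, the objectstore_tool helper in the trace above stops the OSD and then runs ceph-objectstore-tool against its data directory. A rough sketch of its shape, reconstructed from the trace and from note #1 below (argument and error handling are guesses, not the actual ceph-helpers.sh source):

    objectstore_tool() {
        local dir=$1    # e.g. testdir/osd-scrub-repair
        local id=$2     # e.g. 2
        shift 2
        # stop the daemons so ceph-objectstore-tool can take the store lock;
        # this is the step that races with a slow OSD shutdown
        kill_daemons $dir
        ceph-objectstore-tool \
            --data-path $dir/$id \
            --journal-path $dir/$id/journal \
            "$@" || return 1
    }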


Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #11930: test/ceph_objectstore_tool.py fails with "OSD has the store locked" (Can't reproduce, David Zafman, 06/09/2015)

#1

Updated by Kefu Chai about 9 years ago

Weird: objectstore_tool() uses kill_daemons() to get exclusive access to the underlying FileStore, and kill_daemons() waited until the process disappeared; the fsid flock is relinquished when ceph-osd exits. (A reconstruction of the retry loop follows the trace below.)

kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -TERM 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 0
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 178: break
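
Read back from the trace, the retry loop inside kill_daemons looks roughly like this (a reconstruction from the -x output above, not the exact ceph-helpers.sh code):

    pid=12506
    send_signal=TERM
    for try in 0 1 1 1 2 3 ; do
        # the first pass sends SIGTERM, later passes only probe with signal 0;
        # the loop breaks as soon as the process is gone
        kill -$send_signal $pid 2> /dev/null || break
        send_signal=0
        sleep $try
    done

So the helper asks the daemon to stop and then polls for at most 0+1+1+1+2+3 seconds before giving up.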
#2

Updated by Loïc Dachary about 9 years ago

Could be related to http://tracker.ceph.com/issues/11399, i.e. a side effect of a hardware problem.

#3

Updated by Loïc Dachary almost 9 years ago

  • Status changed from 12 to Can't reproduce

Did not occur in the past few weeks.

#4

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Can't reproduce to 12
  • Regression set to No

http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04&&x86_64/5003/console
http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04&&x86_64/5002/console

kill_daemons: 180: sleep 3
objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/3 --journal-path testdir/osd-scrub-repair/3/journal SOMETHING list-attrs
OSD has the store locked
objectstore_tool: 684: return 1

on hammer https://github.com/ceph/ceph/commit/a789250cc88458a68e91968078d8a49101e5ba33

#6

Updated by Loïc Dachary almost 9 years ago

kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 28608
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 2
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 28608
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 3
objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/3 --journal-path testdir/osd-scrub-repair/3/journal SOMETHING list-attrs
OSD has the store locked

The daemon is not actually killed: the kill_daemons function gave up after trying for 3+2+1+1+1 seconds, which can happen when the machine is extra slow. That is usually not a big deal, and kill_daemons is documented to have that behavior. But when ceph-objectstore-tool is about to run there must not be any remaining daemons. A variant of kill_daemons must be implemented that guarantees it will either kill all daemons or fail trying.
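
A minimal sketch of that idea (the function name and the escalation to SIGKILL are illustrative only, not necessarily the fix that was merged):

    # Kill one daemon or report failure; a caller would loop over all pids.
    kill_daemon_or_fail() {
        local pid=$1
        local send_signal=TERM
        for try in 0 1 1 1 2 3 ; do
            kill -$send_signal $pid 2> /dev/null || return 0   # already gone
            send_signal=0
            sleep $try
        done
        kill -KILL $pid 2> /dev/null   # last resort
        sleep 1
        if kill -0 $pid 2> /dev/null ; then
            echo "daemon $pid survived SIGTERM and SIGKILL" >&2
            return 1
        fi
        return 0
    }
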
#7

Updated by Loïc Dachary almost 9 years ago

  • Status changed from 12 to Fix Under Review
  • Backport set to hammer
#8

Updated by Loïc Dachary almost 9 years ago

  • Status changed from Fix Under Review to Pending Backport
#10

Updated by Kefu Chai almost 9 years ago

  • Status changed from Pending Backport to Resolved
