Bug #11398
closed
tests: osd/osd-scrub-repair.sh objectstore-tool races
0%
Description
objectstore_tool races against osd shutdown in osd-scrub-repair.sh
objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/2 --journal-path testdir/osd-scrub-repair/2/journal SOMETHING list-attrs
OSD has the store locked
objectstore_tool: 684: return 1
corrupt_and_repair_two: 153: return 1
TEST_corrupt_and_repair_erasure_coded: 127: return 1
http://jenkins.ceph.dachary.org/job/ceph/OS=centos-7/1379/
Updated by Kefu Chai about 9 years ago
Weird: objectstore_tool() uses kill_daemons() to get exclusive access to the underlying FileStore, and kill_daemons() waited until the process disappeared. The fsid flock is relinquished when ceph-osd returns.
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -TERM 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 0
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 1
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 12506
kill_daemons: 178: break
Updated by Loïc Dachary about 9 years ago
Could be related to http://tracker.ceph.com/issues/11399, i.e. side effect of a hardware problem.
Updated by Loïc Dachary almost 9 years ago
- Status changed from 12 to Can't reproduce
Did not occur in the past few weeks.
Updated by Loïc Dachary almost 9 years ago
- Status changed from Can't reproduce to 12
- Regression set to No
http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04&&x86_64/5003/console
http://jenkins.ceph.dachary.org/job/ceph/LABELS=ubuntu-14.04&&x86_64/5002/console
kill_daemons: 180: sleep 3
objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/3 --journal-path testdir/osd-scrub-repair/3/journal SOMETHING list-attrs
OSD has the store locked
objectstore_tool: 684: return 1
on hammer https://github.com/ceph/ceph/commit/a789250cc88458a68e91968078d8a49101e5ba33
Updated by Loïc Dachary almost 9 years ago
It looks like http://tracker.ceph.com/issues/10389 but it happened on https://github.com/ceph/ceph/commit/a789250cc88458a68e91968078d8a49101e5ba33 which contains the corresponding fix https://github.com/ceph/ceph/pull/3215 https://github.com/dachary/ceph/commit/487c22a8a4b3ba099f9c19125c720e99e7c8d0db
Updated by Loïc Dachary almost 9 years ago
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 28608
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 2
kill_daemons: 177: for try in 0 1 1 1 2 3
kill_daemons: 178: kill -0 28608
kill_daemons: 179: send_signal=0
kill_daemons: 180: sleep 3
objectstore_tool: 682: ceph-objectstore-tool --data-path testdir/osd-scrub-repair/3 --journal-path testdir/osd-scrub-repair/3/journal SOMETHING list-attrs
OSD has the store locked
The daemon is not actually killed: the kill_daemons function gave up after trying for 3+2+1+1+1 seconds, which can sometimes happen when the machine is extra slow. This is usually not a big deal, and kill_daemons is actually documented to behave that way. But before ceph-objectstore-tool runs there must not be any remaining daemons. A variant of kill_daemons must be implemented that guarantees it will either kill all daemons or fail trying.
Updated by Loïc Dachary almost 9 years ago
- Status changed from 12 to Fix Under Review
- Backport set to hammer
Updated by Loïc Dachary almost 9 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Loïc Dachary almost 9 years ago
- hammer backport https://github.com/ceph/ceph/pull/4618
Updated by Kefu Chai almost 9 years ago
- Status changed from Pending Backport to Resolved