Bug #17084

closed

teuthology-nuke needs to kill valgrind.bin so the OSD filesystem can be nuked

Added by Yuri Weinstein over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

mira057 and mira068 can't be nuked; they were reported as stale.

http://paste2.org/B81gU6F3

Actions #1

Updated by David Galloway over 7 years ago

For mira068

2016-08-19 17:26:35,101.101 INFO:teuthology.task.install:Purging /var/lib/ceph on ubuntu@mira068.front.sepia.ceph.com
2016-08-19 17:26:35,101.101 INFO:teuthology.orchestra.run.mira068:Running: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph" 
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:rm: skipping ‘/var/lib/ceph/osd/ceph-2’, since it's on a different device
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:umount: /var/lib/ceph/osd: not mounted
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:umount: /var/lib/ceph/osd/ceph-2: device is busy.
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:        (In some cases useful info about processes that use
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:         the device is found by lsof(8) or fuser(1))
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:rm: skipping ‘/var/lib/ceph/osd/ceph-2’, since it's on a different device

root@mira068:~# fuser /var/lib/ceph/osd/ceph-2
/var/lib/ceph/osd/ceph-2: 10882
root@mira068:~# ps 10882
  PID TTY      STAT   TIME COMMAND
10882 ?        Ssl  1226:28 /usr/bin/valgrind.bin --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.2.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 2
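
The manual steps above (run fuser on the busy mount point, then look up the PID) could be automated before the umount. What follows is only a minimal sketch under stated assumptions, not the fix that went into teuthology: the helper names parse_fuser_pids, holders_of and kill_holders are hypothetical, it assumes fuser(1) from psmisc is installed on the node, and it runs locally rather than through teuthology's remote execution layer.

```python
import os
import signal
import subprocess


def parse_fuser_pids(output):
    """Extract PIDs from fuser output such as '/var/lib/ceph/osd/ceph-2: 10882'."""
    pids = []
    for line in output.splitlines():
        # Drop the "<path>:" prefix if present; keep only the PID list.
        _, _, rest = line.rpartition(":")
        for token in rest.split():
            # fuser may append access-type letters (e.g. "10882c"); keep digits only.
            digits = "".join(ch for ch in token if ch.isdigit())
            if digits:
                pids.append(int(digits))
    return pids


def holders_of(mount_point):
    """Return PIDs of processes using the filesystem mounted at mount_point."""
    result = subprocess.run(
        ["fuser", "-m", mount_point],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return parse_fuser_pids(result.stdout)


def kill_holders(mount_point, sig=signal.SIGKILL):
    """Signal every process that keeps mount_point busy (hypothetical helper)."""
    for pid in holders_of(mount_point):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            # The process exited between the fuser call and the kill.
            pass
```

Run as root, kill_holders("/var/lib/ceph/osd/ceph-2") would signal the valgrind.bin process shown above, after which the umount and rm should be able to proceed.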
Actions #2

Updated by David Galloway over 7 years ago

  • Project changed from sepia to teuthology
  • Subject changed from can't nuke mira057 and mita068 to teuthology-nuke needs to kill valgrind.bin so the OSD filesystem can be nuked

Same deal with mira057.

I've left both machines as-is for debugging.

Actions #3

Updated by David Galloway over 7 years ago

  • Assignee deleted (David Galloway)
Actions #4

Updated by Yuri Weinstein over 7 years ago

  • Priority changed from Normal to Urgent
Actions #5

Updated by Yuri Weinstein over 7 years ago

I suspect these nodes can't be nuked now because of this issue:

2016-08-22 15:25:47,199.199 INFO:teuthology.nuke:targets:
smithi013.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoSH0HcJFeNYslWfI4hUCvigyrLj5wEB+L4f6tl7YpJ5gNJfOOEpwVk4XHbKrrmFssGyR/PybtLUsZwk9wDnymjvbYuMf9EuazKYH54MLVqCdRFr+2C5vaNt3nOWzRAZCybO0OLGebDiv50gfs4b1A8NkwTiwip7kAfaBoc5LU+dpIXqQI5YI3UixeIj2uKUAg9EBIw9D2UQw66WvUk1hJwHNDbZI7ivE3WF+wDLBV43RD7NDnxGY/XHPVswJESrcIX2NsmvUWuxJ6L0zmgCzXZQQsBr7e4i+xzdRE1VJkh4N3F8ML3rK8s79FwMW24WLqJYT0TuaYos2OMFLU9BeT
smithi031.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+3dhoeLJU1p8KaCV/rdfO3Ss0jD81tAL5R+ycfdl2I0/YpdWPtMvuuB5cav+bQvu3aYfWfQR5TLdDUBrNCWXlwiNj5h7bZ9VU89xUFcRGjYp9WnluvMef02rCYo4qIAHUqzqHxtA7bxJG3ecjVOdoFOjUj3+D7YG1caotkRCdmXGGnll2BKuEXCpDW4GeFFswOmaQ/6UIWJAgkU3oZmHADuJK+ZEMG4HohSPZXwW9U69RzoArUU3wi3xeYJ2jil+4vVyj0+DsvC+x1mEMe8wCZLp21ul4Kok0JMOBJihz2NRvIByC1rXY3S3vzy4s2fnEiT6v8nNCPZeKhvatJTXd
smithi039.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/1mJ+rhNEfkTmPc+VGtOhGh/sC8HkkRWO8HCR5VfbjMXgdlFhkZ6H3a7RK6okHeFq0w7gBRvCUc7MKGoDwYQHRsIIj6+UC+yH/PAlghcOcKXZ3TuEorOcNyx230BDMVFDzwpuroxo4B/L2fu5oobMzHj4rJNlO/UgktR7bZuVrV4/JQ+MFYNctzlTkLTTGauUWm0647V5JUeLuiq334CnGcyNmVUIOAsPZWeOsvesY4LwswDsLpGZAUVYGKtlacRS2DwLk0wbTjyiTeJWvhK5uA+DOwyOvuMfoTd2bY1mJImX113+t+EAhQONfS+5osd3YeuNkLtGte2iviQ9G58f
smithi050.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuffiOyq2SO9JV0jMivOpnHJ+og4ak1S5xibbNEvhSxaOhuLJgb20WD3YvT8nxjkRs6EmzKiX9FdZl0fQgN61pnAPchZf4bl1WhFf5CrbawrjMOjdJL9VgmbAvlwGIta2/q+kATAiEqw1sKlUncFQ9m8VUlup+YtSP1d9lgQbN31DG5vGxAGOCOtBDuYBbQn7yK39UwliTajFZfJTCbVHPMp7z/0A8fvSUUhAQCEyIpo5BjB6/zvqAkYjR+WQHYR+fSwKwxJmcp/wRBbfs5frwIHNboQV3v7ddvy12Oh1ip5UUMAlnEmBinDqdcYdxxXtJe4lkT6Jz0DiWVMfTCcPH
smithi053.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCmhx39gAx/BcVBBp3vopZuFBPH60C2PXesOkH0fXd290e2c2ssTfawU1iSKH8CpqbcYKwLxmrmFn4svrlzlarbrzWmk3qSrPNaWlYCjIV3475p1sCHRR+dfOXgRQwcso9aA95wG8Qah4g8RMus2z4nYSHA0BPzaEheDBE9PQXefehAdD5bcnXUi59rbIzTXEfwzQ2gNbiW715O/gd4B8wPkmgWkQvwk2giEmOZeYHSYuV89d5PvLWY4EVFqY+C8dIArc26N1gwkD0lzcYd3+lMM8yyvvP9vPEYz2YvUOPgW3O9VJY/i0SzpAxN263Hqto2ZiLqBMjHTY/Wo3codeA5
smithi058.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDyB9EgtBZpXCMn1Sa/ilGsYgwQROCmEzXlUhRFPj0YaKAYIHoxyjEp7G8T+F0lMhT/WI0PKpHo8ZLrVF1UW3jtj8FvjzavAEAjE8F1FLEONvg5s0ESH3KMwsQD7tVQnTwDgI92v51P9owdIQK+aXAO9lTVTX6oon/fNZct2XGTS7bP29oNzx4Zpa2VU18FXQH/Ce1Y3hId/dq7VBBFpXG4bHLpTr7mRpfLJHSAImtUiuYMD4rV9X6xTytvvVfZuvw0wzpBlc6zF0SFJjkhJ007xBwhlgPyai7acp5z15rffPR5u5cICRdAqg7shY7qoHgWVhexYZ+33LyTidMnZzQ7
2016-08-22 15:25:47,199.199 INFO:teuthology.nuke:Not actually nuking anything since --dry-run was passed

Actions #6

Updated by Zack Cerza over 7 years ago

  • Status changed from New to 12
  • Assignee set to Zack Cerza
Actions #7

Updated by Zack Cerza over 7 years ago

  • Status changed from 12 to Fix Under Review
Actions #8

Updated by Zack Cerza over 7 years ago

  • Status changed from Fix Under Review to Resolved
Actions #9

Updated by Yuri Weinstein over 7 years ago

Still seeing this when trying to nuke nodes: http://paste2.org/t5zANVWE

Actions #10

Updated by Zack Cerza over 7 years ago

Just to be clear, that is not the same issue.

Actions #11

Updated by Vasu Kulkarni over 7 years ago

I was having the same issue and was debugging what changed. It looks like package removal is now also being done at the start of the nuke. What was the reason for this change?
This will not work cleanly when locks are held on OSDs for various reasons; the node eventually fails, leaving the rest of the nodes unlocked. We have to change that order back to the original.

Actions #13

Updated by Dan Mick over 7 years ago

What does "when locks are held on OSDs for various reasons" mean?

Package removal is supposed to stop daemons. What is the analysis of the problem with the existing operation order, and how will the suggested change in https://github.com/ceph/teuthology/pull/946 fix the problem?

Actions #14

Updated by Vasu Kulkarni over 7 years ago

Dan,

You will have to check with core dev on that. If you try to remove them without a reboot, it fails. That was the earlier code in nuke, and it worked cleanly. Now you are asking questions based on the new nuke code change, which is not working. I am not solving any problem; I am just fixing what worked before.

Actions #15

Updated by Dan Mick over 7 years ago

Sorry, I thought you'd analyzed the problem to suggest a solution.

Actions #16

Updated by Dan Mick over 7 years ago

I still don't understand why the bug can't have a clear statement of what's wrong, why, and why the suggestion fixes the problem.
