Bug #17084 (closed)
teuthology-nuke needs to kill valgrind.bin so the OSD filesystem can be nuked
Added by Yuri Weinstein over 7 years ago. Updated over 7 years ago.
Updated by David Galloway over 7 years ago
For mira068
2016-08-19 17:26:35,101.101 INFO:teuthology.task.install:Purging /var/lib/ceph on ubuntu@mira068.front.sepia.ceph.com
2016-08-19 17:26:35,101.101 INFO:teuthology.orchestra.run.mira068:Running: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph"
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:rm: skipping ‘/var/lib/ceph/osd/ceph-2’, since it's on a different device
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:umount: /var/lib/ceph/osd: not mounted
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:umount: /var/lib/ceph/osd/ceph-2: device is busy.
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr: (In some cases useful info about processes that use
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr: the device is found by lsof(8) or fuser(1))
2016-08-19 17:26:35,124.124 INFO:teuthology.orchestra.run.mira068.stderr:rm: skipping ‘/var/lib/ceph/osd/ceph-2’, since it's on a different device
root@mira068:~# fuser /var/lib/ceph/osd/ceph-2
/var/lib/ceph/osd/ceph-2: 10882
root@mira068:~# ps 10882
PID TTY STAT TIME COMMAND
10882 ? Ssl 1226:28 /usr/bin/valgrind.bin --trace-children=no --child-silent-after-fork=yes --num-callers=50 --suppressions=/home/ubuntu/cephtest/valgrind.supp --xml=yes --xml-file=/var/log/ceph/valgrind/osd.2.log --time-stamp=yes --tool=memcheck ceph-osd -f --cluster ceph -i 2
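To illustrate the diagnosis above, here is a minimal Python sketch that turns `fuser` output like the one shown into a list of PIDs holding a mount point, and builds a kill command for them. It is not the fix that landed in teuthology; the function names `parse_fuser_output` and `kill_command` are hypothetical, and the sketch assumes the path and PIDs appear on one line, as in the transcript.

```python
import re


def parse_fuser_output(text):
    """Parse combined fuser output into {path: [pid, ...]}.

    Hypothetical helper: assumes lines of the form "<path>: <pid> <pid> ...".
    """
    holders = {}
    for line in text.splitlines():
        # Split on the last colon so the PID list is always the tail.
        path, sep, rest = line.rpartition(":")
        if not sep:
            continue
        # fuser may append access-type letters (e.g. "10882c"); keep the digits.
        pids = [int(m) for m in re.findall(r"\b(\d+)[a-zA-Z]*", rest)]
        if pids:
            holders[path.strip()] = pids
    return holders


def kill_command(pids, signal="TERM"):
    """Build a shell command that signals the given PIDs (hypothetical helper)."""
    return "sudo kill -{} {}".format(signal, " ".join(str(p) for p in pids))


if __name__ == "__main__":
    sample = "/var/lib/ceph/osd/ceph-2: 10882"
    for path, pids in parse_fuser_output(sample).items():
        print(path, "->", kill_command(pids))
```

Once the holding process (here `valgrind.bin` wrapping `ceph-osd`) has exited, the `umount` and `rm -rf --one-file-system` steps in the log above should no longer hit "device is busy". `fuser` also has a `-k` option that kills the processes it finds, which may be simpler when no per-PID logging is needed.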
Updated by David Galloway over 7 years ago
- Project changed from sepia to teuthology
- Subject changed from can't nuke mira057 and mita068 to teuthology-nuke needs to kill valgrind.bin so the OSD filesystem can be nuked
Same deal with mira057.
I've left both machines as-is for debugging.
Updated by Yuri Weinstein over 7 years ago
- Priority changed from Normal to Urgent
Updated by Yuri Weinstein over 7 years ago
I suspect these nodes can't be nuked now b/c of this issue:
2016-08-22 15:25:47,199.199 INFO:teuthology.nuke:targets:
smithi013.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCoSH0HcJFeNYslWfI4hUCvigyrLj5wEB+L4f6tl7YpJ5gNJfOOEpwVk4XHbKrrmFssGyR/PybtLUsZwk9wDnymjvbYuMf9EuazKYH54MLVqCdRFr+2C5vaNt3nOWzRAZCybO0OLGebDiv50gfs4b1A8NkwTiwip7kAfaBoc5LU+dpIXqQI5YI3UixeIj2uKUAg9EBIw9D2UQw66WvUk1hJwHNDbZI7ivE3WF+wDLBV43RD7NDnxGY/XHPVswJESrcIX2NsmvUWuxJ6L0zmgCzXZQQsBr7e4i+xzdRE1VJkh4N3F8ML3rK8s79FwMW24WLqJYT0TuaYos2OMFLU9BeT
smithi031.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+3dhoeLJU1p8KaCV/rdfO3Ss0jD81tAL5R+ycfdl2I0/YpdWPtMvuuB5cav+bQvu3aYfWfQR5TLdDUBrNCWXlwiNj5h7bZ9VU89xUFcRGjYp9WnluvMef02rCYo4qIAHUqzqHxtA7bxJG3ecjVOdoFOjUj3+D7YG1caotkRCdmXGGnll2BKuEXCpDW4GeFFswOmaQ/6UIWJAgkU3oZmHADuJK+ZEMG4HohSPZXwW9U69RzoArUU3wi3xeYJ2jil+4vVyj0+DsvC+x1mEMe8wCZLp21ul4Kok0JMOBJihz2NRvIByC1rXY3S3vzy4s2fnEiT6v8nNCPZeKhvatJTXd
smithi039.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC/1mJ+rhNEfkTmPc+VGtOhGh/sC8HkkRWO8HCR5VfbjMXgdlFhkZ6H3a7RK6okHeFq0w7gBRvCUc7MKGoDwYQHRsIIj6+UC+yH/PAlghcOcKXZ3TuEorOcNyx230BDMVFDzwpuroxo4B/L2fu5oobMzHj4rJNlO/UgktR7bZuVrV4/JQ+MFYNctzlTkLTTGauUWm0647V5JUeLuiq334CnGcyNmVUIOAsPZWeOsvesY4LwswDsLpGZAUVYGKtlacRS2DwLk0wbTjyiTeJWvhK5uA+DOwyOvuMfoTd2bY1mJImX113+t+EAhQONfS+5osd3YeuNkLtGte2iviQ9G58f
smithi050.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuffiOyq2SO9JV0jMivOpnHJ+og4ak1S5xibbNEvhSxaOhuLJgb20WD3YvT8nxjkRs6EmzKiX9FdZl0fQgN61pnAPchZf4bl1WhFf5CrbawrjMOjdJL9VgmbAvlwGIta2/q+kATAiEqw1sKlUncFQ9m8VUlup+YtSP1d9lgQbN31DG5vGxAGOCOtBDuYBbQn7yK39UwliTajFZfJTCbVHPMp7z/0A8fvSUUhAQCEyIpo5BjB6/zvqAkYjR+WQHYR+fSwKwxJmcp/wRBbfs5frwIHNboQV3v7ddvy12Oh1ip5UUMAlnEmBinDqdcYdxxXtJe4lkT6Jz0DiWVMfTCcPH
smithi053.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCmhx39gAx/BcVBBp3vopZuFBPH60C2PXesOkH0fXd290e2c2ssTfawU1iSKH8CpqbcYKwLxmrmFn4svrlzlarbrzWmk3qSrPNaWlYCjIV3475p1sCHRR+dfOXgRQwcso9aA95wG8Qah4g8RMus2z4nYSHA0BPzaEheDBE9PQXefehAdD5bcnXUi59rbIzTXEfwzQ2gNbiW715O/gd4B8wPkmgWkQvwk2giEmOZeYHSYuV89d5PvLWY4EVFqY+C8dIArc26N1gwkD0lzcYd3+lMM8yyvvP9vPEYz2YvUOPgW3O9VJY/i0SzpAxN263Hqto2ZiLqBMjHTY/Wo3codeA5
smithi058.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDyB9EgtBZpXCMn1Sa/ilGsYgwQROCmEzXlUhRFPj0YaKAYIHoxyjEp7G8T+F0lMhT/WI0PKpHo8ZLrVF1UW3jtj8FvjzavAEAjE8F1FLEONvg5s0ESH3KMwsQD7tVQnTwDgI92v51P9owdIQK+aXAO9lTVTX6oon/fNZct2XGTS7bP29oNzx4Zpa2VU18FXQH/Ce1Y3hId/dq7VBBFpXG4bHLpTr7mRpfLJHSAImtUiuYMD4rV9X6xTytvvVfZuvw0wzpBlc6zF0SFJjkhJ007xBwhlgPyai7acp5z15rffPR5u5cICRdAqg7shY7qoHgWVhexYZ+33LyTidMnZzQ7
2016-08-22 15:25:47,199.199 INFO:teuthology.nuke:Not actually nuking anything since --dry-run was passed
Updated by Zack Cerza over 7 years ago
- Status changed from New to 12
- Assignee set to Zack Cerza
Updated by Zack Cerza over 7 years ago
- Status changed from 12 to Fix Under Review
Updated by Zack Cerza over 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Yuri Weinstein over 7 years ago
I still see failures when trying to nuke nodes: http://paste2.org/t5zANVWE
Updated by Zack Cerza over 7 years ago
Just to be clear, that is not the same issue.
Updated by Vasu Kulkarni over 7 years ago
I was having the same issue and was debugging what changed. It looks like package removal is now also being done at the start of the nuke. What was the reason for this change?
This will not work cleanly when locks are held on OSDs for various reasons; the node eventually fails, leaving the rest of the nodes unlocked. We have to change that order back to the original.
Updated by Vasu Kulkarni over 7 years ago
Updated by Dan Mick over 7 years ago
What does "when locks are held on OSDs for various reasons" mean?
Package removal is supposed to stop daemons. What is the analysis of the problem with the existing operation order, and how will the suggested change in https://github.com/ceph/teuthology/pull/946 fix the problem?
Updated by Vasu Kulkarni over 7 years ago
Dan,
You will have to check with the core devs on that. If you try to remove them without a reboot, it fails. That was the order in the earlier nuke code, and it worked cleanly. Now you are asking questions based on the new nuke code change, which is not working. I am not solving any problem; I am just restoring what worked before.
Updated by Dan Mick over 7 years ago
Sorry, I thought you'd analyzed the problem to suggest a solution.
Updated by Dan Mick over 7 years ago
I still don't understand why the bug can't have a clear statement of what's wrong, why, and why the suggestion fixes the problem.