Bug #6807: Debian Wheezy Teuthology ceph-deploy run failed
Status: Closed
Description
/a/teuthology-2013-11-19_01:10:01-ceph-deploy-next-testing-basic-vps/108339 needs to be investigated. The run reports coredumps, and RuntimeError: Failed to execute command: rm rf --one-file-system - /var/lib/ceph
UPDATE
Something not directly related to ceph-deploy is leaving OSDs mounted, which is why ceph-deploy
can't remove the contents.
A better error message should be put in place so that when the error is triggered it is clear that OSDs might still be present.
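One way to make that error message clearer is to check for mounts still lingering under `/var/lib/ceph` before reporting the failure. The sketch below is hypothetical (it is not ceph-deploy's actual code, and the function name is illustrative); it parses `/proc/mounts`-formatted text to find such mount points.

```python
# Hypothetical helper: find mount points still present under
# /var/lib/ceph, so a failed removal can report "OSDs may still
# be mounted" instead of a bare rm error. Illustrative only.
def lingering_osd_mounts(proc_mounts_text, base="/var/lib/ceph"):
    """Return mount points under `base`, parsed from text in
    /proc/mounts format (device, mount point, fstype, ...)."""
    mounts = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith(base):
            mounts.append(fields[1])
    return mounts
```

In a real check, the text would come from reading `/proc/mounts`, and a non-empty result would be appended to the error message.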
Updated by Anonymous over 10 years ago
That error should read:
RuntimeError: Failed to execute command: rm -rf --one-file-system -- /var/lib/ceph
Updated by Alfredo Deza over 10 years ago
- Status changed from New to 12
Good catch, I think that this is the culprit:
[INFO ] Running command: sudo rm -rf --one-file-system -- /var/lib/ceph
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-3', since it's on a different device
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-2', since it's on a different device
So for ceph-deploy, this should complete normally as long as the directories it's removing are on the same file system.
I'm not sure of the historical reasons why these commands were run with `--one-file-system`, but removing that constraint would definitely
fix this problem.
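For context, `--one-file-system` makes rm skip any directory whose device differs from its parent's, which is exactly why mounted OSD directories are left behind. A minimal sketch of that device check (my illustration, not rm's source):

```python
# Sketch of the check behind rm's --one-file-system flag: an entry
# is only descended into when its st_dev matches the parent's,
# i.e. both paths live on the same file system. Illustrative only.
import os

def on_same_filesystem(parent, child):
    """True when `child` is on the same device as `parent`, so a
    --one-file-system traversal would not skip it."""
    return os.stat(parent).st_dev == os.stat(child).st_dev
```

A mounted OSD under `/var/lib/ceph/osd/` fails this check, producing the "skipping ... since it's on a different device" messages above.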
Updated by Alfredo Deza over 10 years ago
- Status changed from 12 to 4
- Priority changed from Normal to High
It looks like the reason we were enforcing the single file system is that we might still have OSDs mounted (hence the different file system) and might not
want to remove them.
The current proposal is to print a big WARNING message when we are unable to remove the contents for this reason, rather than erroring out.
Before implementing this I would really like some confirmation that this is indeed the correct path.
Updated by Greg Farnum over 10 years ago
It sounds like there was an earlier problem with the test or a different failure — why is it trying to delete the ceph directories (purge, presumably) while it still has OSDs mounted on the filesystem? We don't want to "succeed" at removing the directory if there was stuff we couldn't get rid of...
Updated by Alfredo Deza over 10 years ago
Just found out that previously we would try to remove `/var/lib/ceph`, check whether it was still there, and then attempt to unmount
anything that might still be lingering around.
When ceph-deploy changed to exit whenever a system call returns a non-zero exit status, this logic stopped working, so it
prevented the unmounting of any OSDs still mounted at that location.
The fix, then, is not to add a warning to that action, but to actually handle the error (if it appears) and then unmount, just as it did before.
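The flow described above (try the removal, and on failure unmount lingering OSD mounts and retry) can be sketched roughly as follows. This is a hypothetical illustration, not ceph-deploy's actual implementation; the function name and the injected command runner are my own, the latter so the logic can be exercised without touching a real system.

```python
# Hypothetical sketch of the fix: attempt the removal, and if it
# fails (likely because OSDs are still mounted under /var/lib/ceph),
# unmount them and retry instead of erroring out immediately.
import subprocess

def purge_data(osd_mounts, run=subprocess.call):
    """Remove /var/lib/ceph, unmounting the given lingering OSD
    mount points and retrying if the first attempt fails.
    `run` executes a command list and returns its exit status."""
    cmd = ["rm", "-rf", "--one-file-system", "--", "/var/lib/ceph"]
    if run(cmd) == 0:
        return
    # First attempt failed: unmount lingering OSDs, then retry once.
    for mount in osd_mounts:
        run(["umount", mount])
    if run(cmd) != 0:
        raise RuntimeError("Failed to execute command: " + " ".join(cmd))
```

The key difference from a plain warning is that the error path actively unmounts and retries, restoring the pre-change behavior while keeping the strict exit-status checking.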
Updated by Alfredo Deza over 10 years ago
- Status changed from In Progress to Fix Under Review
Pull Request opened: https://github.com/ceph/ceph-deploy/pull/139
Updated by Alfredo Deza over 10 years ago
- Status changed from Fix Under Review to Resolved
Merged into ceph-deploy's master branch with hash: 109040e