Bug #6807

Debian Wheezy Teuthology Ceph-deploy run failed.

Added by Anonymous over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Alfredo Deza
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2013-11-19_01:10:01-ceph-deploy-next-testing-basic-vps/108339 needs to be investigated. The run reports coredumps, along with: RuntimeError: Failed to execute command: rm rf --one-file-system - /var/lib/ceph

UPDATE
Something that is not directly related to ceph-deploy is leaving OSDs mounted, which is why ceph-deploy cannot remove the contents.

A better error message should be put in place so that when the error is triggered it is clear that OSDs might still be present.

Actions #1

Updated by Anonymous over 10 years ago

That error should read:

RuntimeError: Failed to execute command: rm -rf --one-file-system -- /var/lib/ceph
Actions #2

Updated by Anonymous over 10 years ago

  • Assignee set to Alfredo Deza
Actions #3

Updated by Alfredo Deza over 10 years ago

  • Status changed from New to 12

Good catch, I think that this is the culprit:

[INFO  ] Running command: sudo rm -rf --one-file-system -- /var/lib/ceph
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-3', since it's on a different device
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-2', since it's on a different device

So for ceph-deploy, this is something that should complete normally as long as the directories it is removing are on the same file system.

I'm not sure about the historical reasons why these commands were run with `--one-file-system`, but removing that constraint would definitely fix this problem.
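
For context, the skipped paths are mount points: each OSD data directory lives on its own device, which is exactly the case `--one-file-system` guards against. A minimal Python sketch (not ceph-deploy code; the helper name and default path are assumptions) of how to detect those directories:

import os

def mounted_osd_dirs(osd_root="/var/lib/ceph/osd"):
    # Hypothetical helper: list OSD data directories that are still
    # mounted, i.e. sit on a different device than /var/lib/ceph,
    # which is what makes `rm --one-file-system` skip them.
    if not os.path.isdir(osd_root):
        return []
    return [os.path.join(osd_root, name)
            for name in sorted(os.listdir(osd_root))
            if os.path.ismount(os.path.join(osd_root, name))]

# On the failing run this would return something like:
# ['/var/lib/ceph/osd/ceph-2', '/var/lib/ceph/osd/ceph-3']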

Actions #4

Updated by Alfredo Deza over 10 years ago

  • Status changed from 12 to 4
  • Priority changed from Normal to High

It looks like the reason we were enforcing the single file system was that we might still have OSDs mounted (hence the different file system) and we might not want to remove them.

The current proposal is to emit a big WARNING message when we are unable to remove the contents for this reason, rather than erroring out.

Before implementing this I would really like some confirmation that this is indeed the correct path.
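
To make that concrete, a rough sketch (hypothetical function, not the actual ceph-deploy code) of what warning instead of erroring out could look like:

import logging
import subprocess

LOG = logging.getLogger("ceph_deploy.purge")

def purge_data(path="/var/lib/ceph"):
    # Hypothetical sketch: attempt the removal, and if rm refuses
    # because something is still mounted underneath, emit a loud
    # warning instead of raising on the non-zero exit status.
    cmd = ["sudo", "rm", "-rf", "--one-file-system", "--", path]
    if subprocess.call(cmd) != 0:
        LOG.warning("could not remove all contents of %s", path)
        LOG.warning("OSDs may still be mounted there")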

Actions #5

Updated by Greg Farnum over 10 years ago

It sounds like there was an earlier problem with the test or a different failure — why is it trying to delete the ceph directories (purge, presumably) while it still has OSDs mounted on the filesystem? We don't want to "succeed" at removing the directory if there was stuff we couldn't get rid of...

Actions #6

Updated by Alfredo Deza over 10 years ago

  • Description updated (diff)
Actions #7

Updated by Alfredo Deza over 10 years ago

  • Status changed from 4 to In Progress
Actions #8

Updated by Alfredo Deza over 10 years ago

Just found out that previously we would simply try to remove `/var/lib/ceph`, then check whether it was still there, and attempt to unmount anything that might still be lingering around.

When ceph-deploy changed to exit whenever a system call returns a non-zero exit status, that logic stopped working, which prevented the unmounting of any OSDs still mounted at that location.

The fix, then, is no longer to add a warning to that action, but to actually deal with the error (if it appears) and then unmount, just like before.
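
A minimal sketch of that restored logic (hypothetical helper; the actual change is the ceph-deploy commit referenced in comment #10): attempt the removal, and on failure unmount any lingering OSD mounts before retrying:

import os
import subprocess

def purge_data(path="/var/lib/ceph"):
    # Try the removal; on failure, unmount whatever is still mounted
    # under the osd directory and retry, instead of aborting on the
    # first non-zero exit status.
    rm = ["sudo", "rm", "-rf", "--one-file-system", "--", path]
    if subprocess.call(rm) == 0:
        return
    osd_root = os.path.join(path, "osd")
    if os.path.isdir(osd_root):
        for name in sorted(os.listdir(osd_root)):
            mount = os.path.join(osd_root, name)
            if os.path.ismount(mount):
                subprocess.call(["sudo", "umount", mount])
    subprocess.check_call(rm)  # should succeed now that the mounts are gone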

Actions #9

Updated by Alfredo Deza over 10 years ago

  • Status changed from In Progress to Fix Under Review
Actions #10

Updated by Alfredo Deza over 10 years ago

  • Status changed from Fix Under Review to Resolved

Merged into ceph-deploy's master branch with hash: 109040e
