Bug #6807

Debian Wheezy Teuthology Ceph-deploy run failed.

Added by Anonymous over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Alfredo Deza
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2013-11-19_01:10:01-ceph-deploy-next-testing-basic-vps/108339 needs to be investigated. The run reports coredumps, along with: RuntimeError: Failed to execute command: rm rf --one-file-system - /var/lib/ceph

UPDATE
Something that is not directly related to ceph-deploy is leaving OSDs mounted, which is why ceph-deploy cannot remove the contents.

A better error message should be put in place so that when the error is triggered it is clear that OSDs might still be present.

Actions #1

Updated by Anonymous over 10 years ago

That error should read:

RuntimeError: Failed to execute command: rm -rf --one-file-system -- /var/lib/ceph
Actions #2

Updated by Anonymous over 10 years ago

  • Assignee set to Alfredo Deza
Actions #3

Updated by Alfredo Deza over 10 years ago

  • Status changed from New to 12

Good catch, I think that this is the culprit:

[INFO  ] Running command: sudo rm -rf --one-file-system -- /var/lib/ceph
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-3', since it's on a different device
[ERROR ] rm: skipping `/var/lib/ceph/osd/ceph-2', since it's on a different device

So for ceph-deploy, this is something that should complete normally as long as the directories it is removing are on the same file system.

I'm not sure about the historical reasons why these commands were run with `--one-file-system`, but removing that constraint would definitely fix this problem.
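
For context, the skipped paths are mount points: each OSD data directory lives on its own device, which is exactly the case `--one-file-system` guards against. A minimal Python sketch (not ceph-deploy code; the helper name and default path are assumptions) of how to detect those directories:

import os

def mounted_osd_dirs(osd_root="/var/lib/ceph/osd"):
    # Hypothetical helper: list OSD data directories that are still
    # mounted, i.e. sit on a different device than /var/lib/ceph,
    # which is what makes `rm --one-file-system` skip them.
    if not os.path.isdir(osd_root):
        return []
    return [os.path.join(osd_root, name)
            for name in sorted(os.listdir(osd_root))
            if os.path.ismount(os.path.join(osd_root, name))]

# On the failing run this would return something like:
# ['/var/lib/ceph/osd/ceph-2', '/var/lib/ceph/osd/ceph-3']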

Actions #4

Updated by Alfredo Deza over 10 years ago

  • Status changed from 12 to 4
  • Priority changed from Normal to High

It looks like the reason we were enforcing the single file system was that we might still have OSDs mounted (hence the different file system) and we might not want to remove them.

The current proposal is to emit a big WARNING message when we are unable to remove the contents for this reason, rather than erroring out.

Before implementing this I would really like some confirmation that this is indeed the correct path.
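
To make that concrete, a rough sketch (hypothetical function, not the actual ceph-deploy code) of what warning instead of erroring out could look like:

import logging
import subprocess

LOG = logging.getLogger("ceph_deploy.purge")

def purge_data(path="/var/lib/ceph"):
    # Hypothetical sketch: attempt the removal, and if rm refuses
    # because something is still mounted underneath, emit a loud
    # warning instead of raising on the non-zero exit status.
    cmd = ["sudo", "rm", "-rf", "--one-file-system", "--", path]
    if subprocess.call(cmd) != 0:
        LOG.warning("could not remove all contents of %s", path)
        LOG.warning("OSDs may still be mounted there")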

Actions #5

Updated by Greg Farnum over 10 years ago

It sounds like there was an earlier problem with the test or a different failure — why is it trying to delete the ceph directories (purge, presumably) while it still has OSDs mounted on the filesystem? We don't want to "succeed" at removing the directory if there was stuff we couldn't get rid of...

Actions #6

Updated by Alfredo Deza over 10 years ago

  • Description updated (diff)
Actions #7

Updated by Alfredo Deza over 10 years ago

  • Status changed from 4 to In Progress
Actions #8

Updated by Alfredo Deza over 10 years ago

Just found out that previously we would simply try to remove `/var/lib/ceph`, then check whether it was still there, and attempt to unmount anything that might still be lingering around.

When ceph-deploy changed to exit whenever a system call returns a non-zero exit status, that logic stopped working, which prevented the unmounting of any OSDs still mounted at that location.

The fix, then, is no longer to add a warning to that action, but to actually deal with the error (if it appears) and then unmount, just like before.
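
A minimal sketch of that restored logic (hypothetical helper; the actual change is the ceph-deploy commit referenced in comment #10): attempt the removal, and on failure unmount any lingering OSD mounts before retrying:

import os
import subprocess

def purge_data(path="/var/lib/ceph"):
    # Try the removal; on failure, unmount whatever is still mounted
    # under the osd directory and retry, instead of aborting on the
    # first non-zero exit status.
    rm = ["sudo", "rm", "-rf", "--one-file-system", "--", path]
    if subprocess.call(rm) == 0:
        return
    osd_root = os.path.join(path, "osd")
    if os.path.isdir(osd_root):
        for name in sorted(os.listdir(osd_root)):
            mount = os.path.join(osd_root, name)
            if os.path.ismount(mount):
                subprocess.call(["sudo", "umount", mount])
    subprocess.check_call(rm)  # should succeed now that the mounts are gone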

Actions #9

Updated by Alfredo Deza over 10 years ago

  • Status changed from In Progress to Fix Under Review
Actions #10

Updated by Alfredo Deza over 10 years ago

  • Status changed from Fix Under Review to Resolved

Merged into ceph-deploy's master branch with hash: 109040e
