Bug #6701: ceph-deploy osd prepare on directory path fails: OSError: [Errno 18] Invalid cross-device link - devops - Ceph


Bug #6701

closed

ceph-deploy osd prepare on directory path fails: OSError: [Errno 18] Invalid cross-device link

Added by Mark Kirkwood over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Normal
Assignee: Alfredo Deza
Source: other
Severity: 3 - minor

Description

Ceph version is 0.71-234-g1f02d00, built from source on Ubuntu 13.10.

The desired setup is osd data in /data2/cephdata and the journal on /dev/sda9.

$ sudo ceph-deploy -v osd prepare zmori:/data2/cephdata:/dev/sda9
[ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy -v osd prepare zmori:/data2/cephdata:/dev/sda9
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks zmori:/data2/cephdata:/dev/sda9
[zmori][DEBUG ] connected to host: zmori
[zmori][DEBUG ] detect platform information from remote host
[zmori][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 13.10 saucy
[ceph_deploy.osd][DEBUG ] Deploying osd to zmori
[zmori][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][ERROR ] OSError: [Errno 18] Invalid cross-device link
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

The problem appears to be the data device path, as attempting to prepare an osd with just /data2/cephdata gives the same error.

I'm using ceph-deploy from git. A checkout from 2013-10-03 NZST does not have this issue; the 'Invalid cross-device link' error started appearing around 2013-10-17 NZST and is present in current master (2013-11-01 NZST).


Updated by Mark Kirkwood over 10 years ago

I omitted the probably significant fact that /data2 is a partition on a different disk from /var.


Updated by Mark Kirkwood over 10 years ago

The particular issue is caused by os.rename in ceph_deploy/hosts/remotes.py, line 54. Replacing that with shutil.move seems to be the usual solution; however, this brings to light another issue:

[zmori][INFO ] Running command: sudo udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host zmori disk /data2/cephdata/ journal None activate False
[zmori][INFO ] Running command: sudo ceph-disk-prepare --fs-type xfs --cluster ceph -- /data2/cephdata/
[zmori][ERROR ] ceph-disk: Error: getting cluster uuid from configuration failed
[zmori][ERROR ] Traceback (most recent call last):
[zmori][ERROR ] File "/home/markir/develop/python/ceph-deploy/ceph_deploy/lib/remoto/process.py", line 68, in run
[zmori][ERROR ] reporting(conn, result, timeout)
[zmori][ERROR ] File "/home/markir/develop/python/ceph-deploy/ceph_deploy/lib/remoto/log.py", line 13, in reporting
[zmori][ERROR ] received = result.receive(timeout)
[zmori][ERROR ] File "/home/markir/develop/python/ceph-deploy/ceph_deploy/lib/remoto/lib/execnet/gateway_base.py", line 455, in receive
[zmori][ERROR ] raise self._getremoteerror() or EOFError()
[zmori][ERROR ] RemoteError: Traceback (most recent call last):
[zmori][ERROR ] File "/home/markir/develop/python/ceph-deploy/ceph_deploy/lib/remoto/lib/execnet/gateway_base.py", line 806, in executetask
[zmori][ERROR ] function(channel, **kwargs)
[zmori][ERROR ] File "", line 35, in _remote_run
[zmori][ERROR ] RuntimeError: command returned non-zero exit status: 1
[zmori][ERROR ]
[zmori][ERROR ]
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /data2/cephdata/
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

Looks like non-whole device setups are being broken here.


Updated by Mark Kirkwood over 10 years ago

Further on this (after the os.rename -> shutil.move change), the next problem is:

[ERROR] ceph-disk: Error: getting cluster uuid from configuration failed

This is because the config file has been reduced to zero bytes. For example, after mon create:

$ ls -l /etc/ceph/ceph.conf 
-rw-r--r-- 1 root root 187 Nov  3 13:22 /etc/ceph/ceph.conf

after attempting osd prepare:

$ ls -l /etc/ceph/ceph.conf
-rw------- 1 root root 0 Nov  3 13:23 /etc/ceph/ceph.conf

I'll see if I can figure out why...


Updated by Mark Kirkwood over 10 years ago

I'm possibly causing the issue using shutil.move (can't see how mind you)...


Updated by Mark Kirkwood over 10 years ago

I now know why the original error is happening. My previous musings were not really on the mark (as it were):

Consider the df output on the workstation:

Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda3        3871400   320504   3334528   9% /
/dev/sda5       15616412    40648  14759432   1% /tmp

The ceph-deploy code circa version 1.3 calls hosts/remotes.py:write_conf, which:
- checks whether /etc/ceph/ceph.conf exists (it does)
- checks whether it differs from ceph-deploy's conf (it does, it seems [1])
- creates a temp file, writes the conf to it, and renames the result to /etc/ceph/ceph.conf

That last step fails because os.rename cannot rename from /tmp to / (where /etc lives), i.e. across filesystems. As I said above, the usual fix for that is shutil.move, but I am having trouble getting that to work (I'll investigate why).
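One way to keep os.rename while avoiding the cross-filesystem case is to create the temp file in the same directory as the target. The following is only a sketch under that assumption; `write_file_atomically` and the `.tmp-conf-` prefix are hypothetical names, and this is not the code in ceph-deploy's write_conf:

```python
import os
import tempfile


def write_file_atomically(path, content):
    """Write content to path via a temp file in the same directory.

    Hypothetical helper, not ceph-deploy's write_conf. Because the temp
    file lives next to the target, os.rename stays on one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix='.tmp-conf-')
    try:
        with os.fdopen(fd, 'w') as tmp_file:
            tmp_file.write(content)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the data to disk before renaming
        # On POSIX, rename replaces an existing target in a single step.
        os.rename(tmp_name, path)
    except Exception:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Note that mkstemp creates the file with mode 0600, so a real fix along these lines would probably also need to set the permissions expected for /etc/ceph/ceph.conf.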

As a minor aside, I'm wondering why, instead of the temp file stuff, we don't just do:

*** remotes.py.orig    2013-11-03 14:25:19.589216186 +1300
--- remotes.py    2013-11-03 14:25:07.933515056 +1300
***************
*** 50,56 ****
              if old != conf and not overwrite:
                  raise RuntimeError(err_msg)
          tmp_file.write(conf)
!         os.rename(tmp_file.name, path)
          return
      if os.path.exists('/etc/ceph'):
          with open(path, 'w') as f:
--- 50,58 ----
              if old != conf and not overwrite:
                  raise RuntimeError(err_msg)
          tmp_file.write(conf)
!         with open(path, 'w') as fw:
!             fw.truncate()
!             fw.write(conf)
          return
      if os.path.exists('/etc/ceph'):
          with open(path, 'w') as f:

[1] I'm a little puzzled why it is, since ceph-deploy is the only thing touching it at this point...


Updated by Mark Kirkwood over 10 years ago

Figured out what the issue with shutil.move was: the temp file needs to be closed before moving. That is not an issue with os.rename, as I think it is using a hard link under the covers. This presumably means shutil.move is not atomic... sigh. But here's the working patch anyway:

*** remotes.py.orig     2013-11-03 14:25:19.589216186 +1300
--- remotes.py  2013-11-03 15:37:52.654655203 +1300
***************
*** 1,6 ****
--- 1,7 ----
  import errno
  import socket
  import os
+ import shutil
  import tempfile
  import platform

***************
*** 50,56 ****
              if old != conf and not overwrite:
                  raise RuntimeError(err_msg)
          tmp_file.write(conf)
!         os.rename(tmp_file.name, path)
          return
      if os.path.exists('/etc/ceph'):
          with open(path, 'w') as f:
--- 51,58 ----
              if old != conf and not overwrite:
                  raise RuntimeError(err_msg)
          tmp_file.write(conf)
!         tmp_file.close()
!         shutil.move(tmp_file.name, path)
          return
      if os.path.exists('/etc/ceph'):
          with open(path, 'w') as f:
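A likely explanation for the zero-byte ceph.conf seen earlier: when source and destination are on different filesystems, shutil.move falls back to copying the file by name and then removing the source, so anything still sitting in the temp file's write buffer never reaches the copy. The sketch below illustrates that effect with shutil.copy2 inside a single directory; `copy_temp_file` is a hypothetical helper written for this illustration, not part of ceph-deploy:

```python
import os
import shutil
import tempfile


def copy_temp_file(content, flush_first):
    """Write content to a temp file, copy it by name, return the copy's size."""
    workdir = tempfile.mkdtemp()
    dest = os.path.join(workdir, 'copy.conf')
    tmp_file = tempfile.NamedTemporaryFile('w', dir=workdir, delete=False)
    try:
        tmp_file.write(content)  # a small write stays in the userspace buffer
        if flush_first:
            tmp_file.flush()
        # Copying by name is what shutil.move falls back to when source and
        # destination are on different filesystems.
        shutil.copy2(tmp_file.name, dest)
        return os.path.getsize(dest)
    finally:
        tmp_file.close()
        shutil.rmtree(workdir)
```

Calling `copy_temp_file('[global]\n', flush_first=False)` should report an empty copy, while `flush_first=True` should report the full length. With os.rename the open file object still refers to the same file after the rename, so the data written at close time ends up in the renamed file, which would explain why the problem only showed up after switching to shutil.move.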


Updated by Ian Colle over 10 years ago

Status changed from New to Fix Under Review


Updated by Ian Colle over 10 years ago

Assignee set to Alfredo Deza


Updated by Alfredo Deza over 10 years ago

Thanks for the ticket and the resolution Mark!

Would you mind sending a pull request to https://github.com/ceph/ceph-deploy ? That way your contribution is saved there :)

Make sure you sign the commit with `-s` as well!


#10

Updated by Mark Kirkwood over 10 years ago

Done.


#11

Updated by Alfredo Deza over 10 years ago

Status changed from Fix Under Review to Resolved

Pull Request opened: https://github.com/ceph/ceph-deploy/pull/126

And merged into ceph-deploy's master branch with hash: aeaaf11


#12

Updated by Mark Kirkwood over 10 years ago

Managed to provoke this again, this time creating a keyring for an osd on a host that is not a monitor. The triggering factor seems to be /tmp being a separate filesystem. Experimental patch here that seems to fix it (shutil.move again):

https://github.com/markir9/ceph-deploy/commit/826433886a4f1215e1dcd07d57c13f43a2b12153
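To check whether a given host is exposed to this failure, one can compare the device IDs of the two directories involved; os.rename raises errno 18 (EXDEV on Linux) when they differ. A minimal sketch, where `same_filesystem` is a hypothetical helper and not something ceph-deploy provides:

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths live on the same mounted filesystem."""
    # st_dev identifies the device a path resides on; os.rename only
    # works when source and destination share it.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, `same_filesystem('/tmp', '/etc/ceph')` would be expected to return False on the setup described in this ticket, where /tmp is its own partition.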


#13

Updated by Alfredo Deza over 10 years ago

There was a PR addressing the problem by using shutil.move, and I just opened another one to fix the missing `close()` call:

https://github.com/ceph/ceph-deploy/pull/137/files


#14

Updated by Alfredo Deza over 10 years ago

Status changed from Resolved to Fix Under Review


#15

Updated by Mark Kirkwood over 10 years ago

This will be fine for temporary files opened with 'delete=False'. If we start using delete=True, they may be deleted on close, before we can copy them.

I did wonder if simply flushing the temp file before moving it might work, but only thought of that after doing the close + move patch!
