Bug #46360
mgr/volumes: fs subvolume clones stuck in progress when libcephfs hits certain errors
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs.pyx, mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
During `fs subvolume clone`, libcephfs hit a "Disk quota exceeded" error, which left the subvolume clone stuck in progress instead of moving it to the failed state. I saw the following traceback in the mgr log:
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/fs_util.py", line 117, in copy_file
    written += fs.write(dst_fd, data[written:], -1)
  File "cephfs.pyx", line 1463, in cephfs.LibCephFS.write
cephfs.Error: error in write: Disk quota exceeded [Errno 122]

Traceback (most recent call last):
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_job.py", line 44, in run
    self.async_job.execute_job(vol_job[0], vol_job[1], should_cancel=lambda: thread_id.should_cancel())
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 309, in execute_job
    clone(self.vc, volname, job[0].decode('utf-8'), job[1].decode('utf-8'), self.state_table, should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 222, in clone
    start_clone_sm(volume_client, volname, index, groupname, subvolname, state_table, should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 202, in start_clone_sm
    (next_state, finished) = handler(volume_client, volname, index, groupname, subvolname, should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 159, in handle_clone_in_progress
    do_clone(volume_client, volname, groupname, subvolname, should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 155, in do_clone
    bulk_copy(fs_handle, src_path, dst_path, should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 144, in bulk_copy
    cptree(source_path, dst_path)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/async_cloner.py", line 129, in cptree
    copy_file(fs_handle, d_full_src, d_full_dst, mo, cancel_check=should_cancel)
  File "/home/rraja/git/ceph/src/pybind/mgr/volumes/fs/fs_util.py", line 120, in copy_file
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'
Digging further, I found that when a libcephfs return code has no entry in the errno_to_exception table in cephfs.pyx, cephfs.pyx raises an exception with different arguments than it normally does. See make_ex in cephfs.pyx:
cdef make_ex(ret, msg):
    """
    Translate a librados return code into an exception.
    """
    ret = abs(ret)
    if ret in errno_to_exception:
        return errno_to_exception[ret](ret, msg)
    else:
        return Error(msg + ': {} [Errno {:d}]'.format(os.strerror(ret), ret))
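So it sometimes raises cephfs.Error(ret, msg) and sometimes cephfs.Error(msg). mgr/volumes only handles the cephfs.Error(ret, msg) form correctly: with cephfs.Error(msg), e.args[0] is the message string, so -e.args[0] in copy_file raises the TypeError seen in the traceback, and the clone never reaches the failed state.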
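The mismatch can be reproduced without a Ceph cluster. The following is a minimal sketch, not the actual cephfs.pyx or mgr/volumes code: `Error`, `ObjectNotFound`, and the one-entry `errno_to_exception` table are stand-ins, and `make_ex_consistent` and `to_volume_error` are hypothetical names introduced only for illustration. It shows how the quoted make_ex produces two different `args` shapes, and one possible way to make them uniform. It is not necessarily what the fix in the pull request does.

```python
import errno
import os


class Error(Exception):
    """Stand-in for cephfs.Error (illustration only)."""


class ObjectNotFound(Error):
    """Stand-in for an exception class that has an errno mapping."""


# Stand-in for the errno_to_exception table; assumed to have no entry
# for errno 122 (EDQUOT on Linux), as in the failure reported here.
errno_to_exception = {errno.ENOENT: ObjectNotFound}


def make_ex_original(ret, msg):
    """Mirrors the quoted make_ex: unmapped codes get a single string arg."""
    ret = abs(ret)
    if ret in errno_to_exception:
        return errno_to_exception[ret](ret, msg)
    return Error(msg + ': {} [Errno {:d}]'.format(os.strerror(ret), ret))


def make_ex_consistent(ret, msg):
    """Hypothetical variant: every exception carries (errno, message)."""
    ret = abs(ret)
    exc_class = errno_to_exception.get(ret, Error)
    return exc_class(ret, msg)


def to_volume_error(e):
    """Mimics the caller's conversion, `-e.args[0], e.args[1]`."""
    return (-e.args[0], e.args[1])


if __name__ == "__main__":
    # Mapped errno: args are (errno, message), so the conversion works.
    print(to_volume_error(make_ex_original(-errno.ENOENT, "error in stat")))

    # Unmapped errno: args is a single string, so unary minus fails.
    try:
        to_volume_error(make_ex_original(-122, "error in write"))
    except TypeError as exc:
        print("TypeError:", exc)

    # With uniform args the same conversion succeeds.
    print(to_volume_error(make_ex_consistent(-122, "error in write")))
```

An alternative would be to leave make_ex alone and have the caller check the shape of `e.args` before negating it. Making the exception arguments uniform at the source avoids repeating that check in every caller.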
Related issues
History
#1 Updated by Ramana Raja over 2 years ago
- Subject changed from mgr/volumes: fs subvolume clones stuck in progress when certain errors are raised by licephfs to mgr/volumes: fs subvolume clones stuck in progress when libcephfs hits certain errors
- Description updated (diff)
#2 Updated by Ramana Raja over 2 years ago
- Pull request ID set to 35934
#3 Updated by Patrick Donnelly over 2 years ago
- Status changed from New to Fix Under Review
- Target version set to v16.0.0
- Backport set to octopus,nautilus
#4 Updated by Patrick Donnelly over 2 years ago
- Status changed from Fix Under Review to Pending Backport
#5 Updated by Nathan Cutler over 2 years ago
- Copied to Backport #46463: octopus: mgr/volumes: fs subvolume clones stuck in progress when libcephfs hits certain errors added
#6 Updated by Nathan Cutler over 2 years ago
- Copied to Backport #46464: nautilus: mgr/volumes: fs subvolume clones stuck in progress when libcephfs hits certain errors added
#7 Updated by Patrick Donnelly over 2 years ago
- Duplicated by Bug #47798: pybind/mgr/volumes: TypeError: bad operand type for unary -: 'str' for errno ETIMEDOUT added
#8 Updated by Nathan Cutler over 2 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".