Hi there, sorry for the delay — this was tricky to get information on, as it did not reproduce outside of our CI, so it required a bit of hackery.
I made a small patch to metadata_manager.py to capture exactly what is read from the metadata file, like so:
diff --git a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
index e6bcb718d20..1268ba49adf 100644
--- a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
+++ b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
@@ -42,6 +42,11 @@ class MetadataManager(object):
         else:
             self.config = configparser.SafeConfigParser()
 
+    def _capture(self, f, *args, **kwargs):
+        ret = f(*args, **kwargs)
+        log.info('CAPTURED %r', ret)
+        return ret
+
     def refresh(self):
         fd = None
         conf_data = StringIO()
@@ -49,7 +54,7 @@ class MetadataManager(object):
             log.debug("opening config {0}".format(self.config_path))
             fd = self.fs.open(self.config_path, os.O_RDONLY)
             while True:
-                data = self.fs.read(fd, -1, MetadataManager.MAX_IO_BYTES)
+                data = self._capture(self.fs.read, fd, -1, MetadataManager.MAX_IO_BYTES)
                 if not len(data):
                     break
                 conf_data.write(data.decode('utf-8'))
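For reference, the loop the patch instruments boils down to the following. This is a simplified standalone sketch: `FakeFS` is a hypothetical stub standing in for the CephFS binding, and the `MAX_IO_BYTES` value here is assumed for illustration — only the loop shape mirrors the real `MetadataManager.refresh()`:

```python
from io import StringIO

MAX_IO_BYTES = 8 * 1024  # assumed value for illustration only


class FakeFS:
    """Hypothetical stand-in for the CephFS binding used by MetadataManager.
    read() returns the whole file once, then b'' to signal EOF, as a
    well-behaved read should."""

    def __init__(self, content):
        self._chunks = [content, b'']

    def open(self, path, flags):
        return 0  # dummy fd

    def read(self, fd, offset, size):
        return self._chunks.pop(0)


def refresh(fs, path):
    # Same loop shape as MetadataManager.refresh() in the diff above:
    # accumulate decoded chunks until a zero-length read.
    conf_data = StringIO()
    fd = fs.open(path, 0)
    while True:
        data = fs.read(fd, -1, MAX_IO_BYTES)
        if not len(data):
            break
        conf_data.write(data.decode('utf-8'))
    return conf_data.getvalue()


META = b'[GLOBAL]\nstate = pending\n\n'
buf = refresh(FakeFS(META), b'/volumes/Park/babydino2/.meta')
```

With a correct read sequence, `buf` is simply the file content — which is why any extra chunk returned after the real EOF ends up appended verbatim to the buffer handed to configparser.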
I've attached the full log with the failure. The relevant part looks like:
2022-06-28T13:08:39.169+0000 7f27f001d700 0 log_channel(audit) log [DBG] : from='client.4391 -' entity='client.admin' cmd=[{"clone_name":"babydino2","format":"json","group_name":"Park","prefix":"fs clone cancel","vol_name":"cephfs"}]: dispatch
2022-06-28T13:08:39.169+0000 7f28057d3700 0 [volumes INFO volumes.module] Starting _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < ""
2022-06-28T13:08:39.173+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] cloning to subvolume path: /volumes/Park/babydino2
2022-06-28T13:08:39.173+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] starting clone: (cephfs, Park, babydino2)
2022-06-28T13:08:39.177+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.177+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.181+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.181+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] Delayed cloning (cephfs, Park, babydino2) -- by 0 seconds
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.205+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] wrote 203 bytes to config b'/volumes/Park/babydino2/.meta'
2022-06-28T13:08:39.205+0000 7f28057d3700 0 [volumes ERROR volumes.module] Failed _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "":
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/module.py", line 38, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 784, in _cmd_fs_clone_cancel
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 730, in clone_cancel
    self.cloner.cancel_job(volname, (clonename, groupname))
  File "/usr/share/ceph/mgr/volumes/fs/async_cloner.py", line 401, in cancel_job
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 73, in open_subvol
    subvolume.open(op_type)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v2.py", line 301, in open
    self.metadata_mgr.refresh()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 62, in refresh
    self.config.readfp(conf_data)
  File "/usr/lib64/python3.6/configparser.py", line 763, in readfp
    self.read_file(fp, source=filename)
  File "/usr/lib64/python3.6/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
    raise e
configparser.ParsingError: Source contains parsing errors: '<???>'
	[line 13]: 'a0\n'
2022-06-28T13:08:39.205+0000 7f28057d3700 -1 mgr handle_command module 'volumes' command handler threw exception: Source contains parsing errors: '<???>'
[line 13]: 'a0\n'
If you look carefully at the "CAPTURED" lines I added, you can see this entry:
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
This appears to be the cause of the failure: instead of signalling EOF, the read call returns the last 4 bytes of the file a second time, so the buffer handed to configparser ends with a stray `a0` line. Perhaps there's an issue in the underlying read method?
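As a sanity check, that stray chunk alone is enough to reproduce exactly this parsing error. The sketch below rebuilds the .meta content from the CAPTURED lines above and appends the duplicated 4-byte tail (a minimal local reproduction, not the mgr code path itself):

```python
import configparser

# Metadata file content reconstructed from the CAPTURED log lines above.
META = (
    "[GLOBAL]\n"
    "version = 2\n"
    "type = clone\n"
    "path = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\n"
    "state = pending\n"
    "\n"
    "[source]\n"
    "volume = cephfs\n"
    "group = Park\n"
    "subvolume = Jurrasic\n"
    "snapshot = dinodna0\n"
    "\n"
)

configparser.ConfigParser().read_string(META)  # the intact content parses cleanly

err = None
try:
    # Append the stray chunk -- the file's last 4 bytes returned a second
    # time -- as seen in the CAPTURED b'a0\n\n' log line.
    configparser.ConfigParser().read_string(META + "a0\n\n")
except configparser.ParsingError as e:
    err = e

print(err)
```

The ParsingError lands on line 13 with `'a0\n'`, matching the traceback above, so a duplicated tail read fully explains the failure.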
Anyway, I hope this helps. Let me know if you need more debugging from our side; I'm now set up to test patched versions of the mgr in the go-ceph CI and gather the resulting mgr logs if needed.