Bug #55583
Intermittent ParsingError failure in mgr/volumes module during "clone cancel"
Status: Closed
% Done: 100%
Description
This issue is a bit difficult to explain, so please bear with me.
On quincy, but not on octopus or pacific, a test runs that attempts to cancel a clone. Sometimes when this test runs it fails with a Python traceback in the mgr response. This traceback indicates a configparser.ParsingError was raised.
rados: ret=-22, Invalid argument: "Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1701, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 409, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 38, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 636, in _cmd_fs_clone_cancel
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 582, in clone_cancel
    self.cloner.cancel_job(volname, (clonename, groupname))
  File "/usr/share/ceph/mgr/volumes/fs/async_cloner.py", line 389, in cancel_job
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 72, in open_subvol
    subvolume = loaded_subvolumes.get_subvolume_object(mgr, fs, vol_spec, group, subvolname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py", line 95, in get_subvolume_object
    subvolume.discover()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_base.py", line 319, in discover
    self.metadata_mgr.refresh()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 52, in refresh
    self.config.readfp(conf_data)
  File "/usr/lib64/python3.6/configparser.py", line 763, in readfp
    self.read_file(fp, source=filename)
  File "/usr/lib64/python3.6/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
    raise e
configparser.ParsingError: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'
"
(I've added newlines back to the traceback to make it more readable. It's possible I missed some)
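As background on the error class itself (a stdlib behavior note, not Ceph code): configparser raises ParsingError when it encounters a line that is neither a section header, a 'key = value' option, nor an indented continuation line. A minimal sketch, reusing the stray 'a0' token from the traceback above:

```python
import configparser
from io import StringIO

# A well-formed, .meta-style INI document parses cleanly.
META = (
    "[GLOBAL]\n"
    "version = 2\n"
    "type = clone\n"
    "state = pending\n"
)
configparser.ConfigParser().read_file(StringIO(META))

# A stray token such as 'a0' -- no section brackets, no '=' or ':'
# delimiter, and no leading whitespace (so not a continuation) -- makes
# the parser raise ParsingError, as seen in the mgr traceback.
try:
    configparser.ConfigParser().read_file(StringIO(META + "a0\n"))
    raised = False
except configparser.ParsingError:
    raised = True
print("raised ParsingError:", raised)  # prints: raised ParsingError: True
```

So the traceback by itself only says the .meta file content handed to configparser contained a garbage line; it does not say how that line got there.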
The automated test is part of the go-ceph project. We're tracking the issue on our side here:
https://github.com/ceph/go-ceph/issues/679
Since this one has been hard for me to track down, I'm filing the issue with the express request for some additional hints on how to debug this further.
One additional challenge is that I cannot reproduce this on my desktop system, but it does occur regularly in our CI. We're blocked on officially supporting quincy until this test, along with some others, is better understood. Thanks!
Files
Updated by John Mulligan almost 2 years ago
This appears to be affecting ceph pacific now with the v16.2.9 release.
Updated by Venky Shankar almost 2 years ago
- Assignee set to Kotresh Hiremath Ravishankar
- Target version set to v18.0.0
- Backport set to quincy, pacific
- Severity changed from 3 - minor to 2 - major
Kotresh, please take a look.
Updated by Kotresh Hiremath Ravishankar almost 2 years ago
Hi John,
This looks like a race. If this reproduces consistently with the test suite, could you please share the config file located at '/mnt/volumes/<groupname>/<subvolumename>/.meta'?
Thanks,
Kotresh
Updated by John Mulligan almost 2 years ago
We're looking into getting that for you, thanks.
Updated by Kotresh Hiremath Ravishankar almost 2 years ago
Hi John,
Any update on this?
Updated by John Mulligan almost 2 years ago
- File mgr.log.gz mgr.log.gz added
Hi there, sorry for the delay; this was very tricky to get info on, as it did not reproduce outside of our CI, so it required a bit of hackery.
I made a small patch to the metadata_manager.py to capture the state of the file, like so:
diff --git a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
index e6bcb718d20..1268ba49adf 100644
--- a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
+++ b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
@@ -42,6 +42,11 @@ class MetadataManager(object):
         else:
             self.config = configparser.SafeConfigParser()
 
+    def _capture(self, f, *args, **kwargs):
+        ret = f(*args, **kwargs)
+        log.info('CAPTURED %r', ret)
+        return ret
+
     def refresh(self):
         fd = None
         conf_data = StringIO()
@@ -49,7 +54,7 @@ class MetadataManager(object):
         log.debug("opening config {0}".format(self.config_path))
         fd = self.fs.open(self.config_path, os.O_RDONLY)
         while True:
-            data = self.fs.read(fd, -1, MetadataManager.MAX_IO_BYTES)
+            data = self._capture(self.fs.read, fd, -1, MetadataManager.MAX_IO_BYTES)
             if not len(data):
                 break
             conf_data.write(data.decode('utf-8'))
I've attached the full log with the failure. The relevant part looks like:
2022-06-28T13:08:39.169+0000 7f27f001d700 0 log_channel(audit) log [DBG] : from='client.4391 -' entity='client.admin' cmd=[{"clone_name":"babydino2","format":"json","group_name":"Park","prefix":"fs clone cancel","vol_name":"cephfs"}]: dispatch
2022-06-28T13:08:39.169+0000 7f28057d3700 0 [volumes INFO volumes.module] Starting _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < ""
2022-06-28T13:08:39.173+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] cloning to subvolume path: /volumes/Park/babydino2
2022-06-28T13:08:39.173+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] starting clone: (cephfs, Park, babydino2)
2022-06-28T13:08:39.177+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.177+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.181+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.181+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700 0 [volumes INFO volumes.fs.async_cloner] Delayed cloning (cephfs, Park, babydino2) -- by 0 seconds
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.205+0000 7f27c2f70700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] wrote 203 bytes to config b'/volumes/Park/babydino2/.meta'
2022-06-28T13:08:39.205+0000 7f28057d3700 0 [volumes ERROR volumes.module] Failed _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "": Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/module.py", line 38, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 784, in _cmd_fs_clone_cancel
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 730, in clone_cancel
    self.cloner.cancel_job(volname, (clonename, groupname))
  File "/usr/share/ceph/mgr/volumes/fs/async_cloner.py", line 401, in cancel_job
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 73, in open_subvol
    subvolume.open(op_type)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v2.py", line 301, in open
    self.metadata_mgr.refresh()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 62, in refresh
    self.config.readfp(conf_data)
  File "/usr/lib64/python3.6/configparser.py", line 763, in readfp
    self.read_file(fp, source=filename)
  File "/usr/lib64/python3.6/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
    raise e
configparser.ParsingError: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'
2022-06-28T13:08:39.205+0000 7f28057d3700 -1 mgr handle_command module 'volumes' command handler threw exception: Source contains parsing errors: '<???>' [line 13]: 'a0\n'
If you look carefully at the "CAPTURED" lines I added you can see a line:
2022-06-28T13:08:39.201+0000 7f28057d3700 0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
This appears to be the cause of the problem. I think the read method is ending up returning the last 4 bytes of the file a second time. Perhaps there's an issue with the underlying read method?
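To make the suspected mechanism concrete, here is a hypothetical simulation using plain local files rather than libcephfs, so it only illustrates the shape of such a race, not the actual mgr code path: a reader's file offset survives an in-place rewrite by a concurrent writer, so a "read until EOF" loop can pick up a few tail bytes of the new, slightly longer content instead of the empty read it expected.

```python
import os
import tempfile

# Simulated race (local files, NOT libcephfs). The reader drains the
# file, then a "concurrent" writer truncates and rewrites the same file
# in place with slightly longer content. The reader's saved offset now
# points just before the new EOF, so its next read returns stale-looking
# tail bytes instead of b''.
tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
with open(path, "wb") as f:
    f.write(b"[GLOBAL]\nstate = pending\n\n")    # 26 bytes

fd = os.open(path, os.O_RDONLY)
first = os.read(fd, 4096)    # reader believes it has the whole file; offset is now 26
with open(path, "wb") as f:  # writer rewrites in place (same inode)
    f.write(b"[GLOBAL]\nstate = canceled\n\n")   # 27 bytes
second = os.read(fd, 4096)   # offset 26 < new EOF, so this is b'\n', not b''
os.close(fd)
os.unlink(path)
print(repr(first), repr(second))
```

Notably, the captured .meta content above ends in 'dinodna0\n\n', whose last four bytes are exactly b'a0\n\n', and the log shows a 203-byte write to the same .meta file right around the failing reads, which fits this picture of rereading the tail of a file that grew between two reads.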
Anyway, I hope this helps. Let me know if you need more debugging from our side as I'm now set up to test patched versions of the mgr in the go-ceph CI and gather the resulting mgr logs if needed.
Updated by John Mulligan almost 2 years ago
I think I have a fix for this issue. I'm working on verifying it for go-ceph. If that all goes well, I'll be putting together a PR in the next few days.
Updated by Kotresh Hiremath Ravishankar almost 2 years ago
This is really interesting. Waiting for your PR to understand in what scenario this can happen.
Updated by John Mulligan almost 2 years ago
Draft PR: https://github.com/ceph/ceph/pull/47067
Updated by Venky Shankar almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 47067
Updated by Venky Shankar over 1 year ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 1 year ago
- Copied to Backport #57112: quincy: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" added
Updated by Backport Bot over 1 year ago
- Copied to Backport #57113: pacific: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" added
Updated by Konstantin Shalygin 9 months ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100