Project

General

Profile

Bug #55583

Intermittent ParsingError failure in mgr/volumes module during "clone cancel"

Added by John Mulligan 7 months ago. Updated 4 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Administration/Usability
Target version:
% Done:

0%

Source:
Tags:
backport_processed
Backport:
quincy, pacific
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue is a bit difficult to explain so please bear with me.

On quincy, but not on octopus or pacific, a test runs that attempt to cancel a clone. Sometimes when this test runs it fails with an python traceback in the mgr response. This traceback indicates a configparser.ParsingError was raised.

rados: ret=-22, Invalid argument: "Traceback (most recent call last):\n
  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1701, in _handle_command\n
    return self.handle_command(inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 409, in handle_command\n
    return handler(inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 38, in wrap\n
    return f(self, inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 636, in _cmd_fs_clone_cancel\n
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))\n
  File \"/usr/share/ceph/mgr/volumes/fs/volume.py\", line 582, in clone_cancel\n
    self.cloner.cancel_job(volname, (clonename, groupname))\n
  File \"/usr/share/ceph/mgr/volumes/fs/async_cloner.py\", line 389, in cancel_job\n
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:\n
  File \"/usr/lib64/python3.6/contextlib.py\", line 81, in __enter__\n
    return next(self.gen)\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py\", line 72, in open_subvol\n
    subvolume = loaded_subvolumes.get_subvolume_object(mgr, fs, vol_spec, group, subvolname)\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py\", line 95, in get_subvolume_object\n
    subvolume.discover()\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_base.py\", line 319, in discover\n
    self.metadata_mgr.refresh()\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py\", line 52, in refresh\n
    self.config.readfp(conf_data)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 763, in readfp\n
    self.read_file(fp, source=filename)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 718, in read_file\n
    self._read(f, source)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 1111, in _read\n
    raise e\nconfigparser.ParsingError: Source contains parsing errors: '<???>'\n\t[line 13]: 'a0\\n'\n" 

(I've added newlines back to the traceback to make it more readable. It's possible I missed some)

The automated test is part of the go-ceph project. We're tracking the issue on our side here:
https://github.com/ceph/go-ceph/issues/679

Since this one has been hard for me to track down I'm filing the issue with the express request that I can get some additional hints on how to debug this further.

One additional challenge is that I can not reproduce this on my desktop system, but it does occur regularly in our CI. We're blocked officially supporting quincy until this test, as well as some others, are better understood. Thanks!

mgr.log.gz (17.4 KB) John Mulligan, 06/28/2022 01:25 PM


Related issues

Copied to CephFS - Backport #57112: quincy: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" In Progress
Copied to CephFS - Backport #57113: pacific: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" In Progress

History

#1 Updated by John Mulligan 7 months ago

This appears to be affecting ceph pacific now with the v16.2.9 release.

#2 Updated by Venky Shankar 7 months ago

  • Assignee set to Kotresh Hiremath Ravishankar
  • Target version set to v18.0.0
  • Backport set to quincy, pacific
  • Severity changed from 3 - minor to 2 - major

Kotresh, please take a look.

#3 Updated by Kotresh Hiremath Ravishankar 6 months ago

Hi John,

This looks like a race. If this reproduces consistently with the test suite, could you please share the config file located at '/mnt/volumes/<groupname>/<subvolumename>/.meta' ?

Thanks,
Kotresh

#4 Updated by John Mulligan 6 months ago

We're looking into getting that for you, thanks.

#5 Updated by Kotresh Hiremath Ravishankar 6 months ago

Hi John,

Any update on this ?

#6 Updated by John Mulligan 5 months ago

Hi there, sorry for delays, this was very tricky to get info on as it did not reproduce outside of our CI. So it required a bit of hackery.

I made a small patch to the metadata_manager.py to capture the state of the file, like so:

diff --git a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
index e6bcb718d20..1268ba49adf 100644
--- a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
+++ b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
@@ -42,6 +42,11 @@ class MetadataManager(object):
         else:
             self.config = configparser.SafeConfigParser()

+    def _capture(self, f, *args, **kwargs):
+        ret = f(*args, **kwargs)
+        log.info('CAPTURED %r', ret)
+        return ret
+
     def refresh(self):
         fd = None
         conf_data = StringIO()
@@ -49,7 +54,7 @@ class MetadataManager(object):
             log.debug("opening config {0}".format(self.config_path))
             fd = self.fs.open(self.config_path, os.O_RDONLY)
             while True:
-                data = self.fs.read(fd, -1, MetadataManager.MAX_IO_BYTES)
+                data = self._capture(self.fs.read, fd, -1, MetadataManager.MAX_IO_BYTES)
                 if not len(data):
                     break
                 conf_data.write(data.decode('utf-8'))

I've attached the full log with the failure. The relevant part looks like:

2022-06-28T13:08:39.169+0000 7f27f001d700  0 log_channel(audit) log [DBG] : from='client.4391 -' entity='client.admin' cmd=[{"clone_name":"babydino2","format":"json","group_name":"Park","prefix":"fs clone cancel","vol_name":"cephfs"}]: dispatch
2022-06-28T13:08:39.169+0000 7f28057d3700  0 [volumes INFO volumes.module] Starting _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "" 
2022-06-28T13:08:39.173+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] cloning to subvolume path: /volumes/Park/babydino2
2022-06-28T13:08:39.173+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] starting clone: (cephfs, Park, babydino2)
2022-06-28T13:08:39.177+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.177+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.181+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.181+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] Delayed cloning (cephfs, Park, babydino2) -- by 0 seconds
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.205+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] wrote 203 bytes to config b'/volumes/Park/babydino2/.meta'
2022-06-28T13:08:39.205+0000 7f28057d3700  0 [volumes ERROR volumes.module] Failed _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "":
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/module.py", line 38, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 784, in _cmd_fs_clone_cancel
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 730, in clone_cancel
    self.cloner.cancel_job(volname, (clonename, groupname))
  File "/usr/share/ceph/mgr/volumes/fs/async_cloner.py", line 401, in cancel_job
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 73, in open_subvol
    subvolume.open(op_type)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v2.py", line 301, in open
    self.metadata_mgr.refresh()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 62, in refresh
    self.config.readfp(conf_data)
  File "/usr/lib64/python3.6/configparser.py", line 763, in readfp
    self.read_file(fp, source=filename)
  File "/usr/lib64/python3.6/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
    raise e
configparser.ParsingError: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'

2022-06-28T13:08:39.205+0000 7f28057d3700 -1 mgr handle_command module 'volumes' command handler threw exception: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'

If you look carefully at the "CAPTURED" lines I added you can see a line:

2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'

This appears to be the cause of the problem. I think the read method is ending up returning the last 4 bytes of the file a second time. Perhaps there's an issue with the underlying read method?

Anyway, I hope this helps. Let me know if you need more debugging from our side as I'm now set up to test patched versions of the mgr in the go-ceph CI and gather the resulting mgr logs if needed.

#7 Updated by John Mulligan 5 months ago

I think I have a fix for this issue. I'm working on verifying it for go-ceph. If that all goes well I be putting together a PR in the next few days.

#8 Updated by Kotresh Hiremath Ravishankar 5 months ago

This is really interesting. Waiting for you PR to understand in what scenario this can happen.

#10 Updated by Venky Shankar 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 47067

#11 Updated by Venky Shankar 4 months ago

  • Status changed from Fix Under Review to Pending Backport

#12 Updated by Backport Bot 4 months ago

  • Copied to Backport #57112: quincy: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" added

#13 Updated by Backport Bot 4 months ago

  • Copied to Backport #57113: pacific: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" added

#14 Updated by Backport Bot 4 months ago

  • Tags set to backport_processed

Also available in: Atom PDF