Bug #55583: Intermittent ParsingError failure in mgr/volumes module during "clone cancel" - CephFS - Ceph

Actions

Copy link

Bug #55583

closed

Intermittent ParsingError failure in mgr/volumes module during "clone cancel"

Added by John Mulligan almost 2 years ago. Updated 9 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

Kotresh Hiremath Ravishankar

Category:

Administration/Usability

Target version:

Ceph - v18.0.0

% Done:

100%

Source:

Tags:

backport_processed

Backport:

quincy, pacific

Regression:

Yes

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

mgr/volumes

Labels (FS):

Pull request ID:

47067

Crash signature (v1):

Crash signature (v2):

Description

This issue is a bit difficult to explain so please bear with me.

On quincy, but not on octopus or pacific, a test runs that attempt to cancel a clone. Sometimes when this test runs it fails with an python traceback in the mgr response. This traceback indicates a configparser.ParsingError was raised.

rados: ret=-22, Invalid argument: "Traceback (most recent call last):\n
  File \"/usr/share/ceph/mgr/mgr_module.py\", line 1701, in _handle_command\n
    return self.handle_command(inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 409, in handle_command\n
    return handler(inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 38, in wrap\n
    return f(self, inbuf, cmd)\n
  File \"/usr/share/ceph/mgr/volumes/module.py\", line 636, in _cmd_fs_clone_cancel\n
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))\n
  File \"/usr/share/ceph/mgr/volumes/fs/volume.py\", line 582, in clone_cancel\n
    self.cloner.cancel_job(volname, (clonename, groupname))\n
  File \"/usr/share/ceph/mgr/volumes/fs/async_cloner.py\", line 389, in cancel_job\n
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:\n
  File \"/usr/lib64/python3.6/contextlib.py\", line 81, in __enter__\n
    return next(self.gen)\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py\", line 72, in open_subvol\n
    subvolume = loaded_subvolumes.get_subvolume_object(mgr, fs, vol_spec, group, subvolname)\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/__init__.py\", line 95, in get_subvolume_object\n
    subvolume.discover()\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_base.py\", line 319, in discover\n
    self.metadata_mgr.refresh()\n
  File \"/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py\", line 52, in refresh\n
    self.config.readfp(conf_data)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 763, in readfp\n
    self.read_file(fp, source=filename)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 718, in read_file\n
    self._read(f, source)\n
  File \"/usr/lib64/python3.6/configparser.py\", line 1111, in _read\n
    raise e\nconfigparser.ParsingError: Source contains parsing errors: '<???>'\n\t[line 13]: 'a0\\n'\n"

(I've added newlines back to the traceback to make it more readable. It's possible I missed some)

The automated test is part of the go-ceph project. We're tracking the issue on our side here:
https://github.com/ceph/go-ceph/issues/679

Since this one has been hard for me to track down I'm filing the issue with the express request that I can get some additional hints on how to debug this further.

One additional challenge is that I can not reproduce this on my desktop system, but it does occur regularly in our CI. We're blocked officially supporting quincy until this test, as well as some others, are better understood. Thanks!

Files

mgr.log.gz (17.4 KB) mgr.log.gz

John Mulligan, 06/28/2022 01:25 PM

Related issues 2 (1 open — 1 closed)

Actions

Copy link

Updated by John Mulligan almost 2 years ago

This appears to be affecting ceph pacific now with the v16.2.9 release.

Actions

Copy link

Updated by Venky Shankar almost 2 years ago

Assignee set to Kotresh Hiremath Ravishankar
Target version set to v18.0.0
Backport set to quincy, pacific
Severity changed from 3 - minor to 2 - major

Kotresh, please take a look.

Actions

Copy link

Updated by Kotresh Hiremath Ravishankar almost 2 years ago

Hi John,

This looks like a race. If this reproduces consistently with the test suite, could you please share the config file located at '/mnt/volumes/<groupname>/<subvolumename>/.meta' ?

Thanks,
Kotresh

Actions

Copy link

Updated by John Mulligan almost 2 years ago

We're looking into getting that for you, thanks.

Actions

Copy link

Updated by Kotresh Hiremath Ravishankar almost 2 years ago

Hi John,

Any update on this ?

Actions

Copy link

Updated by John Mulligan almost 2 years ago

File mgr.log.gz mgr.log.gz added

Hi there, sorry for delays, this was very tricky to get info on as it did not reproduce outside of our CI. So it required a bit of hackery.

I made a small patch to the metadata_manager.py to capture the state of the file, like so:

diff --git a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
index e6bcb718d20..1268ba49adf 100644
--- a/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
+++ b/src/pybind/mgr/volumes/fs/operations/versions/metadata_manager.py
@@ -42,6 +42,11 @@ class MetadataManager(object):
         else:
             self.config = configparser.SafeConfigParser()

+    def _capture(self, f, *args, **kwargs):
+        ret = f(*args, **kwargs)
+        log.info('CAPTURED %r', ret)
+        return ret
+
     def refresh(self):
         fd = None
         conf_data = StringIO()
@@ -49,7 +54,7 @@ class MetadataManager(object):
             log.debug("opening config {0}".format(self.config_path))
             fd = self.fs.open(self.config_path, os.O_RDONLY)
             while True:
-                data = self.fs.read(fd, -1, MetadataManager.MAX_IO_BYTES)
+                data = self._capture(self.fs.read, fd, -1, MetadataManager.MAX_IO_BYTES)
                 if not len(data):
                     break
                 conf_data.write(data.decode('utf-8'))

I've attached the full log with the failure. The relevant part looks like:

2022-06-28T13:08:39.169+0000 7f27f001d700  0 log_channel(audit) log [DBG] : from='client.4391 -' entity='client.admin' cmd=[{"clone_name":"babydino2","format":"json","group_name":"Park","prefix":"fs clone cancel","vol_name":"cephfs"}]: dispatch
2022-06-28T13:08:39.169+0000 7f28057d3700  0 [volumes INFO volumes.module] Starting _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "" 
2022-06-28T13:08:39.173+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] cloning to subvolume path: /volumes/Park/babydino2
2022-06-28T13:08:39.173+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] starting clone: (cephfs, Park, babydino2)
2022-06-28T13:08:39.177+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.177+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.181+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.181+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.185+0000 7f27c2f70700  0 [volumes INFO volumes.fs.async_cloner] Delayed cloning (cephfs, Park, babydino2) -- by 0 seconds
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.189+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'[GLOBAL]\nversion = 2\ntype = clone\npath = /volumes/Park/babydino2/8835d0dd-e2c1-42c7-95ab-5230de39df90\nstate = pending\n\n[source]\nvolume = cephfs\ngroup = Park\nsubvolume = Jurrasic\nsnapshot = dinodna0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'
2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b''
2022-06-28T13:08:39.205+0000 7f27c2f70700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] wrote 203 bytes to config b'/volumes/Park/babydino2/.meta'
2022-06-28T13:08:39.205+0000 7f28057d3700  0 [volumes ERROR volumes.module] Failed _cmd_fs_clone_cancel(clone_name:babydino2, format:json, group_name:Park, prefix:fs clone cancel, vol_name:cephfs) < "":
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/module.py", line 38, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 784, in _cmd_fs_clone_cancel
    vol_name=cmd['vol_name'], clone_name=cmd['clone_name'], group_name=cmd.get('group_name', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 730, in clone_cancel
    self.cloner.cancel_job(volname, (clonename, groupname))
  File "/usr/share/ceph/mgr/volumes/fs/async_cloner.py", line 401, in cancel_job
    with open_subvol(self.fs_client.mgr, fs_handle, self.vc.volspec, group, clonename, SubvolumeOpType.CLONE_CANCEL) as clone_subvolume:
  File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/subvolume.py", line 73, in open_subvol
    subvolume.open(op_type)
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/subvolume_v2.py", line 301, in open
    self.metadata_mgr.refresh()
  File "/usr/share/ceph/mgr/volumes/fs/operations/versions/metadata_manager.py", line 62, in refresh
    self.config.readfp(conf_data)
  File "/usr/lib64/python3.6/configparser.py", line 763, in readfp
    self.read_file(fp, source=filename)
  File "/usr/lib64/python3.6/configparser.py", line 718, in read_file
    self._read(f, source)
  File "/usr/lib64/python3.6/configparser.py", line 1111, in _read
    raise e
configparser.ParsingError: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'

2022-06-28T13:08:39.205+0000 7f28057d3700 -1 mgr handle_command module 'volumes' command handler threw exception: Source contains parsing errors: '<???>'
    [line 13]: 'a0\n'

If you look carefully at the "CAPTURED" lines I added you can see a line:

2022-06-28T13:08:39.201+0000 7f28057d3700  0 [volumes INFO volumes.fs.operations.versions.metadata_manager] CAPTURED b'a0\n\n'

This appears to be the cause of the problem. I think the read method is ending up returning the last 4 bytes of the file a second time. Perhaps there's an issue with the underlying read method?

Anyway, I hope this helps. Let me know if you need more debugging from our side as I'm now set up to test patched versions of the mgr in the go-ceph CI and gather the resulting mgr logs if needed.

Actions

Copy link

Updated by John Mulligan almost 2 years ago

I think I have a fix for this issue. I'm working on verifying it for go-ceph. If that all goes well I be putting together a PR in the next few days.

Actions

Copy link

Status changed from Pending Backport to Resolved
% Done changed from 0 to 100

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #55583

Intermittent ParsingError failure in mgr/volumes module during "clone cancel"

Updated by John Mulligan almost 2 years ago

Updated by Venky Shankar almost 2 years ago

Updated by Kotresh Hiremath Ravishankar almost 2 years ago

Updated by John Mulligan almost 2 years ago

Updated by Kotresh Hiremath Ravishankar almost 2 years ago

Updated by John Mulligan almost 2 years ago

Updated by John Mulligan almost 2 years ago

Updated by Kotresh Hiremath Ravishankar almost 2 years ago

Updated by John Mulligan almost 2 years ago

Updated by Venky Shankar almost 2 years ago

Updated by Venky Shankar over 1 year ago

Updated by Backport Bot over 1 year ago

Updated by Backport Bot over 1 year ago

Updated by Backport Bot over 1 year ago

Updated by Konstantin Shalygin 9 months ago