Bug #58090

open

Non-existent pending clone shows up in snapshot info

Added by Sebastian Hasler over 1 year ago. Updated 30 days ago.

Status:
New
Priority:
Normal
Category:
fsck/damage handling
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
pacific,quincy
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: v17.2.5

My CephFS somehow got into a state where a snapshot has a pending clone, but the pending clone doesn't exist. (This is problematic because the pending clone prevents me from deleting the snapshot.)

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
Error ENOENT: subvolume 'csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec' does not exist

I think the CephFS got into this state when the clone failed due to insufficient disk space. That happened quite a while ago, on an older version of Ceph, so the underlying cause might or might not have been fixed in the meantime.

The point of this ticket is that CephFS should be able to recover from this state, but currently that does not seem to be the case.

To try to recover from this state, I had the idea to re-create the clone with that exact name and then cancel it.

$ ceph fs subvolume --group_name=csi snapshot clone ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec --target_group_name=csi

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
{
  "status": {
    "state": "in-progress",
    "source": {
      "volume": "ssd-fs",
      "subvolume": "csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e",
      "snapshot": "csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e",
      "group": "csi" 
    }
  }
}

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        },
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

$ ceph fs clone --group_name=csi cancel ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
{
  "status": {
    "state": "canceled",
    "source": {
      "volume": "ssd-fs",
      "subvolume": "csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e",
      "snapshot": "csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e",
      "group": "csi" 
    },
    "failure": {
      "errno": "4",
      "error_msg": "user interrupted clone operation" 
    }
  }
}

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

However, as you can see, re-creating the clone adds a duplicate entry to the `pending_clones` list, and cancelling the clone only removes one of those two entries. The stale pending clone remains, so I still cannot delete the snapshot.

Actions #1

Updated by Venky Shankar over 1 year ago

Hi Sebastian,

There is a stray index entry causing this issue. Could you list the contents of `/volumes/_index/clone/` (under the CephFS mount point)?
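
For illustration, inspecting that index requires a direct mount of the file system; the `/mnt/cephfs` mount point and the `admin` client below are only assumptions, and the exact mount syntax depends on your Ceph/kernel version:

$ sudo mount -t ceph :/ /mnt/cephfs -o name=admin,fs=ssd-fs
$ ls -l /mnt/cephfs/volumes/_index/clone/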

Actions #2

Updated by Sebastian Hasler over 1 year ago

The snapshot has now (finally) been deleted. From the logs of our CSI provisioner, it seems the snapshot was deleted shortly after I created this issue. So I guess the re-creation and cancellation of the clone did have an effect, just slightly delayed.

Actions #3

Updated by Sebastian Hasler over 1 year ago

The `/volumes/_index/clone/` directory is empty, by the way. But that's after the snapshot was deleted successfully. I don't know what this directory looked like during the previous year, when the CSI provisioner continuously tried to delete this snapshot (and failed due to the non-existent pending clone).

Actions #4

Updated by Venky Shankar over 1 year ago

Sebastian Hasler wrote:

The `/volumes/_index/clone/` directory is empty, by the way. But that's after the snapshot was deleted successfully. I don't know what this directory looked like during the previous year, when the CSI provisioner continuously tried to delete this snapshot (and failed due to the non-existent pending clone).

Most likely there would have been a dangling symlink in that directory. We have seen this before; one can run into it when there is insufficient disk space.
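
For reference, such a dangling (broken) symlink can be spotted from the operator's side with GNU find's `-xtype l` test, which matches symlinks whose target no longer exists; the `/mnt/cephfs` mount point below is just an assumed example:

$ find /mnt/cephfs/volumes/_index/clone/ -maxdepth 1 -xtype l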

Actions #5

Updated by Venky Shankar over 1 year ago

  • Assignee set to Rishabh Dave
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) mgr/volumes added

Rishabh, please take a look at this. I think the dangling symlink can be gracefully handled by deleting it.
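
As a rough manual workaround sketch (not the mgr-side fix suggested above), only entries confirmed to be dangling could be removed, e.g. with GNU find's `-delete`; `/mnt/cephfs` is again just an assumed mount point, and since this touches mgr/volumes-internal metadata it should be double-checked before deleting anything:

$ find /mnt/cephfs/volumes/_index/clone/ -maxdepth 1 -xtype l -delete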

Actions #6

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)

Actions #7

Updated by Venky Shankar 3 months ago

  • Assignee changed from Rishabh Dave to Neeraj Pratap Singh

Neeraj, please take this one.

Actions #8

Updated by Kotresh Hiremath Ravishankar about 2 months ago

Neeraj and I had a discussion regarding this.

We fixed a bunch of issues around clones and dangling index symlinks, so I think this issue should not occur anymore. But we do need a mechanism to get out of this situation if it arose on an older version.
I think we can check for and clear the dangling symlinks in the `snapshot info` command.

Thanks,
Kotresh H R

Actions #9

Updated by Dhairya Parmar 30 days ago

  • Pull request ID set to 55838