Bug #58090

Non-existent pending clone shows up in snapshot info

Added by Sebastian Hasler 2 months ago. Updated about 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: fsck/damage handling
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: pacific,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): mgr/volumes
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: v17.2.5

My CephFS somehow got into a state where a snapshot has a pending clone, but the pending clone doesn't exist. (This is problematic because the pending clone prevents me from deleting the snapshot.)

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
Error ENOENT: subvolume 'csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec' does not exist
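
For context, a deletion attempt at this point is rejected because of the (non-existent) pending clone. The following transcript was not part of my original capture and is only illustrative; the exact error text may differ:

$ # illustrative only -- snapshot removal is refused while a clone is listed as pending
$ ceph fs subvolume --group_name=csi snapshot rm ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
Error EAGAIN: snapshot 'csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e' has pending clones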

I think the CephFS got into this state when the clone failed due to insufficient disk space. That happened quite a while ago, with an older version of Ceph, so the underlying cause might or might not have been fixed in the meantime.

The point of this ticket is that CephFS should be able to recover from this state, but currently that does not seem to be the case.

To try to recover from this state, I had the idea to re-create the clone with that exact name and then cancel it.

$ ceph fs subvolume --group_name=csi snapshot clone ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec --target_group_name=csi

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
{
  "status": {
    "state": "in-progress",
    "source": {
      "volume": "ssd-fs",
      "subvolume": "csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e",
      "snapshot": "csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e",
      "group": "csi" 
    }
  }
}

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        },
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

$ ceph fs clone --group_name=csi cancel ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec

$ ceph fs clone --group_name=csi status ssd-fs csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec
{
  "status": {
    "state": "canceled",
    "source": {
      "volume": "ssd-fs",
      "subvolume": "csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e",
      "snapshot": "csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e",
      "group": "csi" 
    },
    "failure": {
      "errno": "4",
      "error_msg": "user interrupted clone operation" 
    }
  }
}

$ ceph fs subvolume --group_name=csi snapshot info ssd-fs csi-vol-9ce73497-1be0-11ec-88f1-e6360fd42c9e csi-snap-cd27f06b-4fbb-11ec-978d-8af73a17386e
{
    "created_at": "2021-11-27 19:54:16.134448",
    "data_pool": "ssd-fs-data0",
    "has_pending_clones": "yes",
    "pending_clones": [
        {
            "name": "csi-vol-ff687f29-4fbd-11ec-830e-6ed86f62d6ec",
            "target_group": "csi" 
        }
    ]
}

However, as you can see, re-creating the clone leads to a duplicate entry in the `pending_clones` list, and cancelling the clone only removes one of the two entries. The original pending clone therefore remains, so I still cannot delete the snapshot.

History

#1 Updated by Venky Shankar 2 months ago

Hi Sebastian,

There is a stray index entry causing this issue. Could you list the contents of `/volumes/_index/clone/` (under the CephFS mount point)?
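
Something along these lines would do; I'm assuming the filesystem is mounted at /mnt/cephfs here, so adjust the path to your actual mount point:

$ # hypothetical mount point -- list the clone index entries (they are symlinks)
$ ls -l /mnt/cephfs/volumes/_index/clone/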

#2 Updated by Sebastian Hasler 2 months ago

The snapshot is now (finally) deleted. From the logs of our CSI provisioner, it seems the snapshot was deleted shortly after I created this issue, so I guess the re-creation and cancellation of the clone did have an effect, just slightly delayed.

#3 Updated by Sebastian Hasler 2 months ago

The `/volumes/_index/clone/` directory is empty, by the way. But that's after the snapshot was deleted successfully. I don't know what this directory looked like during the previous year, when the CSI provisioner continuously tried to delete this snapshot (and failed due to the (non-existent) pending clones).

#4 Updated by Venky Shankar about 2 months ago

Sebastian Hasler wrote:

The `/volumes/_index/clone/` directory is empty, by the way. But that's after the snapshot was deleted successfully. I don't know what this directory looked like during the previous year, when the CSI provisioner continuously tried to delete this snapshot (and failed due to the (non-existent) pending clones).

Most likely there would have been a dangling symlink in that directory. We have seen this before, and one can run into it when there is insufficient disk space.
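
Such a dangling entry can be spotted from a client mount with something like the following; the /mnt/cephfs path is only an assumption for illustration:

$ # list clone index symlinks whose target no longer exists (i.e. dangling ones)
$ find /mnt/cephfs/volumes/_index/clone/ -xtype l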

#5 Updated by Venky Shankar about 2 months ago

  • Assignee set to Rishabh Dave
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Component(FS) mgr/volumes added

Rishabh, please take a look at this. I think the dangling symlink can be gracefully handled by deleting it.
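
Until that is handled in mgr/volumes itself, a manual cleanup along these lines should work. This is only a sketch, again assuming a client mount at /mnt/cephfs; double-check the link targets before removing anything:

$ # remove only the dangling clone index symlinks; entries for live clones must stay
$ find /mnt/cephfs/volumes/_index/clone/ -xtype l -delete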
