Bug #58062

RBD tasks will stop if a pool is deleted, blocking further queue

Added by Miodrag Prelec over 1 year ago. Updated over 1 year ago.

Status:
Duplicate
Priority:
Normal
Assignee:
Ilya Dryomov
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

I believe we found a bug with RBD tasks in the Ceph MGR, present somewhere between versions 16.2.0 and 17.2.1 (these are the versions I can confirm so far have the bug).

Bug description:
When an RBD trash removal task exists for a specific image and the pool owning that image is deleted, the stuck task blocks all other tasks and the queue starts filling up.
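
For illustration, here is a minimal command-line sketch of the trigger as I understand it. The pool and image names are placeholders, the rbd_support MGR module is assumed to be enabled (it is by default), and the image has to be large or slow enough to delete that the pool removal happens while the trash-remove task is still pending:

# Hypothetical names; adjust pool, pg count and image size as needed.
ceph osd pool create testpool 32
rbd pool init testpool
rbd create --size 100G --image-feature layering testpool/testimg

# Move the image to trash and queue a background trash-remove task in the MGR.
rbd trash mv testpool/testimg
IMAGE_ID=$(rbd trash ls testpool | awk '{print $1}')   # single trashed image assumed
ceph rbd task add trash remove testpool/${IMAGE_ID}

# Delete the pool while the task is still pending/running.
ceph config set mon mon_allow_pool_delete true
ceph osd pool rm testpool testpool --yes-i-really-really-mean-it

# The orphaned task keeps retrying and wedges everything queued behind it.
ceph rbd task list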

High level observation:
We have a specific workload on Ceph, using mostly RBD to create and consume PersistentVolumes in Kubernetes. Since we mostly provide testing labs/environments/clusters, there's a significant amount of creating and tearing down of Namespaces, Clusters and, subsequently, PersistentVolumes as well. With this workload, which had been working rock solid on older versions of Ceph for years (we were using 12.2.11 previously), we've hit a number of different bugs in the past couple of months (after creating new clusters - Pacific, then Quincy). We suspected hardware failures at first, since the first indication was OSDs' slow-operations warnings, but after creating a cluster in another data center and running into the same issues on different types of disks and server vendors, we could confirm hardware was not the issue.

After one global outage caused by sporadic power loss in the cabinets, we started improving our setup in terms of resilience, observability, hardware capacity and workload analysis. This helped, but every now and then we would be confronted with these types of issues:
  • OSD slow operations
  • Manager daemon being unresponsive
  • Manager daemon memory consumption rising to high levels in short periods of time (from 1 GB to ~8 GB in 12 hours was the highest we've seen)
  • Sometimes, at 1-4 GB memory consumption, the MGR would start to hang, making it impossible to detect the main issue in the first place
  • Some commands that are not directly connected to the MGR (I suppose) would still work; if needed I can reproduce this in more detail
  • Read/Write operations blocked

We could always correlate these issues with a lot of images being left in the trash and not being cleaned up. Of course, since the MGR hangs at this point, it was really hard to debug any further.

Root cause(s)
POOLS - Although it increases complexity and computation needs (we have the hardware to support it), creating a pool per cluster really makes sense for us due to the types of workloads mentioned above. It's a lot easier (for us, if only for historical reasons) to handle observability, quotas and other tooling around the K8s cluster <-> pool mapping.
AUTOMATION - Once we gave users the ability to create, remove or upgrade their own clusters as needed, we started hitting this bug. The main issue was K8s cluster (and consequently pool) deletion, since users rarely bothered to undeploy software (PersistentVolumes/RBD images) before tearing down the K8s cluster (and consequently the Ceph pools). This left images in the trash, especially on Fridays, when users would massively create and tear down K8s clusters/namespaces for over-the-weekend tests. Yaaay, weekend work.
REPETITION - If a pool is deleted while some images still exist in its trash, and a new pool is created with the same name in the meantime, CephX will complain about not having permissions to delete the image.
PROTECTED IMAGES - Our users also run so-called "timeshift" tests - shifting environments into the future to test workloads under e.g. certificate expiry. If an image is protected into the future and deleted from Kubernetes (e.g. forcefully) but not from Ceph, it will most likely trigger this bug.

Workaround(s)
Enabling (where applicable) RBD image features like fast-diff and object-map helped, since RBD images could be deleted a lot faster with fast-diff enabled. Of course, we also stopped deleting pools altogether, which finally solved the issue.
By watching more closely on Fridays, we found that at some point we would get these big increases in Ceph MGR memory consumption and a lot of images left in the trash, piling up and never getting cleaned up. Since the MGR was still responsive at that point, we could see that ceph rbd task list would return a huge number of pending tasks; 280k pending tasks was the highest value we detected. Canceling the stuck tasks would fix the issue and allow the queue to slowly drain. We also intentionally failed the MGR daemon to speed up the process.
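
Roughly, the commands involved in this workaround look like the following sketch. The image spec uses placeholders, and object-map/fast-diff require exclusive-lock to already be enabled on the image:

# Speed up deletions: enable object-map and fast-diff on an existing image.
rbd feature enable $POOL_NAME/$IMAGE_NAME object-map fast-diff

# Inspect the MGR task queue; the output is JSON, so it can be counted with jq.
ceph rbd task list
ceph rbd task list | jq length

# Cancel a stuck task by its id (taken from the task list), then optionally
# fail over the active MGR to speed up draining the rest of the queue.
ceph rbd task cancel cf6db76e-cf21-44d7-b223-521938694567
ceph mgr fail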

Logs
[
  {
    "id": "cf6db76e-cf21-44d7-b223-521938694567",
    "message": "Removing image $POOL_NAME/525660e89545eb from trash",
    "refs": {
      "action": "trash remove",
      "image_id": "525660e89545eb",
      "pool_name": "$POOL_NAME",
      "pool_namespace": ""
    },
    "sequence": 558272
  },
  {
    "id": "d56edbc5-b6e4-4f79-aaec-f7e8e2702959",
    "message": "Removing image $POOL_NAME/525660f8a986b2 from trash",
    "refs": {
      "action": "trash remove",
      "image_id": "525660f8a986b2",
      "pool_name": "$POOL_NAME",
      "pool_namespace": ""
    },
    "sequence": 558273
  },

  {
    "id": "79e23f6a-f50f-41ec-95b4-73bcd9af1fed",
    "message": "Removing image $POOL_NAME/4a6d36df0c85ba from trash",
    "refs": {
      "action": "trash remove",
      "image_id": "4a6d36df0c85ba",
      "pool_name": "$POOL_NAME",
      "pool_namespace": ""
    },
    "retry_attempts": 1306,
    "retry_message": "[errno 1] RBD permission error (error deleting image from trash)",
    "retry_time": "2022-10-10T05:33:04.643126",
    "sequence": 508979
  },
...

How to reproduce
1. Create some pools
2. Create a bunch of images in each one of them (I used PVs from K8s, but I suspect manually creating them would work as well)
3. Use a for loop or some other method to put them all in trash and delete the pools at the same time

When reproducing the bug, I've used imageFeatures: Layering in the StorageClass definition for this purpose, since it takes a lot longer for an RBD image to be deleted that way.
I've managed to reproduce it in 100% of cases so far using this method.
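
For reference, a scripted version of the manual (non-Kubernetes) variant of the steps above might look like the sketch below. Pool names, image counts and sizes are made up; the key point is that the pool deletions race with the pending trash-remove tasks:

# 1. Create some pools (names and pg counts are examples).
for p in repro-pool-1 repro-pool-2 repro-pool-3; do
    ceph osd pool create $p 32
    rbd pool init $p
done

# 2. Create a bunch of images in each pool, with only the layering feature
#    so that deleting them takes longer.
for p in repro-pool-1 repro-pool-2 repro-pool-3; do
    for i in $(seq 1 20); do
        rbd create --size 10G --image-feature layering $p/img-$i
    done
done

# 3. Trash the images, schedule MGR trash-remove tasks for them, and delete
#    the pools at the same time.
ceph config set mon mon_allow_pool_delete true
for p in repro-pool-1 repro-pool-2 repro-pool-3; do
    for i in $(seq 1 20); do
        rbd trash mv $p/img-$i
    done
    for id in $(rbd trash ls $p | awk '{print $1}'); do
        ceph rbd task add trash remove $p/$id
    done
    ceph osd pool rm $p $p --yes-i-really-really-mean-it
done

# The queue should now contain tasks that retry forever with a permission error.
ceph rbd task list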

Environment information
I'm pretty sure this bug is hardware and environment agnostic, but here are our specs and versions in case it helps someone.
One Ceph cluster with 11 hosts (v17.2.1), all Dell PowerEdge R640 Gen9, with 39 consumer-grade 800 GB/2 TiB SSDs altogether (each host: 56 logical cores, Xeon Gold 6132, 192 GB RAM, ...).
One Ceph cluster with 6 hosts (v17.2.0), all HPE ProLiant DL360 Gen10, with 2x 6 TiB NVMe disks per host used for OSDs (each server: 48 logical cores, Xeon Gold 6252, 768 GiB RAM, ...).
Each cluster has 2x10Gb bonded interfaces for public access and 2x10Gb bonded interfaces for the private_network.
We use ceph-csi (v3.6.1) to create and delete RBD images from Kubernetes. We use Podman 3.4.44 to containerize Ceph daemons/services.

Please let me know if I can provide more information or any way I can help to solve this one.

Thank you in advance,
best regards,
Miodrag


Related issues

Duplicates rbd - Bug #52932: [rbd_support] pool removal can wedge the task queue Resolved

History

#1 Updated by Ilya Dryomov over 1 year ago

  • Status changed from New to Duplicate
  • Assignee set to Ilya Dryomov

Hi Miodrag,

Thanks for the report! This is a known issue; I'll bump the priority on the older ticket and we will try to resolve it ASAP.

#2 Updated by Ilya Dryomov over 1 year ago

  • Duplicates Bug #52932: [rbd_support] pool removal can wedge the task queue added
