_scan_snaps no head for <object>
Ceph Version: ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
Deployed via Rook on Kubernetes: rook/rook:v0.7.1
Client: Proxmox 5.2-6
We are running a dedicated Rook/Ceph Kubernetes cluster. This cluster has:
- 3 nodes
- each node has 4 disks provisioned with BlueStore
- each node runs OSD pods created by Rook, which manages one OSD per disk
- so we have a total of 3*4 = 12 OSDs
We are using this cluster as a data backend for our application Kubernetes cluster and for our Proxmox. Therefore we have 4 rbd pools:
- kubernetes (replication: 3)
- kubernetes-volatile (replication: 0)
- proxmox (replication: 3)
- proxmox-volatile (replication: 0)
On the volatile storage we are hosting applications that are clustered themselves, such as:
- Univention Company Server
So the idea was to save disk space and network traffic by leaving out the replication.
Our hosting provider had to replace the optical fiber between one of our Ceph servers and a switch.
During this procedure it seems that something broke, at least in our proxmox-volatile pool.
A few times a day (when an OSD is scrubbing) we get log messages like:
2019-01-22 13:07:46.493942 I | osd11: 2019-01-22 13:07:46.493794 7ff91bb90700 -1 osd.11 pg_epoch: 6064 pg[3.2c( v 6064'291018055 (6043'291016552,6064'291018055] local-lis/les=6062/6063 n=8278 ec=45/45 lis/c 6062/6062 les/c/f 6063/6063/0 6062/6062/6062)  r=0 lpr=6062 lua=6064'291018054 crt=6064'291018055 lcod 6064'291018054 mlcod 6064'291018053 active+clean+scrubbing+deep] _scan_snaps no head for 3:37fc2cc6:::rbd_data.1c085e238e1f29.000000000000185f:4 (have MIN)
When these messages appear, the corresponding OSD stalls and gets marked unhealthy by the other OSDs. Services with volumes/images on the volatile pools stall as well.
We found a few docs describing how to fix this with FileStore as the Ceph backend, but none for BlueStore. Also, since we run Ceph via Rook, it is not trivial (and I don't know how) to take an OSD offline in order to use the Ceph toolbox.
What we are doing right now is extracting the object IDs from the logs and trying to delete them via the rados CLI.
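For the record, the extraction step can be sketched like this. The sample log line is the one quoted above; the pool name "proxmox-volatile" is from our setup and may differ for you, and the actual rados deletion is left commented out:

```shell
# Extract the broken object's name from a "_scan_snaps no head for" log line.
# Object spec format: <pool-id>:<hash>:::<object-name>:<snap-id>
line='_scan_snaps no head for 3:37fc2cc6:::rbd_data.1c085e238e1f29.000000000000185f:4 (have MIN)'
obj=$(echo "$line" | sed -n 's/.*_scan_snaps no head for .*:::\([^:]*\):.*/\1/p')
echo "$obj"   # prints rbd_data.1c085e238e1f29.000000000000185f

# Then remove the orphaned object from the pool (pool name assumed):
# rados -p proxmox-volatile rm "$obj"
```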
So two things:
1. What is actually the best way to fix this?
2. Is there a way to make the OSDs more resilient, so that they can scrub and identify these errors but stay available?
I found many references saying that this error "actually never happens", but I guess this is not true in all cases.
#1 Updated by Alwin Mark 3 months ago
Ok Good News!
I found the problem. After reading the excellent blog entry at https://www.sebastien-han.fr/blog/2012/07/16/rbd-objects/ I wrote a script that scrapes the OSD scrubbing errors and compares the object names to each image's block_name_prefix. It turned out that all broken objects belong to two images, and both images are assigned to one Proxmox VM that had issues after the fiber exchange and an upgrade. On that VM we had done an rbd rollback after snapshotting the broken image, so we could recover to it in the worst case. Now we found out that this snapshot is broken. (One indicator was that there were no watchers on the object, which is hopefully only the case when there are also no watchers on the image/snapshot.)
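The mapping from a broken object back to its image can be sketched roughly like this. The prefix extraction is plain shell string handling; the loop against a live cluster (pool name assumed, commented out) uses rbd info's block_name_prefix field:

```shell
# An RBD data object is named "<block_name_prefix>.<offset>", so stripping
# the last dot-separated component yields the image's block_name_prefix.
obj='rbd_data.1c085e238e1f29.000000000000185f'
prefix=${obj%.*}
echo "$prefix"   # prints rbd_data.1c085e238e1f29

# Then find the image with a matching block_name_prefix (pool name assumed):
# for img in $(rbd ls proxmox-volatile); do
#   p=$(rbd info "proxmox-volatile/$img" | awk '/block_name_prefix/ {print $2}')
#   [ "$p" = "$prefix" ] && echo "broken object belongs to image: $img"
# done
```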
So deleting the snapshot makes the scrubbing process run correctly again without getting stuck.
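For reference, removing such a snapshot is done with the rbd CLI. The image and snapshot names below are placeholders (only the pool name is from our setup), a protected snapshot must be unprotected first, and the echo makes this a dry run:

```shell
pool=proxmox-volatile     # our pool; adjust as needed
image=vm-100-disk-1       # placeholder image name
snap=before-upgrade       # placeholder snapshot name
spec="$pool/$image@$snap"
# Dry run: print the commands instead of executing them.
echo "rbd snap unprotect $spec"
echo "rbd snap rm $spec"
```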
Nevertheless, this should never happen again and is hopefully avoidable, so I'd like to keep this ticket open, not to fix the root cause of this problem, but to improve the fault tolerance and debug output.
So what would have helped me would be:
1. OSDs should not get stuck during scrubbing. If that couldn't happen, I would feel much better about porting more and more services onto Ceph. Ideally, when an OSD recognizes that there are broken objects, it should try to isolate or ignore them, whatever is needed to avoid affecting the whole cluster.
2. PGs containing broken objects should be marked and shown as broken in the status view.
3. It would also be useful to show which image those objects belong to, so that I wouldn't have had to understand the whole architecture and CLI commands to find that bad snapshot. It would also help people fix the problem under the worst circumstances by deleting that image. That is still better than losing a whole cluster (or, in our case, two clusters (data + app)) and many of our VMs during an outage.