Bug #38135
open
Ceph is in HEALTH_ERR status with inconsistent PG after some rbd snapshot creating/removing task.
Description
We observe that Ceph enters HEALTH_ERR status with inconsistent PGs after running RBD snapshot create/remove tasks. The environment and reproduction steps are:
1, The Ceph cluster has 108 OSDs.
2, Create a pool with 2048 PGs.
3, Create 500K RBD images in the pool, each 20G in size.
4, After Ceph completes some deep scrubs, the cluster is in HEALTH_OK status.
5, Create snapshots for those RBD images; the total is around 1.2M snapshots.
6, Make sure the Ceph cluster is in HEALTH_OK.
7, Randomly create and remove snapshots in parallel. About 6 clients do the creating/removing.
8, We observe some PGs in snaptrim_wait; after about 12 hours, there are about 3 inconsistent PGs.
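The random create/remove workload in step 7 could be sketched roughly as follows. This is not the reporter's test script, only a minimal illustration under stated assumptions: the pool name `testpool`, the image names `image-N`, and the snapshot names `snap-N` are hypothetical, and the sketch only removes snapshots it created itself. It builds standard `rbd snap create` and `rbd snap rm` commands and prints them by default instead of executing them.

```python
import random
import subprocess


def snap_command(action, pool, image, snap):
    """Build the rbd CLI command that creates or removes one snapshot."""
    if action not in ("create", "rm"):
        raise ValueError(f"unsupported action: {action}")
    return ["rbd", "snap", action, f"{pool}/{image}@{snap}"]


def random_snapshot_ops(pool, images, count, seed=None):
    """Yield `count` commands, randomly mixing snapshot creates and removes.

    Only snapshots created earlier by this generator are removed, so every
    'rm' refers to a snapshot that a preceding 'create' introduced.
    """
    rng = random.Random(seed)
    live = []  # (image, snap) pairs created and not yet removed
    for i in range(count):
        if live and rng.random() < 0.5:
            image, snap = live.pop(rng.randrange(len(live)))
            yield snap_command("rm", pool, image, snap)
        else:
            image = rng.choice(images)
            snap = f"snap-{i}"  # hypothetical naming scheme
            live.append((image, snap))
            yield snap_command("create", pool, image, snap)


def run(commands, dry_run=True):
    """Print each command, or execute it against the cluster if dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical pool and image names; the real test used about 500K images.
    images = [f"image-{n}" for n in range(4)]
    run(random_snapshot_ops("testpool", images, count=10, seed=1))
```

To approximate the roughly 6 parallel clients, several instances of such a script could be started concurrently, each with its own snapshot name prefix so that names do not collide. While it runs, `ceph health detail` and `rados list-inconsistent-pg <pool>` can be used to watch for inconsistent PGs.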
For comparison, on Ceph 12 with 100K RBD images and about 2M snapshots, we got only 1 inconsistent PG, along with some crashed OSDs.
If more details are needed, please let me know; I am happy to provide the test script and further details.