Bug #38135
Ceph is in HEALTH_ERR status with inconsistent PGs after rbd snapshot create/remove tasks.
Description
We observe that Ceph enters HEALTH_ERR status with inconsistent PGs after running rbd snapshot create/remove tasks. Here are the environment and steps:
1, The Ceph cluster has 108 OSDs.
2, Create a pool with 2048 PGs.
3, Generate 500K RBD images in the pool; each image is 20G.
4, After Ceph performs some deep-scrubs, the cluster is in HEALTH_OK status.
5, Create snapshots for those RBDs; total snapshots are around 1.2M.
6, Make sure the Ceph cluster is in HEALTH_OK.
7, Randomly create and remove snapshots in parallel. We have about 6 clients doing the creating/removing.
8, We observe some PGs in the snaptrim_wait state; after about 12 hours, we get about 3 inconsistent PGs.
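The steps above can be sketched as follows. This is a dry run: pool and image names are placeholders, image and iteration counts are reduced, and the ceph/rbd commands are echoed rather than executed, since they require a live cluster.

```shell
#!/bin/bash
# Dry-run sketch of the reproduction steps; pool/image names are placeholders.
# Remove the "echo" prefixes to run against a real cluster.
POOL=testpool

# Step 2: create a pool with 2048 placement groups
echo ceph osd pool create "$POOL" 2048 2048
echo ceph osd pool application enable "$POOL" rbd

# Step 3: create the RBD images (500K in the report; 3 shown here)
for i in $(seq 1 3); do
  echo rbd create --size 20G "$POOL/img_$i"
done

# Step 5: snapshot every image once
for i in $(seq 1 3); do
  echo rbd snap create "$POOL/img_$i@base"
done

# Step 7: one random create-or-remove action, as a single client would do it
snap_action() {
  img="$POOL/img_$((RANDOM % 3 + 1))"
  if [ $((RANDOM % 2)) -eq 0 ]; then
    echo rbd snap create "$img@snap_$RANDOM"
  else
    echo rbd snap rm "$img@snap_$RANDOM"
  fi
}
snap_action
```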
For comparison, on Ceph 12 with 100K RBDs and about 2M snapshots, we got only 1 inconsistent PG, along with some crashed OSDs.
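For reference, the inconsistent PGs can be located and inspected with commands like these (a sketch: the pool name and PG id are placeholders, and the commands are echoed rather than executed):

```shell
#!/bin/bash
# Dry-run sketch: locate, inspect, and repair inconsistent PGs.
# Pool name and PG id below are placeholders.
inspect_pg() {
  pool=$1; pgid=$2
  # List PGs flagged inconsistent in the pool
  echo rados list-inconsistent-pg "$pool"
  # Inspect the inconsistent objects in one PG
  echo rados list-inconsistent-obj "$pgid" --format=json-pretty
  # Ask the primary OSD to repair the PG
  echo ceph pg repair "$pgid"
}
inspect_pg testpool 2.1a
```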
If you need more details, please let me know; I am happy to provide the test scripts and further detail.
Files
Updated by Bengen Tan about 5 years ago
- File create_rbd.sh added
- File create_snapshot.sh added
- File delete_random_snapshot.sh added
- File snapshot_action.sh added
1, create_rbd.sh, this creates the RBD images.
2, create_snapshot.sh, this creates the snapshots.
3, delete_random_snapshot.sh, this deletes random snapshots.
4, snapshot_action.sh, this performs snapshot creation and deletion in parallel.
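The attached snapshot_action.sh is not reproduced here, but a parallel driver of the kind described in step 7 might look roughly like this (a sketch: client count, image count, and iteration count are placeholders, and the rbd commands are echoed rather than executed):

```shell
#!/bin/bash
# Sketch of ~6 parallel clients randomly creating/removing snapshots.
# NOT the attached snapshot_action.sh; rbd commands are echoed (dry run).
POOL=testpool
CLIENTS=6

client_loop() {
  id=$1
  for _ in 1 2 3; do              # the reported run lasted ~12 hours
    img="$POOL/img_$((RANDOM % 3 + 1))"
    if [ $((RANDOM % 2)) -eq 0 ]; then
      echo "client $id: rbd snap create $img@snap_$RANDOM"
    else
      echo "client $id: rbd snap rm $img@snap_$RANDOM"
    fi
  done
}

# Launch the clients in the background and wait for all of them
for c in $(seq 1 "$CLIENTS"); do
  client_loop "$c" &
done
wait
```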
Updated by Greg Farnum about 5 years ago
- Project changed from Ceph to RADOS
- Category changed from common to Snapshots