Bug #38135


Ceph enters HEALTH_ERR status with inconsistent PGs after rbd snapshot create/remove tasks.

Added by Bengen Tan over 5 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
Snapshots
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observe that Ceph enters HEALTH_ERR status with inconsistent PGs after running some rbd snapshot create/remove tasks. Here are the environment and steps:
1. The Ceph cluster has 108 OSDs.
2. Create a pool with 2048 PGs.
3. Generate 500K RBD images in the pool; each image is 20G.
4. After Ceph performs some deep-scrubs, the cluster is in HEALTH_OK status.
5. Create snapshots for those RBD images; the total is around 1.2M snapshots.
6. Make sure the Ceph cluster is in HEALTH_OK status.
7. Randomly create and remove snapshots in parallel. We have about 6 clients doing the creating/removing.
8. We observe some PGs in snaptrim_wait; after about 12 hours we get about 3 inconsistent PGs.
For comparison, on Ceph 12 with 100K RBD images and about 2M snapshots, we only got 1 inconsistent PG, along with some crashed OSDs.
If you need more details, please let me know; I am happy to provide the test scripts and further detail.
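The steps above could be scripted roughly as follows. This is a minimal sketch, not the attached scripts (which are authoritative): the pool name, image-name prefix, and counts are illustrative assumptions, scaled down from the 500K images and ~1.2M snapshots used in the actual test. For safety the sketch only prints the commands it would run:

```shell
#!/bin/sh
# Sketch of the snapshot churn workload described above.
# POOL, PREFIX, and the counts are illustrative placeholders.
POOL=rbdtest
PREFIX=img
NUM_IMAGES=3          # 500000 in the actual test
RUN() { echo "$@"; }  # dry-run; drop the echo to actually execute

# Step 2: pool with 2048 PGs
RUN ceph osd pool create "$POOL" 2048 2048
RUN rbd pool init "$POOL"

# Steps 3 and 5: create images and an initial snapshot per image
i=0
while [ "$i" -lt "$NUM_IMAGES" ]; do
    RUN rbd create --size 20G "$POOL/$PREFIX$i"
    RUN rbd snap create "$POOL/$PREFIX$i@snap0"
    i=$((i + 1))
done

# Step 7: random create/remove churn (one client's share;
# the actual test ran ~6 such clients in parallel)
RUN rbd snap create "$POOL/${PREFIX}0@snap1"
RUN rbd snap rm "$POOL/${PREFIX}0@snap0"
```

In the real runs the create/remove loop picks images and snapshots at random and runs continuously, which is what eventually drives PGs into snaptrim_wait.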


Files

create_rbd.sh (5.63 KB) create_rbd.sh Bengen Tan, 02/01/2019 12:10 AM
create_snapshot.sh (7.63 KB) create_snapshot.sh Bengen Tan, 02/01/2019 12:10 AM
delete_random_snapshot.sh (2.02 KB) delete_random_snapshot.sh Bengen Tan, 02/01/2019 12:10 AM
snapshot_action.sh (3.48 KB) snapshot_action.sh Bengen Tan, 02/01/2019 12:10 AM