Bug #38135

open

Ceph is in HEALTH_ERR status with inconsistent PG after some rbd snapshot creating/removing task.

Added by Bengen Tan about 5 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee:
Category: Snapshots
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rbd
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We observe that Ceph enters HEALTH_ERR status with inconsistent PGs after running RBD snapshot create/remove tasks. The environment and the steps to reproduce are:
1, The Ceph cluster has 108 OSDs.
2, Create a pool with 2048 PGs.
3, Create 500K RBD images in the pool, each 20G in size.
4, After Ceph completes some deep scrubs, the cluster is in HEALTH_OK status.
5, Create snapshots for those RBD images; the total is around 1.2M snapshots.
6, Make sure the Ceph cluster is in HEALTH_OK.
7, Randomly create and remove snapshots in parallel. We have about 6 clients doing the creating/removing.
8, We observe some PGs in snaptrim_wait; after about 12 hours, we got about 3 inconsistent PGs.
For comparison, on Ceph 12 with 100K RBD images and about 2M snapshots, we got only 1 inconsistent PG, along with some crashed OSDs.
If more details are needed, please let me know; I am happy to provide the test scripts and further details.
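
The setup steps above (2 to 6) can be sketched with the standard `ceph`, `rbd`, and `rados` command-line tools. This is only a rough outline under stated assumptions, not the attached test scripts: the pool name `snaptest`, the image prefix `img-`, the snapshot name `snap-0`, and the scaled-down image count are all placeholders. By default it runs in dry-run mode and only prints the commands it would issue.

```shell
#!/usr/bin/env bash
# Hypothetical outline of steps 2-6 of the report. Names and counts are
# placeholders, not values taken from the attached scripts.
set -eu

POOL="${POOL:-snaptest}"          # assumed pool name
PG_NUM="${PG_NUM:-2048}"          # from step 2
NUM_IMAGES="${NUM_IMAGES:-3}"     # the report used ~500K; kept tiny here
IMAGE_SIZE="${IMAGE_SIZE:-20G}"   # from step 3
DRY_RUN="${DRY_RUN:-1}"           # 1 = print commands only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 2, 3 and 5: create the pool, the images, and one snapshot per image.
setup_pool_and_images() {
    run ceph osd pool create "$POOL" "$PG_NUM"
    run rbd pool init "$POOL"
    i=1
    while [ "$i" -le "$NUM_IMAGES" ]; do
        run rbd create --size "$IMAGE_SIZE" "$POOL/img-$i"
        run rbd snap create "$POOL/img-$i@snap-0"
        i=$((i + 1))
    done
}

# Steps 4 and 6: report cluster health and list any inconsistent PGs.
check_health() {
    run ceph health detail
    run rados list-inconsistent-pg "$POOL"
}

setup_pool_and_images
check_health
```

Setting `DRY_RUN=0` would execute the commands against a real cluster; the counts would then need to be raised toward the numbers in the report to approach the original load.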


Files

create_rbd.sh (5.63 KB), Bengen Tan, 02/01/2019 12:10 AM
create_snapshot.sh (7.63 KB), Bengen Tan, 02/01/2019 12:10 AM
delete_random_snapshot.sh (2.02 KB), Bengen Tan, 02/01/2019 12:10 AM
snapshot_action.sh (3.48 KB), Bengen Tan, 02/01/2019 12:10 AM

Updated by Bengen Tan about 5 years ago

1, create_rbd.sh: creates the RBD images.
2, create_snapshot.sh: creates the snapshots.
3, delete_random_snapshot.sh: deletes random snapshots.
4, snapshot_action.sh: creates and deletes snapshots in parallel.
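
For readers without access to the attachments, the parallel create/delete workload might look roughly like the sketch below. It is not the content of `snapshot_action.sh`: the pool name `snaptest`, the image names `img-N`, the snapshot naming scheme `wID-N`, and the operation counts are assumptions. As a simplification, each worker only removes snapshots it created itself. It defaults to dry-run mode and prints the `rbd` commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a parallel snapshot create/remove workload.
# This is NOT the attached snapshot_action.sh; all names are placeholders.
set -eo pipefail

POOL="${POOL:-snaptest}"                # assumed pool name
NUM_IMAGES="${NUM_IMAGES:-3}"           # assumed images img-1..img-N exist
NUM_WORKERS="${NUM_WORKERS:-6}"         # "about 6 clients" in the report
OPS_PER_WORKER="${OPS_PER_WORKER:-4}"   # assumed, kept small
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One worker: randomly either create a snapshot on a random image or
# remove a random snapshot that this worker created earlier.
worker() {
    local id="$1" n img idx
    local -a created=()
    for n in $(seq 1 "$OPS_PER_WORKER"); do
        if (( ${#created[@]} == 0 || RANDOM % 2 == 0 )); then
            img="$POOL/img-$(( RANDOM % NUM_IMAGES + 1 ))"
            run rbd snap create "$img@w${id}-${n}"
            created+=("$img@w${id}-${n}")
        else
            idx=$(( RANDOM % ${#created[@]} ))
            run rbd snap rm "${created[$idx]}"
            unset "created[$idx]"
            created=("${created[@]}")   # re-pack indices after removal
        fi
    done
}

# Start all workers in the background and wait for them to finish.
main() {
    local w
    for w in $(seq 1 "$NUM_WORKERS"); do
        worker "$w" &
    done
    wait
}

main
```

Restricting removals to a worker's own snapshots avoids races between workers in the sketch itself; the original test may well have removed arbitrary snapshots, which this outline does not attempt to reproduce.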

#2

Updated by Greg Farnum about 5 years ago

  • Project changed from Ceph to RADOS
  • Category changed from common to Snapshots
#3

Updated by Neha Ojha about 5 years ago

  • Priority changed from Normal to Urgent
#4

Updated by Neha Ojha about 5 years ago

  • Priority changed from Urgent to High
#5

Updated by Brad Hubbard over 4 years ago

  • Assignee set to Brad Hubbard
#6

Updated by Neha Ojha over 2 years ago

  • Priority changed from High to Normal