Bug #63493

open

Problem with PGs deep-scrubbing in Ceph

Added by Abu Sayed 6 months ago. Updated 6 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
11/09/2023
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We operate a Ceph cluster running Octopus (latest 15.2.17). The setup includes 13 hosts with 107 OSDs in total, giving a storage capacity of 259 TB at 41% disk usage. We are currently seeing a HEALTH_WARN, specifically "PG_NOT_DEEP_SCRUBBED: 632 PGs not deep-scrubbed in time".

Today, two of the 107 OSDs went down, leaving 105 OSDs operational. After enabling the "No Deep Scrub" (nodeep-scrub) flag and initiating data rebalancing, we are facing frequent OSD downtimes. This results in higher OSD latency and slow-ops log entries; for instance, a health check update reports: "6 slow ops, oldest one blocked for 43 sec, daemons [osd.35,osd.67,osd.97] have slow ops." Additionally, some VMs in the cluster are experiencing intermittent shutdowns.
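
For reference, the flag and the recovery throttling mentioned above map to standard Ceph CLI commands; a minimal sketch follows (the numeric values are only illustrative starting points, not recommendations tuned for this cluster):

  # pause deep scrubbing cluster-wide while the rebalance runs
  ceph osd set nodeep-scrub

  # throttle backfill/recovery so client (VM) I/O keeps priority
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_recovery_sleep 0.1

  # watch rebalance progress and the slow-ops warnings
  ceph -s
  ceph health detail

  # re-enable deep scrubbing once the cluster is healthy again
  ceph osd unset nodeep-scrub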

We would like to clear the 632 PGs that have not been deep-scrubbed in time without any adverse impact on the VMs running in this Ceph cluster. Any guidance on achieving this without affecting VM performance would be greatly appreciated.
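
One way to work through the overdue PGs while keeping the impact on VMs low is to confine scrubbing to off-peak hours and deep-scrub the most overdue PGs manually, a few at a time. The sketch below assumes default Octopus scrub settings; the hour window and the PG id are placeholders to be replaced with values appropriate for this cluster:

  # limit scrubbing to a nightly window and keep it gentle
  ceph config set osd osd_scrub_begin_hour 23
  ceph config set osd osd_scrub_end_hour 6
  ceph config set osd osd_max_scrubs 1
  ceph config set osd osd_scrub_sleep 0.1

  # list the PGs behind on deep scrubbing (from the HEALTH_WARN detail)
  ceph health detail | grep 'not deep-scrubbed since'

  # trigger a deep scrub on one PG at a time (2.1f is a placeholder id)
  ceph pg deep-scrub 2.1f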

Update #1

Updated by Abu Sayed 6 months ago

Additional cluster details: 2057 PGs in total, about 58.8 PGs per OSD, and roughly 9.4M objects.


