Bug #63493
Problem with PGs deep-scrubbing in Ceph
Status: open
Description
Hi,
We operate a Ceph cluster running the Octopus release (latest, 15.2.17). The setup includes 13 hosts totaling 107 OSDs, with a storage capacity of 259 TB at 41% disk usage. We are currently seeing a HEALTH_WARN, specifically "PG_NOT_DEEP_SCRUBBED: 632 PGs not deep-scrubbed in time".
Today, two of the 107 OSDs went down, leaving 105 OSDs operational. After setting the "nodeep-scrub" flag and initiating data rebalancing, we are facing frequent OSD downtimes. This results in higher OSD latency and slow-ops log entries; for instance, a health check update reports: "6 slow ops, oldest one blocked for 43 sec, daemons [osd.35,osd.67,osd.97] have slow ops." Additionally, some VMs in the cluster are experiencing intermittent shutdowns.
We would like to clear the 632 PGs that have not been deep-scrubbed without causing any adverse impact on the VMs running on this Ceph cluster. Any guidance on achieving this without affecting VM performance would be greatly appreciated.
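One common approach to working through a deep-scrub backlog gradually is to re-enable deep scrubbing and then throttle how aggressively OSDs scrub, so scrubbing competes less with client I/O. The sketch below uses standard OSD configuration options that exist in Octopus, but the values shown are illustrative assumptions, not recommendations from this report; verify defaults and tune for your hardware.

```shell
# Re-enable deep scrubbing once rebalancing has settled and the cluster is stable
ceph osd unset nodeep-scrub

# Throttle scrubbing cluster-wide at runtime (values are illustrative)
ceph config set osd osd_max_scrubs 1              # at most one concurrent scrub per OSD
ceph config set osd osd_scrub_sleep 0.2           # sleep between scrub chunk reads, easing I/O pressure
ceph config set osd osd_scrub_load_threshold 0.5  # skip starting scrubs when host load is high

# Optionally confine scheduled scrubs to off-peak hours (here 22:00-06:00)
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
```

With `osd_max_scrubs 1` and a scrub sleep in place, the backlog clears more slowly, but the latency spikes and slow ops seen during scrubbing are usually much less pronounced.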
Updated by Abu Sayed 6 months ago
Abu Sayed wrote:
We're currently encountering a HEALTH_WARN issue, specifically "PG_NOT_DEEP_SCRUBBED: 632 PGs not deep-scrubbed in time".
Additional details on the cluster described above (Octopus 15.2.17, 13 hosts, 107 OSDs, 259 TB capacity at 41% usage): 2057 PGs in total, 58.8 PGs per OSD, 9.4M objects.
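To chip away at the 632 overdue PGs manually rather than waiting for the scheduler, the oldest-scrubbed PGs can be listed and deep-scrubbed a few at a time. This is a hedged sketch: the JSON field layout of `ceph pg dump` varies somewhat across releases (in some versions `pg_stats` is nested under `pg_map`), and `<pgid>` is a placeholder for a real PG ID from the listing.

```shell
# List PGs ordered by their last deep-scrub timestamp, oldest first
# (adjust the jq path if your release nests pg_stats under .pg_map)
ceph pg dump pgs --format json 2>/dev/null | \
  jq -r '.pg_stats[] | [.pgid, .last_deep_scrub_stamp] | @tsv' | \
  sort -k2 | head -20

# Manually trigger a deep scrub on one PG at a time, watching latency between runs
ceph pg deep-scrub <pgid>
```

Issuing these one or a few at a time, while monitoring `ceph -s` and OSD latency, keeps the extra read load bounded compared with letting all 632 PGs scrub in a burst.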