Bug #63493
open
Problem with Pgs Deep-scrubbing ceph
Added by Abu Sayed 6 months ago.
Updated 6 months ago.
Description
Hi,
We operate a Ceph cluster running Octopus (latest, 15.2.17). The setup includes 13 hosts totaling 107 OSDs, with a storage capacity of 259 TB at 41% disk usage. We're currently encountering a HEALTH_WARN issue, specifically "PG_NOT_DEEP_SCRUBBED: 632 pgs not deep-scrubbed in time".
Today, two of the 107 OSDs went down, leaving 105 operational. After setting the nodeep-scrub flag and initiating data rebalancing, we're seeing frequent OSD downtime, higher OSD latency, and slow-ops log entries. For instance, a health check update reports: "6 slow ops, oldest one blocked for 43 sec, daemons [osd.35,osd.67,osd.97] have slow ops." Additionally, some VMs in the cluster are experiencing intermittent shutdowns.
We'd like to resolve the 632 PGs that haven't been deep-scrubbed without any adverse impact on the VMs running on this cluster. Any guidance on achieving this without affecting VM performance would be greatly appreciated.
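One commonly suggested approach is to re-enable deep scrubbing but throttle it so it competes less with client I/O, then let the cluster work through the backlog gradually. A sketch, assuming the standard `ceph` CLI (option names as in Octopus; the values and the PG id `2.1f` are purely illustrative, not tuned recommendations):

```shell
# Re-enable deep scrubbing if the nodeep-scrub flag was set earlier
ceph osd unset nodeep-scrub

# Keep concurrent scrubs per OSD at the minimum (1 is the Octopus default)
ceph config set osd osd_max_scrubs 1

# Sleep between scrub chunks to reduce the latency impact on client I/O
ceph config set osd osd_scrub_sleep 0.5

# Confine scrubbing to off-peak hours (here: 22:00-06:00)
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6

# Manually kick off a deep scrub on a specific overdue PG (example id)
ceph pg deep-scrub 2.1f
```

These commands require a live cluster, so they are shown here only as an ops fragment; the warning clears as each PG's deep-scrub timestamp is refreshed.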
Abu Sayed wrote:
Additional cluster details: 2,057 PGs in total, about 58.8 PGs per OSD, and 9.4M objects.
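To work through a backlog like this, a common tactic is to deep-scrub the most overdue PGs first, a few at a time. A minimal sketch (the pgid/timestamp pairs below are made-up sample data; on a live cluster they would come from `ceph pg dump pgs`, where the deep-scrub timestamp column position varies by release, so check the header line first):

```shell
# Sample "pgid last_deep_scrub_stamp" pairs (illustrative only)
sample='2.1f 2023-05-01T10:00:00
2.3a 2023-04-12T08:30:00
3.07 2023-06-20T21:15:00'

# Sort by the timestamp field so the most overdue PGs come first;
# the top few can then be fed to `ceph pg deep-scrub <pgid>`
printf '%s\n' "$sample" | sort -k2 | head -n 20
```

Issuing manual deep scrubs this way, in small batches during off-peak hours, spreads the catch-up load instead of letting all 632 PGs queue at once.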