Bug #58435

LRC cluster : Over 1000 PGs not deep-scrubbed in time

Added by Prashant D over 1 year ago. Updated over 1 year ago.

Status: In Progress
Priority: Normal
Assignee:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

This tracker is to investigate and troubleshoot the "over 1000 PGs not deep-scrubbed in time" warning on the LRC cluster.

"ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)"

Actions #1

Updated by Prashant D over 1 year ago

  • Subject changed from LRC cluster : Over 1000 to LRC cluster : Over 1000 PGs not deep-scrubbed in time
  • Description updated (diff)
  • Assignee set to adam kraitman
  • Source set to Community (dev)
Actions #2

Updated by Prashant D over 1 year ago

@Adam DC949, I have assigned this tracker to you for now. Let's track the progress of the "PGs not deep-scrubbed in time" issue through this tracker.

Kindly share the output of the below commands for now:
1. ceph -s
2. ceph health detail

Actions #3

Updated by adam kraitman over 1 year ago

  • Status changed from New to In Progress

ceph -s
  cluster:
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_WARN
            1 clients failing to respond to capability release
            1 clients failing to advance oldest client/flush tid
            2 MDSs report slow metadata IOs
            2 MDSs report slow requests
            1 MDSs behind on trimming
            1793 pgs not deep-scrubbed in time
            1807 pgs not scrubbed in time

  services:
    mon:         5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 15h)
    mgr:         reesi006.erytot(active, since 2w), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds:         4/4 daemons up, 5 standby, 1 hot standby
    osd:         166 osds: 166 up (since 12d), 166 in (since 9d); 30 remapped pgs
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 4 portals active (4 hosts)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 111.06M objects, 124 TiB
    usage:   218 TiB used, 843 TiB / 1.0 PiB avail
    pgs:     1682156/532372054 objects misplaced (0.316%)
             2935 active+clean
             30   active+remapped+backfilling

  io:
    client: 305 KiB/s rd, 580 KiB/s wr, 7 op/s rd, 6 op/s wr

  progress:
    Global Recovery Event (11d)
      [===========================.] (remaining: 2h)
Actions #4

Updated by adam kraitman over 1 year ago

ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 clients failing to advance oldest client/flush tid; 2 MDSs report slow metadata IOs; 2 MDSs report slow requests; 1 MDSs behind on trimming; 1793 pgs not deep-scrubbed in time; 1807 pgs not scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability release
mds.teuthology.reesi004.dbioar(mds.0): Client teuthology:teuthology failing to respond to capability release client_id: 851189216
[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid
mds.teuthology.reesi004.dbioar(mds.0): Client teuthology:teuthology failing to advance its oldest client/flush tid. client_id: 851189216
[WRN] MDS_SLOW_METADATA_IO: 2 MDSs report slow metadata IOs
mds.cephfs.reesi002.euduff(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 118855 secs
mds.teuthology.reesi004.dbioar(mds.0): 100+ slow metadata IOs are blocked > 30 secs, oldest blocked for 121102 secs
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
mds.cephfs.reesi002.euduff(mds.0): 1 slow requests are blocked > 30 secs
mds.teuthology.reesi004.dbioar(mds.0): 6 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.teuthology.reesi004.dbioar(mds.0): Behind on trimming (1355/128) max_segments: 128, num_segments: 1355
[WRN] PG_NOT_DEEP_SCRUBBED: 1793 pgs not deep-scrubbed in time
pg 119.2b4 not deep-scrubbed since 2022-12-27T06:14:50.612728+0000
pg 0.2c2 not deep-scrubbed since 2023-01-02T08:46:19.134656+0000
pg 119.2b6 not deep-scrubbed since 2022-12-31T20:38:58.893895+0000
pg 119.2b7 not deep-scrubbed since 2022-12-30T01:19:10.708248+0000
pg 0.2c0 not deep-scrubbed since 2022-12-30T19:18:43.436138+0000
pg 119.2c8 not deep-scrubbed since 2022-12-26T20:09:24.668798+0000
pg 0.2bf not deep-scrubbed since 2022-12-31T01:43:15.081715+0000
pg 119.2c9 not deep-scrubbed since 2022-12-31T10:07:37.742396+0000
pg 0.2be not deep-scrubbed since 2023-01-02T01:27:06.871557+0000
pg 119.2ca not deep-scrubbed since 2022-12-31T19:01:11.250270+0000
pg 119.2cb not deep-scrubbed since 2023-01-02T13:24:05.693997+0000
pg 119.2cc not deep-scrubbed since 2022-12-31T22:37:26.859288+0000
pg 119.2cd not deep-scrubbed since 2023-01-01T23:14:27.223663+0000
pg 0.2ba not deep-scrubbed since 2022-12-27T06:01:07.199822+0000
pg 119.2ce not deep-scrubbed since 2022-12-26T23:22:12.447319+0000
pg 0.2b9 not deep-scrubbed since 2022-12-28T14:55:05.219263+0000
pg 119.2cf not deep-scrubbed since 2022-12-31T13:34:30.875371+0000
pg 0.2b8 not deep-scrubbed since 2022-12-30T03:21:41.014742+0000
pg 119.2c0 not deep-scrubbed since 2023-01-01T04:07:06.900046+0000
pg 0.2b7 not deep-scrubbed since 2022-12-27T04:55:12.830118+0000
pg 119.2c1 not deep-scrubbed since 2022-12-31T16:49:19.678254+0000
pg 119.2c3 not deep-scrubbed since 2023-01-02T16:47:24.024740+0000
pg 119.2c4 not deep-scrubbed since 2022-12-27T14:28:20.564920+0000
pg 0.2b3 not deep-scrubbed since 2022-12-31T00:23:48.933501+0000
pg 119.2c5 not deep-scrubbed since 2023-01-01T01:52:13.277090+0000
pg 0.2b2 not deep-scrubbed since 2022-12-26T12:14:42.502961+0000
pg 119.2c6 not deep-scrubbed since 2022-12-27T08:06:12.711551+0000
pg 0.2b0 not deep-scrubbed since 2022-12-28T19:49:10.664543+0000
pg 119.2d8 not deep-scrubbed since 2023-01-01T20:44:23.076772+0000
pg 119.2d9 not deep-scrubbed since 2022-12-29T02:58:32.604225+0000
pg 0.2ae not deep-scrubbed since 2022-12-27T15:03:31.506979+0000
pg 119.2da not deep-scrubbed since 2023-01-01T02:54:30.862142+0000
pg 0.2ad not deep-scrubbed since 2022-12-29T10:20:11.619085+0000
pg 0.2ac not deep-scrubbed since 2022-12-27T10:28:13.945387+0000
pg 0.2ab not deep-scrubbed since 2022-12-26T00:21:39.670081+0000
pg 119.2dd not deep-scrubbed since 2022-12-28T16:41:59.317013+0000
pg 0.2aa not deep-scrubbed since 2022-12-30T06:00:26.282533+0000
pg 119.2de not deep-scrubbed since 2022-12-25T15:17:00.244522+0000
pg 0.2a9 not deep-scrubbed since 2023-01-01T05:54:24.639117+0000
pg 119.2df not deep-scrubbed since 2022-12-30T20:00:46.192152+0000
pg 0.2a8 not deep-scrubbed since 2022-12-28T07:16:33.555427+0000
pg 0.2a6 not deep-scrubbed since 2022-12-28T05:31:03.112043+0000
pg 119.2d2 not deep-scrubbed since 2022-12-30T02:11:00.989507+0000
pg 119.2d3 not deep-scrubbed since 2022-12-29T18:56:42.958643+0000
pg 119.2d5 not deep-scrubbed since 2022-12-28T18:19:52.554791+0000
pg 0.2a2 not deep-scrubbed since 2023-01-02T01:45:14.599850+0000
pg 119.2d6 not deep-scrubbed since 2023-01-02T15:49:24.603043+0000
pg 119.2e8 not deep-scrubbed since 2022-12-30T23:32:57.883423+0000
pg 0.29f not deep-scrubbed since 2023-01-01T11:18:18.415364+0000
pg 119.2e9 not deep-scrubbed since 2022-12-31T17:40:22.072385+0000
1743 more pgs...
[WRN] PG_NOT_SCRUBBED: 1807 pgs not scrubbed in time
pg 119.2b4 not scrubbed since 2023-01-02T10:12:21.751362+0000
pg 0.2c2 not scrubbed since 2023-01-02T08:46:19.134656+0000
pg 119.2b6 not scrubbed since 2022-12-31T20:38:58.893895+0000
pg 119.2b7 not scrubbed since 2023-01-01T12:55:04.671272+0000
pg 0.2c0 not scrubbed since 2023-01-02T05:56:57.855432+0000
pg 119.2c8 not scrubbed since 2023-01-01T20:24:49.943938+0000
pg 0.2bf not scrubbed since 2023-01-01T11:45:38.552610+0000
pg 119.2c9 not scrubbed since 2023-01-01T21:45:08.349107+0000
pg 0.2be not scrubbed since 2023-01-02T01:27:06.871557+0000
pg 119.2ca not scrubbed since 2023-01-02T05:46:12.068516+0000
pg 119.2cb not scrubbed since 2023-01-02T13:24:05.693997+0000
pg 119.2cc not scrubbed since 2023-01-02T10:01:59.152581+0000
pg 119.2cd not scrubbed since 2023-01-01T23:14:27.223663+0000
pg 0.2ba not scrubbed since 2023-01-02T09:26:59.239127+0000
pg 119.2ce not scrubbed since 2023-01-02T10:30:58.639675+0000
pg 0.2b9 not scrubbed since 2023-01-01T14:29:29.466896+0000
pg 119.2cf not scrubbed since 2023-01-01T18:13:59.491278+0000
pg 0.2b8 not scrubbed since 2023-01-01T03:56:55.993087+0000
pg 119.2c0 not scrubbed since 2023-01-02T09:13:00.551010+0000
pg 0.2b7 not scrubbed since 2023-01-02T12:12:55.946488+0000
pg 119.2c1 not scrubbed since 2023-01-02T03:27:23.897162+0000
pg 119.2c3 not scrubbed since 2023-01-02T16:47:24.024740+0000
pg 119.2c4 not scrubbed since 2023-01-02T06:31:44.052880+0000
pg 0.2b3 not scrubbed since 2023-01-01T05:37:39.930279+0000
pg 119.2c5 not scrubbed since 2023-01-02T04:07:24.719026+0000
pg 0.2b2 not scrubbed since 2023-01-01T16:35:31.354667+0000
pg 119.2c6 not scrubbed since 2023-01-02T15:07:38.887440+0000
pg 0.2b0 not scrubbed since 2023-01-01T09:55:41.298467+0000
pg 119.2d8 not scrubbed since 2023-01-01T20:44:23.076772+0000
pg 119.2d9 not scrubbed since 2023-01-01T20:39:23.873618+0000
pg 0.2ae not scrubbed since 2023-01-02T00:03:56.659878+0000
pg 119.2da not scrubbed since 2023-01-02T11:57:52.385493+0000
pg 0.2ad not scrubbed since 2023-01-01T22:44:39.834760+0000
pg 0.2ac not scrubbed since 2023-01-01T13:30:57.946303+0000
pg 0.2ab not scrubbed since 2023-01-01T19:36:35.559123+0000
pg 119.2dd not scrubbed since 2023-01-01T12:44:18.758947+0000
pg 0.2aa not scrubbed since 2023-01-01T17:45:27.847279+0000
pg 119.2de not scrubbed since 2022-12-31T12:09:34.820962+0000
pg 0.2a9 not scrubbed since 2023-01-02T10:31:36.723621+0000
pg 119.2df not scrubbed since 2023-01-02T06:54:44.847575+0000
pg 0.2a8 not scrubbed since 2023-01-01T14:37:30.598536+0000
pg 0.2a6 not scrubbed since 2023-01-01T11:31:13.518760+0000
pg 119.2d2 not scrubbed since 2023-01-01T14:26:07.502322+0000
pg 119.2d3 not scrubbed since 2023-01-02T11:29:41.310796+0000
pg 119.2d5 not scrubbed since 2023-01-02T00:04:25.884280+0000
pg 0.2a2 not scrubbed since 2023-01-02T01:45:14.599850+0000
pg 119.2d6 not scrubbed since 2023-01-02T15:49:24.603043+0000
pg 119.2e8 not scrubbed since 2023-01-01T11:43:41.755575+0000
pg 0.29f not scrubbed since 2023-01-01T11:18:18.415364+0000
pg 119.2e9 not scrubbed since 2023-01-01T21:24:37.756534+0000
1757 more pgs...

Actions #5

Updated by Neha Ojha over 1 year ago

  • Description updated (diff)
Actions #6

Updated by Prashant D over 1 year ago

From the ceph pg dump output:

  • PG size for the largest pool on the LRC cluster
    Avg PG size for pool 119 is around 110 GiB (110 TiB stored across 1024 PGs):
    cephfs.teuthology.data-ec 119 1024 110 TiB 66.38M 168 TiB 22.33 390 TiB
  • PGs not deep-scrubbed
    Oldest deep-scrub timestamp: 2022-12-24T03:32:37.140813+0000
    Newest deep-scrub timestamp: 2023-01-15T10:15:02.115763+0000

Approx. 22 days are required to complete deep-scrubbing of all PGs (see the sketch after this list).

  • PGs not scrubbed
    Oldest scrub timestamp: 2022-12-30T14:54:19.783024+0000
    Newest scrub timestamp: 2023-01-15T10:15:02.115763+0000

Approx. 16 days are required to complete scrubbing of all PGs.

  • Based on the LAST_SCRUB_DURATION stats, deep-scrubbing completes within about an hour for most PGs, and scrubbing within 1-10 seconds.
    LAST_SCRUB_DURATION
    1528
    1960
    1969
    2355
    2631
    2639
    2822
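
For reference, these numbers can be reproduced from the same ceph pg dump output. A minimal sketch only (the header-driven column lookup and GNU date usage are assumptions, not commands that were run on the LRC cluster):

# Oldest and newest DEEP_SCRUB_STAMP values, picking the column by its header name.
ceph pg dump pgs 2>/dev/null | awk '
  /DEEP_SCRUB_STAMP/ { for (i = 1; i <= NF; i++) if ($i == "DEEP_SCRUB_STAMP") c = i; next }
  c && $1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $c }' | sort | sed -n '1p;$p'

# Span between the two stamps above in whole days (GNU date):
echo $(( ( $(date -d "2023-01-15T10:15:02+0000" +%s) - $(date -d "2022-12-24T03:32:37+0000" +%s) ) / 86400 )) days   # -> 22 days

# Largest LAST_SCRUB_DURATION values (in seconds), same header-driven approach.
ceph pg dump pgs 2>/dev/null | awk '
  /LAST_SCRUB_DURATION/ { for (i = 1; i <= NF; i++) if ($i == "LAST_SCRUB_DURATION") c = i; next }
  c && $1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $c }' | sort -n | tail -7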

We will need the current scrub settings from the cluster:

On the ivan01 node, get the osd.20 config dump:
  1. cephadm shell
  2. ceph daemon osd.20 config show | grep scrub

Or

Collect the output of the below commands from the cephadm admin host:
  1. cephadm shell
  2. ceph config get osd osd_scrub_begin_week_day
  3. ceph config get osd osd_scrub_end_week_day
  4. ceph config get osd osd_scrub_begin_hour
  5. ceph config get osd osd_scrub_end_hour
  6. ceph config get osd osd_scrub_min_interval
  7. ceph config get osd osd_scrub_max_interval
  8. ceph config get osd osd_deep_scrub_interval
  9. ceph config get osd osd_max_scrubs

We can consider increasing osd_max_scrubs during non-production hours to avoid slowing down client ops and incurring higher network overhead. Alternatively, a better approach would be to widen the scrub window to allow the cluster to finish the scrubbing activity. A sketch of both options follows.
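
A minimal sketch of the two options mentioned above. The values are illustrative only, not a recommendation for this cluster, and assume the current settings restrict scrubbing (which is exactly what the config queries above are meant to confirm):

# Temporarily allow more concurrent scrubs per OSD (the quincy default is 1).
ceph config set osd osd_max_scrubs 2

# Open the scrub window to any hour/day (begin == end == 0 means no restriction),
# in case the current settings confine scrubbing to a narrow window.
ceph config set osd osd_scrub_begin_hour 0
ceph config set osd osd_scrub_end_hour 0
ceph config set osd osd_scrub_begin_week_day 0
ceph config set osd osd_scrub_end_week_day 0

# Revert any of these once the backlog clears, e.g.:
ceph config rm osd osd_max_scrubs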

Actions #7

Updated by adam kraitman over 1 year ago

Hey Prashant, I don't have cephadm on ivan01.

root@ivan01:~# cephadm shell

Command 'cephadm' not found, but can be installed with:

root@ivan01:~# apt-get install cephadm --fix-missing
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
cephadm
0 upgraded, 1 newly installed, 0 to remove and 47 not upgraded.
Need to get 69.2 kB of archives.
After this operation, 343 kB of additional disk space will be used.
Err:1 https://download.ceph.com/debian-quincy focal/main amd64 cephadm amd64 17.2.0-1focal
404 Not Found [IP: 158.69.68.124 443]
E: Failed to fetch https://download.ceph.com/debian-quincy/pool/main/c/ceph/cephadm_17.2.0-1focal_amd64.deb 404 Not Found [IP: 158.69.68.124 443]
E: Internal Error, ordering was unable to handle the media swap
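
The 404 most likely means the apt package index on ivan01 is stale: download.ceph.com no longer carries the 17.2.0-1focal build that the cached index points at. A minimal sketch, assuming the existing debian-quincy source entry itself is still correct:

# Refresh the index so apt picks up the cephadm build currently in the quincy repo, then retry.
apt-get update
apt-get install cephadm

# Once cephadm is available, the scrub settings can be gathered non-interactively, e.g.:
cephadm shell -- ceph config get osd osd_max_scrubs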
