Bug #62811

open

PGs stuck in backfilling state after their primary OSD is removed by setting its crush weight to 0.

Added by Sridhar Seshasayee 8 months ago. Updated 8 months ago.

Status:
New
Priority:
High
Category:
Backfill/Recovery
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am pasting the problem description from the original BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2233777

Description of problem:

We are observing that the OSD gets stuck in the draining process, and not all of the PGs are removed from the OSD. It looks like the drain is stuck on PGs where the OSD being removed is the primary OSD of that PG.

We had these configs set on the cluster during testing.

mon                                            advanced  mon_osd_down_out_subtree_limit         host
osd                                            advanced  osd_async_recovery_min_cost            1099511627776
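For reference, these settings correspond to the following commands (illustrative only; standard ceph config set syntax, values as shown in the dump above):

# ceph config set mon mon_osd_down_out_subtree_limit host
# ceph config set osd osd_async_recovery_min_cost 1099511627776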

Procedure:
1. Deploy a 4-node RHCS cluster with a 2+2 EC pool and fill it with data.
# ceph -s
  cluster:
    id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum argo012,argo013,argo014 (age 8d)
    mgr: argo013.akdhka(active, since 12d), standbys: argo012.odttqx, argo014.xfhnzv
    osd: 35 osds: 35 up (since 67m), 35 in (since 30h)
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    pools:   7 pools, 673 pgs
    objects: 4.28M objects, 3.0 TiB
    usage:   6.6 TiB used, 7.6 TiB / 14 TiB avail
    pgs:     665 active+clean
             7   active+clean+scrubbing+deep
             1   active+clean+scrubbing

  io:
    client:   7.2 KiB/s rd, 0 B/s wr, 7 op/s rd, 4 op/s wr

[ceph: root@argo012 /]# ceph df
--- RAW STORAGE ---
CLASS    SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    14 TiB  7.6 TiB  6.6 TiB   6.6 TiB      46.42
TOTAL  14 TiB  7.6 TiB  6.6 TiB   6.6 TiB      46.42

--- POOLS ---
POOL                       ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                        1    1   23 MiB        5   68 MiB      0    1.8 TiB
.rgw.root                   2   32  1.3 KiB        4   48 KiB      0    1.8 TiB
default.rgw.log             3   32  3.6 KiB      209  408 KiB      0    1.8 TiB
default.rgw.control         4   32      0 B        8      0 B      0    1.8 TiB
default.rgw.meta            5   32  8.9 KiB       18  207 KiB      0    1.8 TiB
ec22-pool                   6  512  3.0 TiB    4.28M  6.1 TiB  52.35    2.8 TiB
default.rgw.buckets.index   7   32  1.1 GiB       90  3.2 GiB   0.06    1.8 TiB

# ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         14.21661         -   14 TiB  6.6 TiB  6.1 TiB  3.2 GiB   61 GiB  7.6 TiB  46.42  1.00    -          root default
-7          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  642 MiB   15 GiB  2.0 TiB  45.23  0.97    -              host argo012
 3    hdd   0.40619   1.00000  416 GiB  183 GiB  167 GiB  111 MiB  1.6 GiB  233 GiB  43.89  0.95   74      up          osd.3
 6    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   57 MiB  1.4 GiB  222 GiB  46.68  1.01   64      up          osd.6
10    hdd   0.40619   1.00000  416 GiB  176 GiB  160 GiB   23 MiB  1.4 GiB  240 GiB  42.32  0.91   61      up          osd.10
15    hdd   0.40619   1.00000  416 GiB  191 GiB  175 GiB   26 MiB  2.0 GiB  225 GiB  45.95  0.99   71      up          osd.15
19    hdd   0.40619   1.00000  416 GiB  189 GiB  173 GiB  180 MiB  1.5 GiB  227 GiB  45.32  0.98   69      up          osd.19
23    hdd   0.40619   1.00000  416 GiB  177 GiB  161 GiB   33 KiB  1.5 GiB  239 GiB  42.43  0.91   67      up          osd.23
27    hdd   0.40619   1.00000  416 GiB  185 GiB  169 GiB  2.9 MiB  1.4 GiB  231 GiB  44.58  0.96   69      up          osd.27
31    hdd   0.40619   1.00000  416 GiB  204 GiB  188 GiB  233 MiB  1.6 GiB  212 GiB  49.06  1.06   81      up          osd.31
35    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB  9.8 MiB  2.2 GiB  221 GiB  46.87  1.01   75      up          osd.35
-3          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  979 MiB   15 GiB  2.0 TiB  45.23  0.97    -              host argo013
 0    hdd   0.40619   1.00000  416 GiB  179 GiB  163 GiB  312 MiB  1.8 GiB  237 GiB  43.05  0.93   67      up          osd.0
 4    hdd   0.40619   1.00000  416 GiB  191 GiB  175 GiB  149 MiB  1.8 GiB  225 GiB  46.01  0.99   77      up          osd.4
 8    hdd   0.40619   1.00000  416 GiB  186 GiB  170 GiB  134 MiB  1.3 GiB  230 GiB  44.63  0.96   72      up          osd.8
12    hdd   0.40619   1.00000  416 GiB  192 GiB  176 GiB    1 KiB  1.5 GiB  224 GiB  46.04  0.99   67      up          osd.12
16    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB   10 MiB  1.9 GiB  228 GiB  45.27  0.98   71      up          osd.16
20    hdd   0.40619   1.00000  416 GiB  186 GiB  170 GiB   46 MiB  1.7 GiB  230 GiB  44.70  0.96   71      up          osd.20
24    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   77 MiB  1.5 GiB  221 GiB  46.78  1.01   72      up          osd.24
28    hdd   0.40619   1.00000  416 GiB  179 GiB  163 GiB   90 MiB  1.1 GiB  237 GiB  43.06  0.93   58      up          osd.28
32    hdd   0.40619   1.00000  416 GiB  198 GiB  182 GiB  161 MiB  1.9 GiB  218 GiB  47.56  1.02   76      up          osd.32
-5          3.65570         -  3.7 TiB  1.7 TiB  1.5 TiB  683 MiB   16 GiB  2.0 TiB  45.23  0.97    -              host argo014
 2    hdd   0.40619   1.00000  416 GiB  183 GiB  167 GiB  136 MiB  1.8 GiB  233 GiB  44.02  0.95   72      up          osd.2
 7    hdd   0.40619   1.00000  416 GiB  185 GiB  169 GiB  9.7 MiB  1.6 GiB  231 GiB  44.54  0.96   69      up          osd.7
11    hdd   0.40619   1.00000  416 GiB  206 GiB  190 GiB    1 KiB  2.1 GiB  210 GiB  49.59  1.07   78      up          osd.11
14    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB  5.2 MiB  1.8 GiB  228 GiB  45.25  0.97   61      up          osd.14
18    hdd   0.40619   1.00000  416 GiB  188 GiB  172 GiB  176 MiB  1.4 GiB  228 GiB  45.27  0.98   77      up          osd.18
22    hdd   0.40619   1.00000  416 GiB  182 GiB  166 GiB  152 MiB  1.8 GiB  234 GiB  43.79  0.94   70      up          osd.22
26    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   25 MiB  1.9 GiB  222 GiB  46.66  1.01   66      up          osd.26
30    hdd   0.40619   1.00000  416 GiB  192 GiB  176 GiB  107 MiB  1.8 GiB  224 GiB  46.20  1.00   73      up          osd.30
34    hdd   0.40619   1.00000  416 GiB  174 GiB  158 GiB   72 MiB  1.5 GiB  242 GiB  41.77  0.90   68      up          osd.34
-9          3.24951         -  3.2 TiB  1.6 TiB  1.5 TiB  934 MiB   16 GiB  1.6 TiB  50.41  1.09    -              host argo017
 5    hdd   0.40619   1.00000  416 GiB  234 GiB  218 GiB  162 MiB  2.2 GiB  182 GiB  56.26  1.21   86      up          osd.5
 9    hdd   0.40619   1.00000  416 GiB  207 GiB  191 GiB  124 MiB  2.3 GiB  209 GiB  49.75  1.07   83      up          osd.9
13    hdd   0.40619   1.00000  416 GiB  194 GiB  178 GiB   45 MiB  1.9 GiB  222 GiB  46.61  1.00   72      up          osd.13
17    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   48 MiB  1.6 GiB  221 GiB  46.86  1.01   73      up          osd.17
21    hdd   0.40619   1.00000  416 GiB  219 GiB  203 GiB  112 MiB  2.1 GiB  197 GiB  52.60  1.13   84      up          osd.21
25    hdd   0.40619   1.00000  416 GiB  195 GiB  179 GiB   36 MiB  1.7 GiB  221 GiB  46.78  1.01   72      up          osd.25
29    hdd   0.40619   1.00000  416 GiB  203 GiB  187 GiB  221 MiB  2.2 GiB  213 GiB  48.92  1.05   81      up          osd.29
33    hdd   0.40619   1.00000  416 GiB  231 GiB  215 GiB  185 MiB  2.0 GiB  185 GiB  55.48  1.20   84      up          osd.33
                        TOTAL   14 TiB  6.6 TiB  6.1 TiB  3.2 GiB   61 GiB  7.6 TiB  46.42
MIN/MAX VAR: 0.90/1.21  STDDEV: 3.24

2. In a healthy cluster, where all PGs are in the active+clean state, try removing an OSD from one host.
In the example below, OSD 0 from argo013 was removed.
# date
Tue Aug 22 12:10:27 UTC 2023

# ceph orch osd rm 0 --force --zap
Scheduled OSD(s) for removal.
[ceph: root@argo012 /]# date
Tue Aug 22 12:11:28 UTC 2023
[ceph: root@argo012 /]# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining   58  False    True   True  2023-08-22 12:11:27.848424

3. Initially the PGs start draining, but the drain stops after roughly 75% of the PGs have been drained. Checking the PG dump at this point shows that the drain is stuck on PGs where the OSD being removed is the primary OSD of that PG.

# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining   22  False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 04:07:22 AM UTC 2023

# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining   22  False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 05:38:27 AM UTC 2023

[root@argo012 ~]# ceph orch osd rm status
OSD  HOST     STATE     PGS  REPLACE  FORCE  ZAP   DRAIN STARTED AT
0    argo013  draining   22  False    True   True  2023-08-22 12:11:27.848424
[root@argo012 ~]# date
Wed Aug 23 11:01:42 AM UTC 2023

At this point, we also do not see any recovery activity in the ceph -s output.

# ceph -s
  cluster:
    id:     66070a80-2f84-11ee-bc2c-0cc47af3ea56
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum argo012,argo013,argo014 (age 9d)
    mgr: argo013.akdhka(active, since 13d), standbys: argo012.odttqx, argo014.xfhnzv
    osd: 35 osds: 35 up (since 21h), 35 in (since 2d); 22 remapped pgs
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    pools:   7 pools, 673 pgs
    objects: 4.42M objects, 3.0 TiB
    usage:   6.6 TiB used, 7.6 TiB / 14 TiB avail
    pgs:     171950/17688441 objects misplaced (0.972%)
             651 active+clean
             16  active+remapped+backfilling
             6   active+remapped+backfill_wait

  io:
    client:   10 KiB/s rd, 143 KiB/s wr, 14 op/s rd, 27 op/s wr

[root@argo012 ~]# ceph pg dump | grep backfill
6.1bd       8679                   0         0       8679        0  6519914496            0           0   7605      7605    active+remapped+backfilling  2023-08-22T12:11:40.514461+0000   19194'35959   19194:222764     [4,3,9,18]           4     [0,3,9,18]               0   18421'35707  2023-08-20T18:45:07.674753+0000      18421'35707  2023-08-18T14:56:15.799803+0000              0                   36  queued for scrub                                                             8427                0
6.1b8       8517                   0         0      17034        0  6200819712            0           0   7393      7393    active+remapped+backfilling  2023-08-22T12:11:41.238886+0000   19194'35586   19194:218781  [20,23,26,13]          20   [0,23,26,25]               0   18422'35313  2023-08-20T17:33:29.237283+0000      18422'35313  2023-08-20T17:33:29.237283+0000              0                  175  queued for scrub                                                             8244                0
6.1af       8479                   0         0       8479        0  6370885632            0           0   7261      7261    active+remapped+backfilling  2023-08-22T12:11:40.475878+0000   19194'35370   19194:192817    [12,18,3,9]          12     [0,18,3,9]               0   18422'35131  2023-08-22T12:09:30.568001+0000      18422'35131  2023-08-22T12:09:30.568001+0000              0                  560  periodic scrub scheduled @ 2023-08-23T15:52:04.077250+0000                   8240                0
6.1aa       8662                   0         0       8662        0  6562643968            0           0   7425      7425    active+remapped+backfilling  2023-08-22T12:11:41.434841+0000   19194'35518   19194:193262   [12,11,6,13]          12    [0,11,6,13]               0   18422'35250  2023-08-20T16:57:40.112433+0000      18337'33987  2023-08-14T17:18:34.663615+0000              0                   36  queued for deep scrub                                                        8394                0
6.19d       8627                   0         0       8627        0  6525550592            0           0   7314      7314    active+remapped+backfilling  2023-08-22T12:11:39.263735+0000   19194'35434   19194:197745   [16,25,26,6]          16    [0,25,26,6]               0   18422'35151  2023-08-20T09:25:48.183947+0000      18396'34264  2023-08-16T11:51:33.787266+0000              0                   35  queued for scrub                                                             8344                0
6.180       8826                   0         0       8826        0  6708264960            0           0   6784      6784    active+remapped+backfilling  2023-08-22T12:11:40.553126+0000   19194'34345   19194:182555  [32,30,17,31]          32   [0,30,17,31]               0   18421'34084  2023-08-20T01:22:52.567792+0000      18421'34084  2023-08-20T01:22:52.567792+0000              0                  176  queued for scrub                                                             8565                0
6.59        8661                   0         0       8661        0  6445924352            0           0   7796      7796    active+remapped+backfilling  2023-08-22T14:53:10.041447+0000   19194'36355   19194:117635  [28,18,13,15]          28   [0,18,13,15]               0   18421'36082  2023-08-22T11:26:01.536220+0000      18421'36082  2023-08-22T11:26:01.536220+0000              0                  382  periodic scrub scheduled @ 2023-08-23T22:36:35.973627+0000                   8388                0
6.4d        8510                   0         0       8510        0  6429343744            0           0   7680      7680  active+remapped+backfill_wait  2023-08-22T14:48:22.748410+0000   19194'35958   19194:159698  [12,25,18,19]          12   [0,25,18,19]               0   18422'35685  2023-08-22T11:38:43.676947+0000      18411'35524  2023-08-17T05:34:58.535235+0000              0                   35  periodic scrub scheduled @ 2023-08-23T11:46:45.368400+0000                   8237                0
6.4a        8603                   0         0      17206        0  6389039104            0           0   7624      7624  active+remapped+backfill_wait  2023-08-22T14:52:58.531620+0000   19194'35895   19194:145778  [12,22,25,27]          12   [0,22,33,27]               0   18422'35640  2023-08-22T11:29:58.371470+0000      18422'35640  2023-08-19T23:11:14.163755+0000              0                   35  periodic scrub scheduled @ 2023-08-23T21:00:28.665562+0000                   8348                0
7.1b           3                   0         0          3        0           0     41666444      189764  10048     10048    active+remapped+backfilling  2023-08-22T12:11:41.250313+0000  19194'812502   19194:916015      [28,3,13]          28       [0,3,13]               0  18951'800381  2023-08-22T11:26:05.828515+0000     18951'800381  2023-08-22T11:26:05.828515+0000              0                    4  periodic scrub scheduled @ 2023-08-23T23:25:02.991379+0000                      3                0
3.6            6                   0         0         12        0           0            0           0   9495      9495  active+remapped+backfill_wait  2023-08-22T14:53:10.040326+0000  19194'153995   19194:252905     [26,13,27]          26       [0,13,2]               0  18951'148821  2023-08-22T11:37:46.197728+0000     18510'134991  2023-08-20T11:05:55.008644+0000              0                    1  periodic scrub scheduled @ 2023-08-23T19:10:36.035866+0000                      6                0
7.4            4                   0         0          4        0           0     31724167      138831  10068     10068    active+remapped+backfilling  2023-08-22T12:11:43.275135+0000  19194'416071   19194:486357        [8,9,2]           8        [0,9,2]               0  18422'403982  2023-08-20T14:21:33.629927+0000     18396'351102  2023-08-16T12:45:54.473320+0000              0                    1  queued for scrub                                                                4                0
7.6            7                   0         0          7        0           0     67901442      315891   4422      4422    active+remapped+backfilling  2023-08-22T12:11:41.434578+0000  18508'472722   19193:565304      [12,22,5]          12       [0,22,5]               0  18508'472722  2023-08-22T11:30:09.680199+0000     18508'472722  2023-08-20T09:18:04.840742+0000              0                    1  periodic scrub scheduled @ 2023-08-23T20:39:20.602754+0000                      7                0
6.16        8668                   0         0       8668        0  6560940032            0           0   6920      6920  active+remapped+backfill_wait  2023-08-22T14:48:33.793550+0000   19194'34730    19194:69433   [28,7,23,29]          28    [0,7,23,29]               0   18422'34477  2023-08-21T02:38:11.482845+0000      18333'33321  2023-08-14T10:17:48.909163+0000              0                   36  queued for deep scrub                                                        8415                0
6.b3        8588                   0         0       8588        0  6560677888            0           0   7315      7315  active+remapped+backfill_wait  2023-08-22T14:48:41.968101+0000   19194'35707   19194:159836    [4,3,17,34]           4    [0,3,17,34]               0   18422'35433  2023-08-21T00:43:44.822688+0000      18422'35433  2023-08-21T00:43:44.822688+0000              0                  151  queued for scrub                                                             8314                0
6.c2        8542                   0         0       8542        0  6335627264            0           0   7566      7566  active+remapped+backfill_wait  2023-08-22T14:52:39.323031+0000   19194'35591   19194:187451    [16,3,2,13]          16     [0,3,2,13]               0   18422'35321  2023-08-21T00:15:30.088884+0000      18406'34902  2023-08-16T22:56:36.951863+0000              0                   35  queued for scrub                                                             8272                0
6.c8        8385                   0         0       8385        0  6342639616            0           0   7740      7740    active+remapped+backfilling  2023-08-22T13:53:02.847391+0000   19194'36287   19194:155784     [8,9,3,22]           8     [0,9,3,22]               0   18422'36007  2023-08-20T18:07:00.914425+0000      18337'34831  2023-08-14T15:39:09.546106+0000              0                   35  queued for deep scrub                                                        8105                0
6.ea        8700                   0         0       8700        0  6701318144            0           0   8400      8400    active+remapped+backfilling  2023-08-22T12:37:57.250097+0000   19194'36507   19194:215173   [12,2,29,27]          12    [0,2,29,27]               0   18422'36214  2023-08-20T08:30:19.565009+0000      18422'36214  2023-08-19T02:49:58.782983+0000              0                   36  queued for scrub                                                             8407                0
6.f5        8692                   0         0       8692        0  6698369024            0           0   8315      8315    active+remapped+backfilling  2023-08-22T12:11:40.501888+0000   19194'36206   19194:305781  [12,30,31,17]          12   [0,30,31,17]               0   18422'35937  2023-08-20T17:30:28.074025+0000      18422'35937  2023-08-20T17:30:28.074025+0000              0                  170  queued for scrub                                                             8423                0
6.f7        8521                   0         0       8521        0  6457851904            0           0   8230      8230    active+remapped+backfilling  2023-08-22T12:11:40.120079+0000   19194'36752   19194:206013     [4,23,2,5]           4     [0,23,2,5]               0   18422'36454  2023-08-19T23:33:43.864135+0000      17969'35091  2023-08-13T21:55:02.307873+0000              0                   35  queued for deep scrub                                                        8223                0
dumped all
6.173       8556                   0         0       8556        0  6451757056            0           0   8113      8113    active+remapped+backfilling  2023-08-22T12:11:40.476004+0000   19194'36649   19194:163973   [24,31,5,22]          24    [0,31,5,22]               0   18422'36378  2023-08-20T06:28:28.826979+0000      18422'36378  2023-08-19T00:02:15.118486+0000              0                   35  queued for scrub                                                             8285                0
6.177       8588                   0         0       8588        0  6445072384            0           0   8132      8132    active+remapped+backfilling  2023-08-22T12:11:40.491822+0000   19194'36654   19194:177980    [8,22,29,3]           8    [0,22,29,3]               0   18422'36388  2023-08-20T05:53:14.247617+0000      18422'36388  2023-08-20T05:53:14.247617+0000              0                  155  queued for scrub                                                             8322                0
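In the dump above, every stuck PG lists osd.0 as the acting primary. A quicker way to list just those PGs is sketched below (assuming ceph pg ls-by-primary is available in this release and matches on the acting primary):

# ceph pg ls-by-primary osd.0 | grep -E 'backfilling|backfill_wait'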

4. The PGs remain in the same state, with no further progress in the drain or OSD removal. The OSD service was running on the target host during the entire procedure.

Version-Release number of selected component (if applicable):
ceph version 17.2.6-100.el9cp (ea4e3ef8df2cf26540aae06479df031dcfc80343) quincy (stable)

How reproducible:
2/2 OSDs on 1 Cluster

Steps to Reproduce:
1. Deploy a 4-node RHCS cluster, create an EC 2+2 pool, and fill it with data.
2. Remove a running OSD from any host: ceph orch osd rm <id> --force
3. Observe that the drain does not complete and remains stuck on a few PGs on the OSD.

Actual results:
PGs are stuck draining and the OSD removal does not go through.

Expected results:
PGs drained and OSD removed.

Additional info:

Observed that if I stop the OSD service, the drain completes within a few hours and the OSD is removed from the cluster.
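For reference, one way to stop the daemon under cephadm is shown below (illustrative only; the exact daemon name depends on the deployment):

# ceph orch daemon stop osd.0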

Files

stuck_pgs_pg_query_6.15a_acting_primary_osd34.txt (32 KB) - Stuck PGs with osd.34 as acting primary and pg 6.15a details - Sridhar Seshasayee, 09/12/2023 04:37 AM
bz2233777_osd.34_data.tgz (210 KB) - Sridhar Seshasayee, 09/12/2023 04:44 AM

Updated by Sridhar Seshasayee 8 months ago

The analysis of the issue was performed by looking at a single PG (6.15a) on osd.34 for which backfill didn't start.
I have attached a couple of log files that show:

1. PGs stuck in backfilling whose acting primary is osd.34.
2. Ceph command dumps from osd.34. (Copied from the BZ)

ANALYSIS

1. On pg 6.15as0, the first recovery item was enqueued at:

2023-09-01T01:55:35.036+0000 7fa3d9a44640 20 osd.34 op_wq(1) _enqueue OpSchedulerItem(6.15as0 PGRecovery(pgid=6.15as0 epoch_queued=21584 reserved_pushes=1) class_id 1 prio 5 cost 573165 e21584 reserved_pushes 1)

At this point, the op_wq(1) state is shown below:

2023-09-01T01:55:35.037+0000 7fa3d9a44640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":28,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...

The immediate queue size was 28; these are mainly peering events and SubOpReads.
Note that the scheduler queue size is 1, indicating that the PGRecovery item was enqueued.
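The same per-shard queue state (immediate vs. scheduler sizes and the mClock client records) can also be inspected on a running OSD via the admin socket, a sketch assuming the dump_op_pq_state command is present in this build:

# ceph daemon osd.34 dump_op_pq_state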

2. Later, apart from peering events, a SubOpRead for another PG (6.3ds2) was enqueued on op_wq(1):

2023-09-01T01:55:35.041+0000 7fa3f52ae640 15 osd.34 21584 enqueue_op 0x55c794546840 prio 5 type 110 cost 0 latency 0.000026 epoch 21584 MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3

2023-09-01T01:55:35.041+0000 7fa3f52ae640 20 osd.34 op_wq(1) _enqueue OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584)

3. Before the above SubOpRead was dequeued, a number of miscellaneous peering events were enqueued and dequeued.
The SubOpRead item was eventually dequeued after around 50 ms:

2023-09-01T01:55:35.095+0000 7fa3d9a44640 20 osd.34 op_wq(1) _process 6.3ds2 to_process <OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584)> waiting <> waiting_peering {}
2023-09-01T01:55:35.095+0000 7fa3d9a44640 20 osd.34 op_wq(1) _process OpSchedulerItem(6.3ds2 PGOpItem(op=MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3) class_id 2 prio 5 cost 0 e21584) pg 0x55c7a03dc000

...

2023-09-01T01:55:35.096+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c794546840 prio 5 cost 0 latency 0.055139 MOSDECSubOpRead(6.3ds2 21584/21582 ECSubRead(tid=53325, to_read={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=0,4194304,0}, subchunks={6:bc0000ce:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-158519:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.3ds2( v 21580'39113 (18397'34762,21580'39113] local-lis/les=21582/21583 n=11743 ec=916/247 lis/c=21582/20930 les/c/f=21583/20931/0 sis=21582) [28,6,7,25]/[28,6,34,25]p28(0) r=2 lpr=21582 pi=[20930,21582)/1 luod=0'0 crt=21580'39113 mlcod 0'0 active+remapped mbc={}]

At this stage, the op_wq(1) looked like the following:

2023-09-01T01:55:35.096+0000 7fa3d9243640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":3,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...

4. Note that the mClock scheduler queue size is still 1, with the PGRecovery item yet to be picked up.
(Time in the mClock queue so far: ~59 ms)

The above SubOpRead completed at:

2023-09-01T01:55:35.211+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c794546840 finish

Therefore, the SubOpRead op took approximately 170 ms to complete from the time it was enqueued.

Nothing else was processed on op_wq(1) during this time, while a few more items were added to the immediate queue.
This is indicated by the queue state at this point:

2023-09-01T01:55:35.211+0000 7fa3d9a44640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":6,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...

The above shows that the PGRecovery item is still in the scheduler queue.
(Time spent in the scheduler queue: 175 ms)

5. Another SubOpRead is dequeued:

2023-09-01T01:55:35.215+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7f0413600 prio 5 cost 0 latency 0.101390 MOSDECSubOpRead(6.182s2 21584/21582 ECSubRead(tid=124582, to_read={6:4180003a:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-273272:head=0,4194304,0}, subchunks={6:4180003a:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-273272:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.182s2( v 21580'37910 (18416'34408,21580'37910] local-lis/les=21582/21583 n=11783 ec=16227/247 lis/c=21582/21560 les/c/f=21583/21561/0 sis=21582) [23,20,2,5]/[23,20,34,5]p23(0) r=2 lpr=21582 pi=[21560,21582)/1 luod=0'0 crt=21580'37910 mlcod 0'0 active+remapped mbc={}]

|
| Time to complete -> 70 ms
|

2023-09-01T01:55:35.285+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7f0413600 finish

Another example:

2023-09-01T01:55:35.317+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7d21af080 prio 5 cost 0 latency 0.160248 MOSDECSubOpRead(6.191s3 21584/21582 ECSubRead(tid=77137, to_read={6:89800139:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-816949:head=0,4194304,0}, subchunks={6:89800139:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-816949:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.191s3( v 21580'38412 (18409'34614,21580'38412] local-lis/les=21582/21583 n=11915 ec=16301/247 lis/c=21582/20929 les/c/f=21583/20930/0 sis=21582) [31,16,13,18]/[31,16,13,34]p31(0) r=3 lpr=21582 pi=[20929,21582)/1 luod=0'0 crt=21580'38412 mlcod 0'0 active+remapped mbc={}]

|
| Time to complete -> 210 ms
|

2023-09-01T01:55:35.527+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c7d21af080 finish

6. While the above ops were in flight, additional peering and SubOpRead items were enqueued on op_wq(1), and the state is:

2023-09-01T01:55:35.722+0000 7fa3da245640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":8,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...

7. The above shows that SubOpReads are taking a significant amount of time to complete (200 ms - 500 ms).
At this point another SubOpRead gets dequeued:

2023-09-01T01:55:35.692+0000 7fa3d9a44640 10 osd.34 21584 dequeue_op 0x55c8277f7340 prio 5 cost 0 latency 0.455817 MOSDECSubOpRead(6.1cds2 21584/21582 ECSubRead(tid=241069, to_read={6:b38002c2:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-1669911:head=0,4194304,0}, subchunks={6:b38002c2:::dbfcb333-d221-4952-8c94-66d33d7d1d79.25909.2_file-64K-1669911:head=[0,1]}, attrs_to_read=)) v3 pg pg[6.1cds2( v 21580'39273 (18396'34976,21580'39273] local-lis/les=21582/21583 n=11710 ec=16613/247 lis/c=21582/21558 les/c/f=21583/21559/0 sis=21582) [27,5,22,4]/[27,5,34,4]p27(0) r=2 lpr=21582 pi=[21558,21582)/1 luod=0'0 crt=21580'39273 mlcod 0'0 active+remapped mbc={}]

8. At the same time, the osdmap is advanced, as a result of which a number of peering events (NullEvt) are generated:

2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 OSD::ms_dispatch: osd_map(21584..21585 src has 21043..21585) v4
2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 do_waiters -- start
2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 do_waiters -- finish
2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 _dispatch 0x55c7e2c288c0 osd_map(21584..21585 src has 21043..21585) v4
2023-09-01T01:55:35.827+0000 7fa3e725f640  3 osd.34 21584 handle_osd_map epochs [21584,21585], i have 21584, src has [21043,21585]
2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 handle_osd_map  got inc map for epoch 21585
2023-09-01T01:55:35.827+0000 7fa3e725f640 10 osd.34 21584 add_map_bl 21584 27223 bytes
2023-09-01T01:55:35.827+0000 7fa3e725f640 20 osd.34 21584 got_full_map 21585, nothing requested
2023-09-01T01:55:35.828+0000 7fa3e725f640 30 osd.34 21584 get_map 21584 -cached
2023-09-01T01:55:35.828+0000 7fa3e725f640 20 osd.34 21584 handle_osd_map pg_num_history pg_num_history(e21585 pg_nums {1={55=1},2={109=32},3={109=32},4={111=32},5={111=32},6={16915=512},7={416=32}} deleted_pools )

...

2023-09-01T01:55:35.829+0000 7fa3e3257640 10 osd.34 21584  advance to epoch 21585 (<= last 21585 <= newest_map 21585)
2023-09-01T01:55:35.829+0000 7fa3e3257640 30 osd.34 21584 get_map 21585 -cached
2023-09-01T01:55:35.829+0000 7fa3e3257640  7 osd.34 21585 consume_map version 21585
...
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.18fs3
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.199s2
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.1ads0
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34:4.consume_map consume_map 6.2cs0
2023-09-01T01:55:35.829+0000 7fa3e3257640 10 osd.34 21585 release_reserved_pushes(0), recovery_ops_reserved 3 -> 3    --> recovery_op_reserved is at the max level
2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 _recover_now active 0 + reserved 3 >= max 3
2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1c2s1 epoch_sent: 21585 epoch_requested: 21585 NullEvt
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34 op_wq(0) _enqueue OpSchedulerItem(6.1c2s1 PGPeeringEvent(epoch_sent: 21585 epoch_requested: 21585 NullEvt) class_id 2 prio 255 cost 10 e21585)
2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1eas2 epoch_sent: 21585 epoch_requested: 21585 NullEvt
2023-09-01T01:55:35.829+0000 7fa3e3257640 20 osd.34 op_wq(0) _enqueue OpSchedulerItem(6.1eas2 PGPeeringEvent(epoch_sent: 21585 epoch_requested: 21585 NullEvt) class_id 2 prio 255 cost 10 e21585)
2023-09-01T01:55:35.829+0000 7fa3e3257640 15 osd.34 21585 enqueue_peering_evt 6.1d6s1 epoch_sent: 21585 epoch_requested: 21585 NullEvt

Due to the above, the immediate queue size increased significantly (to 19), with recovery_ops_reserved still at its maximum of 3:

2023-09-01T01:55:35.966+0000 7fa3d8a42640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":19,"scheduler":1},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:1 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...
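The "max 3" in the _recover_now check above comes from the OSD's recovery limits; for reference, they can be checked as shown below (illustrative; 3 is assumed to be the HDD default on this release):

# ceph config get osd.34 osd_recovery_max_active        # 0 means use the per-device-class value
# ceph config get osd.34 osd_recovery_max_active_hdd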

9. The SubOpRead that was dequeued before the AdvMap completed around this time (time to finish: ~340 ms):

2023-09-01T01:55:36.035+0000 7fa3d9a44640 10 osd.34 21585 dequeue_op 0x55c8277f7340 finish

Note that the PGRecovery item is still in the scheduler queue. The above cycle repeats and the
recovery item is never dequeued, resulting in backfill operations getting stuck indefinitely.

Even towards the end of the logs on osd.34, the PGRecovery item can still be seen in the scheduler queue
as shown below:

2023-09-01T01:56:11.125+0000 7fa3d8a42640 30 dequeue status:
...
"OSD:ShardedOpWQ:1":{"queue_sizes":{"immediate":10,"scheduler":2},"mClockClients":{"client_count":3,"clients":"{ PriorityQueue::  { client:{ class_id: 1 client_id: 0 profile_id: 0 }, record:{ ClientRec:: client:{ class_id: 1 client_id: 0 profile_id: 0 } prev_tag:{ RequestTag:: ready:false r:0.0000 p:909382.4800 l:533335.0377 } req_count:2 top_req:{ ClientReq:: tag:{ RequestTag:: ready:false r:max p:909382.4800 l:533335.0377 } client:{ class_id: 1 client_id: 0 profile_id: 0 } } } }
...

This is ~36 secs since the PGRecovery item was enqueued!

The above results in client timeouts and slow ops being reported in the cluster logs, since IOs
cannot be serviced while the primary is not backfilled.

ISSUE SUMMARY:

This is a case where backfill is stuck because:
  1. The backfill target is for the primary OSD, i.e. the new up OSD replacing the acting primary (see the pg query example after this list).
  2. There are slow SubOpReads issued by other PGs, during which the OSD shard waits.
  3. The PGRecovery item is continuously preempted by peering events generated as part of AdvMap, in addition to the SubOpReads issued by other PGs. This is a normal sequence, but it resulted in the starvation of the items in the mClock queue.
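Point 1 can be confirmed from the attached pg query output; the same information (up/acting sets and backfill targets) can also be pulled live with the standard pg query command, for example:

# ceph pg 6.15a query > pg_6.15a_query.json    # inspect the up/acting sets and backfill target(s)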

POSSIBLE SOLUTION:

In this situation, where the primary itself is being removed and has a backfill target, the PGRecovery item should probably be given high priority so that the mClock scheduler can schedule it as early as possible.

  1. Pull in the changes made for tracker https://tracker.ceph.com/issues/62293. This is because the op queue dump in some cases shows more than the expected number of mClock clients (3). The fix ensures that ops of a given service type (client, background recovery or best effort) with different client_ids are bucketed into the relevant scheduler class by ignoring the client_ids. This guarantees that a scheduler class doesn't use more than its allocated share of the OSD IOPS capacity.
  2. Backport the high-priority queue implementation already available in upstream Reef.
  3. Come up with a solution that considers the case where the primary itself has a backfill target and assigns a high priority to the PGRecovery item. This will ensure that the recovery item gets picked up as soon as possible, allowing backfill to progress according to the mClock profile settings.

The above requires further discussion.
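For context while the above is discussed, the scheduler and profile that the proposed change would interact with can be checked on the affected OSD (illustrative; standard config commands):

# ceph config get osd.34 osd_op_queue          # expected to report mclock_scheduler
# ceph config get osd.34 osd_mclock_profile    # e.g. balanced / high_client_ops / high_recovery_ops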
