Bug #58498: ceph: pgs stuck backfilling (open)

Added by Samuel Just over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Observed on the LRC cluster after resolving https://tracker.ceph.com/issues/58460

  cluster:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_WARN
            Degraded data redundancy: 145548/540755983 objects degraded (0.027%), 2 pgs degraded, 2 pgs undersized
            1470 pgs not deep-scrubbed in time
            1470 pgs not scrubbed in time

  services:
    mon:         5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 3d)
    mgr:         reesi006.erytot(active, since 3w), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds:         4/4 daemons up, 5 standby, 1 hot standby
    osd:         166 osds: 166 up (since 5h), 166 in (since 13d); 16 remapped pgs
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 4 portals active (4 hosts)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 112.97M objects, 127 TiB
    usage:   224 TiB used, 836 TiB / 1.0 PiB avail
    pgs:     145548/540755983 objects degraded (0.027%)
             665241/540755983 objects misplaced (0.123%)
             2942 active+clean
             14   active+remapped+backfilling
             7    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfilling

  io:
    client:   177 KiB/s rd, 4.7 MiB/s wr, 6 op/s rd, 28 op/s wr
    recovery: 0 B/s, 6 objects/s

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 2h)

For most of the pgs, recovery is not proceeding at all.

sjust@reesi002:~/2023-01-18-lrc-investigation/reesi002$ sudo ceph pg dump | grep backfilling | grep -v 'dumped all'                                                                                                                                                                                                                                                                                                                                                                           
0.24d       2540                   0         0       2540        0    9184352048            0           0   2664      2664                      active+remapped+backfilling  2023-01-02T16:08:14.354424+0000   10675336'5920294  10682782:18541558                [44,98,58]          44                [41,98,58]              41   10675336'5920294  2023-01-02T11:11:45.415900+0000   10675336'5920294  2022-12-31T04:42:51.776041+0000              0                    1  queued for deep scrub                                                        2540                0
119.25d    65450                   0         0      60524        0  118040914986            0           0   7635      7635                      active+remapped+backfilling  2023-01-02T16:08:14.161260+0000   10682783'3606255  10682783:10902266    [27,161,89,120,72,125]          27    [41,161,89,120,72,125]              41   10676278'3594937  2022-12-31T01:03:38.315674+0000   10675860'3589907  2022-12-27T05:27:21.909004+0000              0                   13  queued for deep scrub                                                       59930                0
0.1a7       2547                   0         0       2277        0    9240477039            0           0   2708      2708                      active+remapped+backfilling  2023-01-02T16:08:17.214871+0000   10675299'6355360  10682782:17752489              [42,134,131]          42              [41,134,131]              41   10675299'6355360  2023-01-02T05:19:35.110378+0000   10675299'6355360  2023-01-02T05:19:35.110378+0000              0                  254  queued for deep scrub                                                        2547                0
0.19d       2698                   0         0       2418        0    9774354741            0           0   2658      2658                      active+remapped+backfilling  2023-01-02T16:08:14.289081+0000   10675706'6186906  10682782:17607234              [42,152,110]          42              [41,152,110]              41   10675706'6186906  2023-01-01T17:59:01.366870+0000   10675706'6186906  2022-12-31T14:38:58.383181+0000              0                    1  queued for deep scrub                                                        2698                0
119.d3     65969                   0         0     193815        0  117703625513            0           0   7607      7607                      active+remapped+backfilling  2023-01-02T16:08:23.978465+0000   10682783'3611895  10682783:11308049     [44,118,94,111,89,14]          44     [41,49,94,111,89,118]              41   10676765'3601252  2023-01-01T21:58:55.209069+0000   10676765'3601252  2023-01-01T21:58:55.209069+0000              0                 1345  queued for deep scrub                                                       60475                0
114.d6    152158                   0    136190          0        0      21296461            0           0   3197      3197  active+undersized+degraded+remapped+backfilling  2023-01-18T23:08:22.508697+0000   10682783'8022177  10682783:16372025              [59,118,128]          59                 [128,118]             128   10676768'8007280  2023-01-01T22:12:37.218704+0000   10676100'8002894  2022-12-28T20:25:52.849234+0000              0                   21  queued for deep scrub                                                      138310                0
119.1b     65967                   0         0      65967        0  119233083117            0           0   7638      7638                      active+remapped+backfilling  2023-01-02T16:08:18.617519+0000   10682782'3597673  10682782:10560296    [27,161,114,29,134,56]          27    [41,161,114,29,134,56]              41   10676772'3586960  2023-01-01T23:15:29.597688+0000   10676242'3586378  2022-12-30T01:35:48.857076+0000              0                   11  queued for deep scrub                                                       60571                0
119.1c     65933                   0         0      65626        0  119149682806            0           0   7545      7545                      active+remapped+backfilling  2023-01-02T16:08:14.309859+0000   10682782'3598669  10682782:11055329      [24,59,69,16,127,68]          24      [41,59,69,16,127,68]              41   10676872'3587860  2023-01-02T15:23:00.227516+0000   10676872'3587860  2023-01-02T15:23:00.227516+0000              0                 1960  queued for deep scrub                                                       60925                0
114.70    152214                   0      9019          0        0      50474801            0           0   2552      2552  active+undersized+degraded+remapped+backfilling  2023-01-18T23:08:22.515880+0000   10682783'7970909  10682783:17076347              [49,161,119]          49                 [119,161]             119   10676589'7955714  2023-01-01T10:06:25.077020+0000   10676452'7955711  2022-12-31T09:54:07.920177+0000              0                   21  queued for deep scrub                                                      137979                0
119.a2     65720                   0         0      65720        0  118804757552            0           0   7590      7590                      active+remapped+backfilling  2023-01-02T16:08:24.588335+0000   10682782'3613140   10682782:9884637      [44,117,63,18,30,77]          44      [41,117,63,18,30,77]              41   10676728'3602120  2023-01-01T16:44:11.572865+0000   10676728'3602120  2023-01-01T16:44:11.572865+0000              0                 1969  queued for deep scrub                                                       60092                0
0.ee        2700                   0         0       4900        0    9716980648            0           0   2711      2711                      active+remapped+backfilling  2023-01-02T16:08:26.105168+0000   10675306'5989142  10682782:17648413               [90,71,160]          90                [41,71,51]              41   10675306'5989142  2023-01-02T03:57:15.992196+0000   10675306'5989142  2023-01-02T03:57:15.992196+0000              0                  203  queued for deep scrub                                                        2700                0
119.13e    65851                   0         0      65573        0  118382459569            0           0   7567      7567                      active+remapped+backfilling  2023-01-02T16:08:14.310947+0000   10682783'3599482  10682783:11087707      [44,61,123,12,48,28]          44      [41,61,123,12,48,28]              41   10676278'3588377  2022-12-31T04:50:40.590459+0000   10676237'3588314  2022-12-29T21:44:43.925143+0000              0                   11  queued for deep scrub                                                       60305                0
0.310       2534                   0         0       2534        0    9257925379            0           0   2680      2680                      active+remapped+backfilling  2023-01-02T16:08:14.354476+0000   10675347'6226589  10682782:17549076               [24,76,155]          24               [41,76,155]              41   10675347'6226589  2023-01-02T02:36:57.124264+0000   10675347'6226589  2022-12-29T13:17:02.127406+0000              0                    1  queued for deep scrub                                                        2534                0
119.31a    65464                   0         0      65464        0  117938540621            0           0   7589      7589                      active+remapped+backfilling  2023-01-02T16:08:14.354365+0000   10682783'3605046  10682783:11453191       [43,160,5,89,16,72]          43       [41,160,5,89,16,72]              41   10676772'3594695  2023-01-01T23:23:55.938149+0000   10676278'3594272  2022-12-31T21:12:35.209263+0000              0                   11  queued for deep scrub                                                       60198                0
dumped all
119.304    65530                   0         0      65530        0  119050260451            0           0   7634      7634                      active+remapped+backfilling  2023-01-02T16:08:20.108832+0000   10682783'3618838  10682783:10395163         [26,21,68,3,7,84]          26         [41,21,68,3,7,84]              41   10676768'3607766  2023-01-01T22:12:15.129272+0000   10675906'3603297  2022-12-28T06:31:23.333029+0000              0                   11  queued for deep scrub                                                       60027                0
0.3c8       2614                   0         0       2354        0    9288911888            0           0   2667      2667                      active+remapped+backfilling  2023-01-02T16:08:14.400845+0000   10675405'6245090  10682782:18518475                [42,61,90]          42                [41,61,90]              41   10675405'6245090  2023-01-01T03:27:31.597761+0000   10675405'6245090  2022-12-31T01:35:47.818437+0000              0                    1  queued for deep scrub                             

Note that all but 2 of these pgs have osd.41 as the acting primary. Earlier in the day there were another 16 stuck pgs with osd.52 as the primary; marking osd.52 down resolved those. At the time, ceph pg <pgid> query hung for the pgs with osd.52 as the primary, but it does not hang for the ones with osd.41 as the primary. Attached is a file with the pg query results for those pgs.
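
For reference, the kind of commands involved look roughly like this (a sketch, not necessarily the exact invocations used on the LRC; the pgid is just one of the stuck pgs from the dump above, and the pgs_brief column layout may vary between releases):

  # Count acting primaries among the currently backfilling pgs
  # (pgs_brief rows end with the acting primary).
  sudo ceph pg dump pgs_brief | grep backfilling | awk '{print $NF}' | sort -n | uniq -c

  # Query one of the stuck pgs; this hung while osd.52 was the primary,
  # but returns normally for the pgs with osd.41 as the primary.
  sudo ceph pg 119.25d query

  # Mark the suspect primary down so the affected pgs re-peer; this is
  # what cleared the earlier set of pgs stuck behind osd.52.
  sudo ceph osd down osd.52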


Files

pg_query_output.json.gz (279 KB) - pg query output for pgs stuck, mainly with 41 as primary - Samuel Just, 01/19/2023 04:29 AM
osd_128_from_file_attempt_2.log.xz (456 KB) - Samuel Just, 01/19/2023 09:09 PM

Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #58529: osd: very slow recovery due to delayed push reply messages (Resolved) - Sridhar Seshasayee
