Bug #58498

ceph: pgs stuck backfilling

Added by Samuel Just about 1 year ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Observed on the LRC cluster after resolving https://tracker.ceph.com/issues/58460

  cluster:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_WARN
            Degraded data redundancy: 145548/540755983 objects degraded (0.027%), 2 pgs degraded, 2 pgs undersized
            1470 pgs not deep-scrubbed in time
            1470 pgs not scrubbed in time

  services:
    mon:         5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 3d)
    mgr:         reesi006.erytot(active, since 3w), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds:         4/4 daemons up, 5 standby, 1 hot standby
    osd:         166 osds: 166 up (since 5h), 166 in (since 13d); 16 remapped pgs
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 4 portals active (4 hosts)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 112.97M objects, 127 TiB
    usage:   224 TiB used, 836 TiB / 1.0 PiB avail
    pgs:     145548/540755983 objects degraded (0.027%)
             665241/540755983 objects misplaced (0.123%)
             2942 active+clean
             14   active+remapped+backfilling
             7    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfilling

  io:
    client:   177 KiB/s rd, 4.7 MiB/s wr, 6 op/s rd, 28 op/s wr
    recovery: 0 B/s, 6 objects/s

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 2h)

For most of the pgs, recovery is not proceeding at all.

sjust@reesi002:~/2023-01-18-lrc-investigation/reesi002$ sudo ceph pg dump | grep backfilling | grep -v 'dumped all'                                                                                                                                                                                                                                                                                                                                                                           
0.24d       2540                   0         0       2540        0    9184352048            0           0   2664      2664                      active+remapped+backfilling  2023-01-02T16:08:14.354424+0000   10675336'5920294  10682782:18541558                [44,98,58]          44                [41,98,58]              41   10675336'5920294  2023-01-02T11:11:45.415900+0000   10675336'5920294  2022-12-31T04:42:51.776041+0000              0                    1  queued for deep scrub                                                        2540                0
119.25d    65450                   0         0      60524        0  118040914986            0           0   7635      7635                      active+remapped+backfilling  2023-01-02T16:08:14.161260+0000   10682783'3606255  10682783:10902266    [27,161,89,120,72,125]          27    [41,161,89,120,72,125]              41   10676278'3594937  2022-12-31T01:03:38.315674+0000   10675860'3589907  2022-12-27T05:27:21.909004+0000              0                   13  queued for deep scrub                                                       59930                0
0.1a7       2547                   0         0       2277        0    9240477039            0           0   2708      2708                      active+remapped+backfilling  2023-01-02T16:08:17.214871+0000   10675299'6355360  10682782:17752489              [42,134,131]          42              [41,134,131]              41   10675299'6355360  2023-01-02T05:19:35.110378+0000   10675299'6355360  2023-01-02T05:19:35.110378+0000              0                  254  queued for deep scrub                                                        2547                0
0.19d       2698                   0         0       2418        0    9774354741            0           0   2658      2658                      active+remapped+backfilling  2023-01-02T16:08:14.289081+0000   10675706'6186906  10682782:17607234              [42,152,110]          42              [41,152,110]              41   10675706'6186906  2023-01-01T17:59:01.366870+0000   10675706'6186906  2022-12-31T14:38:58.383181+0000              0                    1  queued for deep scrub                                                        2698                0
119.d3     65969                   0         0     193815        0  117703625513            0           0   7607      7607                      active+remapped+backfilling  2023-01-02T16:08:23.978465+0000   10682783'3611895  10682783:11308049     [44,118,94,111,89,14]          44     [41,49,94,111,89,118]              41   10676765'3601252  2023-01-01T21:58:55.209069+0000   10676765'3601252  2023-01-01T21:58:55.209069+0000              0                 1345  queued for deep scrub                                                       60475                0
114.d6    152158                   0    136190          0        0      21296461            0           0   3197      3197  active+undersized+degraded+remapped+backfilling  2023-01-18T23:08:22.508697+0000   10682783'8022177  10682783:16372025              [59,118,128]          59                 [128,118]             128   10676768'8007280  2023-01-01T22:12:37.218704+0000   10676100'8002894  2022-12-28T20:25:52.849234+0000              0                   21  queued for deep scrub                                                      138310                0
119.1b     65967                   0         0      65967        0  119233083117            0           0   7638      7638                      active+remapped+backfilling  2023-01-02T16:08:18.617519+0000   10682782'3597673  10682782:10560296    [27,161,114,29,134,56]          27    [41,161,114,29,134,56]              41   10676772'3586960  2023-01-01T23:15:29.597688+0000   10676242'3586378  2022-12-30T01:35:48.857076+0000              0                   11  queued for deep scrub                                                       60571                0
119.1c     65933                   0         0      65626        0  119149682806            0           0   7545      7545                      active+remapped+backfilling  2023-01-02T16:08:14.309859+0000   10682782'3598669  10682782:11055329      [24,59,69,16,127,68]          24      [41,59,69,16,127,68]              41   10676872'3587860  2023-01-02T15:23:00.227516+0000   10676872'3587860  2023-01-02T15:23:00.227516+0000              0                 1960  queued for deep scrub                                                       60925                0
114.70    152214                   0      9019          0        0      50474801            0           0   2552      2552  active+undersized+degraded+remapped+backfilling  2023-01-18T23:08:22.515880+0000   10682783'7970909  10682783:17076347              [49,161,119]          49                 [119,161]             119   10676589'7955714  2023-01-01T10:06:25.077020+0000   10676452'7955711  2022-12-31T09:54:07.920177+0000              0                   21  queued for deep scrub                                                      137979                0
119.a2     65720                   0         0      65720        0  118804757552            0           0   7590      7590                      active+remapped+backfilling  2023-01-02T16:08:24.588335+0000   10682782'3613140   10682782:9884637      [44,117,63,18,30,77]          44      [41,117,63,18,30,77]              41   10676728'3602120  2023-01-01T16:44:11.572865+0000   10676728'3602120  2023-01-01T16:44:11.572865+0000              0                 1969  queued for deep scrub                                                       60092                0
0.ee        2700                   0         0       4900        0    9716980648            0           0   2711      2711                      active+remapped+backfilling  2023-01-02T16:08:26.105168+0000   10675306'5989142  10682782:17648413               [90,71,160]          90                [41,71,51]              41   10675306'5989142  2023-01-02T03:57:15.992196+0000   10675306'5989142  2023-01-02T03:57:15.992196+0000              0                  203  queued for deep scrub                                                        2700                0
119.13e    65851                   0         0      65573        0  118382459569            0           0   7567      7567                      active+remapped+backfilling  2023-01-02T16:08:14.310947+0000   10682783'3599482  10682783:11087707      [44,61,123,12,48,28]          44      [41,61,123,12,48,28]              41   10676278'3588377  2022-12-31T04:50:40.590459+0000   10676237'3588314  2022-12-29T21:44:43.925143+0000              0                   11  queued for deep scrub                                                       60305                0
0.310       2534                   0         0       2534        0    9257925379            0           0   2680      2680                      active+remapped+backfilling  2023-01-02T16:08:14.354476+0000   10675347'6226589  10682782:17549076               [24,76,155]          24               [41,76,155]              41   10675347'6226589  2023-01-02T02:36:57.124264+0000   10675347'6226589  2022-12-29T13:17:02.127406+0000              0                    1  queued for deep scrub                                                        2534                0
119.31a    65464                   0         0      65464        0  117938540621            0           0   7589      7589                      active+remapped+backfilling  2023-01-02T16:08:14.354365+0000   10682783'3605046  10682783:11453191       [43,160,5,89,16,72]          43       [41,160,5,89,16,72]              41   10676772'3594695  2023-01-01T23:23:55.938149+0000   10676278'3594272  2022-12-31T21:12:35.209263+0000              0                   11  queued for deep scrub                                                       60198                0
dumped all
119.304    65530                   0         0      65530        0  119050260451            0           0   7634      7634                      active+remapped+backfilling  2023-01-02T16:08:20.108832+0000   10682783'3618838  10682783:10395163         [26,21,68,3,7,84]          26         [41,21,68,3,7,84]              41   10676768'3607766  2023-01-01T22:12:15.129272+0000   10675906'3603297  2022-12-28T06:31:23.333029+0000              0                   11  queued for deep scrub                                                       60027                0
0.3c8       2614                   0         0       2354        0    9288911888            0           0   2667      2667                      active+remapped+backfilling  2023-01-02T16:08:14.400845+0000   10675405'6245090  10682782:18518475                [42,61,90]          42                [41,61,90]              41   10675405'6245090  2023-01-01T03:27:31.597761+0000   10675405'6245090  2022-12-31T01:35:47.818437+0000              0                    1  queued for deep scrub                             

Note that all but 2 have 41 as the primary. Earlier in the day, there were another 16 with 52 as the primary. Marking osd.52 down resolved those stuck pgs. At the time, ceph pg <pgid> query hung for the pgs with 52 as the primary; queries for the ones with 41 as the primary, however, do not hang. Attached is a file with the pg query results for those pgs.

pg_query_output.json.gz - pg query output for pgs stuck, mainly with 41 as primary (279 KB) Samuel Just, 01/19/2023 04:29 AM

osd_128_from_file_attempt_2.log.xz (456 KB) Samuel Just, 01/19/2023 09:09 PM


Related issues

Related to RADOS - Bug #58529: osd: very slow recovery due to delayed push reply messages Resolved

History

#1 Updated by Samuel Just about 1 year ago

  • File osd_128.log.xz added

One pg, the one on osd.128, continued to hang after marking the osd down. I enabled logging on the primary and marked it down again. Here's the log from that period.

#2 Updated by Samuel Just about 1 year ago

  • File osd_128_from_file_114_d6.log.xz added

#3 Updated by Samuel Just about 1 year ago

  • File deleted (osd_128.log.xz)

#4 Updated by Samuel Just about 1 year ago

  • File deleted (osd_128_from_file_114_d6.log.xz)

#5 Updated by Samuel Just about 1 year ago

Attached logs from osd.128 covering the period from before marking it down until after backfill appears to stall. (Deleted the earlier attached logs, which came from journald and had dropped the important log lines.)

#6 Updated by Samuel Just about 1 year ago

Re-downed osd.128 with the op tracker enabled:

sjust@reesi006:~/2023-01-18-lrc-investigation/reesi006$ sudo ceph daemon /var/run/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/ceph-osd.128.asok dump_ops_in_flight                                                                                                                                                                                                                                                                                                                              
{                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
    "ops": [
        {
            "description": "MOSDPGPushReply(114.d6 10682941/10682940 [PushReplyOp(114:6b6fbf02:::10000182a50.00000000:head),PushReplyOp(114:6b6fbf03:::100283d58f9.00000000:head),PushReplyOp(114:6b6fbf54:::10026df4607.00000000:head),PushReplyOp(114:6b6fbf5d:::10025cf49f0.00000000:head),PushReplyOp(114:6b6fbf63:::10028320adc.00000000:head)])",
            "initiated_at": "2023-01-19T21:17:59.924880+0000",
            "age": 1.9357628600000001,
            "duration": 1.935820858,
            "type_data": {
                "flag_point": "queued for pg",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 0
                    },
                    {
                        "event": "header_read",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "all_read",
                        "time": "2023-01-19T21:17:59.924889+0000",
                        "duration": 9.5200000000000003e-06
                    },
                    {
                        "event": "dispatched",
                        "time": "2023-01-19T21:17:59.924890+0000",
                        "duration": 4.7700000000000005e-07
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2023-01-19T21:17:59.924894+0000",
                        "duration": 4.5070000000000002e-06
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}

Each recovery reply is being delayed by about 5 seconds.

#7 Updated by Samuel Just about 1 year ago

mclock really shouldn't be delaying replies.

#8 Updated by Sridhar Seshasayee about 1 year ago

I took a look at the logs for pg 114.d6 attached to this tracker. The cost for the push replies is calculated to be over 80_M! I think this is calculated by compute_cost(). The base cost is 8_M, so this indicates multiple replies being sent as part of a single MOSDPGPushReply. The corresponding qos_cost is determined to be 436, as shown below for a PGRecoveryMsg. This is quite high, and in general, the higher the cost, the longer an item stays in the mclock queue. Each item stays in the mclock queue for a few seconds (4-6 s), based on the osd capacity and the allocations set for recovery ops. Shown below is one such log line for a push reply item:

2023-01-19T20:59:49.909+0000 7f2f370cb700 20 osd.128 op_wq(4) _process OpSchedulerItem(114.d6 PGRecoveryMsg(op=MOSDPGPushReply(114.d6 10682931/10682929 [PushReplyOp(114:6b6e3b2f:::1001e1b90fb.00000000:head),PushReplyOp(114:6b6e3bff:::10011ed22d6.00000000:head),PushReplyOp(114:6b6e3d00:::10028d57803.00000000:head),PushReplyOp(114:6b6e3d4c:::10028375b85.00000000:head),PushReplyOp(114:6b6e3e50:::10028a7159b.00000000:head)]) v3) class_id 0 qos_cost 436 cost 83891080 e10682931) queued

The MOSDPGPushReply message maintains a vector of PushReplyOp (replies). This vector is used to compute the overall cost, which adds up to over 80_M:

void compute_cost(CephContext *cct) {
  cost = 0;
  // Sum the cost of every PushReplyOp carried by this MOSDPGPushReply.
  for (auto i = replies.begin(); i != replies.end(); ++i) {
    cost += i->cost(cct);
  }
}
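
To tie those numbers together, here is a back-of-the-envelope check derived purely from the logged totals above; it does not reproduce the actual per-op cost formula:

// Back-of-the-envelope check using only the values from the OpSchedulerItem log line above.
// Derived arithmetic, not the actual PushReplyOp cost formula.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t total_cost  = 83891080;        // "cost" reported for the MOSDPGPushReply item
  const uint64_t num_replies = 5;               // PushReplyOps listed in that message
  const uint64_t base_cost   = 8 * 1024 * 1024; // the 8_M base cost mentioned above

  const uint64_t per_reply = total_cost / num_replies; // 16778216 bytes
  std::cout << "per-reply cost: " << per_reply << " bytes (~"
            << per_reply / base_cost << "x the 8_M base cost)" << std::endl;
  return 0;
}

So each of the 5 aggregated replies is being costed at roughly 16.8 MB, which is how a single MOSDPGPushReply ends up with a cost of over 80_M.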

Looking further, I also see:

 2023-01-19T21:00:38.358+0000 7f2f370cb700 10 osd.128 10682935 _maybe_queue_recovery starting 5, recovery_ops_reserved 0 -> 5

The above indicates that `osd_recovery_max_single_start` is set to 5 (default: 1), which is probably what is inflating the cost of the push reply ops, since that many PushReplyOp costs are aggregated into a single MOSDPGPushReply item.

As an immediate solution, `osd_recovery_max_single_start` can be set back to 1, which should help the push reply ops get scheduled faster.
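
As a rough illustration of the expected effect (a sketch assuming, as compute_cost() above suggests, that the message cost scales linearly with the number of aggregated PushReplyOps):

// Sketch of the expected MOSDPGPushReply cost with osd_recovery_max_single_start
// reduced from 5 back to 1, assuming cost scales linearly with the number of
// aggregated PushReplyOps. Illustrative only.
#include <cstdint>
#include <iostream>

int main() {
  const uint64_t per_reply_cost = 83891080 / 5;       // ~16.8 MB, derived from the log above
  const uint64_t cost_with_5    = 5 * per_reply_cost; // current: osd_recovery_max_single_start = 5
  const uint64_t cost_with_1    = 1 * per_reply_cost; // proposed: osd_recovery_max_single_start = 1
  std::cout << "cost with 5 replies per message: " << cost_with_5 << " bytes\n"
            << "cost with 1 reply per message:   " << cost_with_1 << " bytes" << std::endl;
  return 0;
}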

Longer term we have a few alternatives:

  1. Prevent modification of `osd_recovery_max_single_start`, similar to the way we prevent modification of the various sleep options with mclock enabled.
  2. Audit the costs for the various background ops and modify them to work well with mclock, while of course maintaining backward compatibility with the 'wpq' scheduler.
  3. Treat reply ops as higher priority and put them into a higher-priority queue (currently being implemented). The higher-priority queue is not managed by mclock, and therefore these ops can be completed faster.

The above can be discussed and the best approach adopted.

#9 Updated by Samuel Just about 1 year ago

  • Related to Bug #58529: osd: very slow recovery due to delayed push reply messages added

#10 Updated by Samuel Just about 1 year ago

I opened https://tracker.ceph.com/issues/58529 to track the slow push reply behavior since it's probably unrelated to the original hang. See my comment there.
