Bug #58498
ceph: pgs stuck backfilling
Description
Observed on the LRC cluster after resolving https://tracker.ceph.com/issues/58460
  cluster:
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_WARN
            Degraded data redundancy: 145548/540755983 objects degraded (0.027%), 2 pgs degraded, 2 pgs undersized
            1470 pgs not deep-scrubbed in time
            1470 pgs not scrubbed in time

  services:
    mon:         5 daemons, quorum reesi003,reesi002,reesi001,ivan02,ivan01 (age 3d)
    mgr:         reesi006.erytot(active, since 3w), standbys: reesi005.xxyjcw, reesi004.tplfrt
    mds:         4/4 daemons up, 5 standby, 1 hot standby
    osd:         166 osds: 166 up (since 5h), 166 in (since 13d); 16 remapped pgs
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 4 portals active (4 hosts)

  data:
    volumes: 4/4 healthy
    pools:   24 pools, 2965 pgs
    objects: 112.97M objects, 127 TiB
    usage:   224 TiB used, 836 TiB / 1.0 PiB avail
    pgs:     145548/540755983 objects degraded (0.027%)
             665241/540755983 objects misplaced (0.123%)
             2942 active+clean
             14   active+remapped+backfilling
             7    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfilling

  io:
    client:   177 KiB/s rd, 4.7 MiB/s wr, 6 op/s rd, 28 op/s wr
    recovery: 0 B/s, 6 objects/s

  progress:
    Global Recovery Event (2w)
      [===========================.] (remaining: 2h)
For most of the pgs, recovery is not proceeding at all.
sjust@reesi002:~/2023-01-18-lrc-investigation/reesi002$ sudo ceph pg dump | grep backfilling | grep -v 'dumped all'
0.24d 2540 0 0 2540 0 9184352048 0 0 2664 2664 active+remapped+backfilling 2023-01-02T16:08:14.354424+0000 10675336'5920294 10682782:18541558 [44,98,58] 44 [41,98,58] 41 10675336'5920294 2023-01-02T11:11:45.415900+0000 10675336'5920294 2022-12-31T04:42:51.776041+0000 0 1 queued for deep scrub 2540 0
119.25d 65450 0 0 60524 0 118040914986 0 0 7635 7635 active+remapped+backfilling 2023-01-02T16:08:14.161260+0000 10682783'3606255 10682783:10902266 [27,161,89,120,72,125] 27 [41,161,89,120,72,125] 41 10676278'3594937 2022-12-31T01:03:38.315674+0000 10675860'3589907 2022-12-27T05:27:21.909004+0000 0 13 queued for deep scrub 59930 0
0.1a7 2547 0 0 2277 0 9240477039 0 0 2708 2708 active+remapped+backfilling 2023-01-02T16:08:17.214871+0000 10675299'6355360 10682782:17752489 [42,134,131] 42 [41,134,131] 41 10675299'6355360 2023-01-02T05:19:35.110378+0000 10675299'6355360 2023-01-02T05:19:35.110378+0000 0 254 queued for deep scrub 2547 0
0.19d 2698 0 0 2418 0 9774354741 0 0 2658 2658 active+remapped+backfilling 2023-01-02T16:08:14.289081+0000 10675706'6186906 10682782:17607234 [42,152,110] 42 [41,152,110] 41 10675706'6186906 2023-01-01T17:59:01.366870+0000 10675706'6186906 2022-12-31T14:38:58.383181+0000 0 1 queued for deep scrub 2698 0
119.d3 65969 0 0 193815 0 117703625513 0 0 7607 7607 active+remapped+backfilling 2023-01-02T16:08:23.978465+0000 10682783'3611895 10682783:11308049 [44,118,94,111,89,14] 44 [41,49,94,111,89,118] 41 10676765'3601252 2023-01-01T21:58:55.209069+0000 10676765'3601252 2023-01-01T21:58:55.209069+0000 0 1345 queued for deep scrub 60475 0
114.d6 152158 0 136190 0 0 21296461 0 0 3197 3197 active+undersized+degraded+remapped+backfilling 2023-01-18T23:08:22.508697+0000 10682783'8022177 10682783:16372025 [59,118,128] 59 [128,118] 128 10676768'8007280 2023-01-01T22:12:37.218704+0000 10676100'8002894 2022-12-28T20:25:52.849234+0000 0 21 queued for deep scrub 138310 0
119.1b 65967 0 0 65967 0 119233083117 0 0 7638 7638 active+remapped+backfilling 2023-01-02T16:08:18.617519+0000 10682782'3597673 10682782:10560296 [27,161,114,29,134,56] 27 [41,161,114,29,134,56] 41 10676772'3586960 2023-01-01T23:15:29.597688+0000 10676242'3586378 2022-12-30T01:35:48.857076+0000 0 11 queued for deep scrub 60571 0
119.1c 65933 0 0 65626 0 119149682806 0 0 7545 7545 active+remapped+backfilling 2023-01-02T16:08:14.309859+0000 10682782'3598669 10682782:11055329 [24,59,69,16,127,68] 24 [41,59,69,16,127,68] 41 10676872'3587860 2023-01-02T15:23:00.227516+0000 10676872'3587860 2023-01-02T15:23:00.227516+0000 0 1960 queued for deep scrub 60925 0
114.70 152214 0 9019 0 0 50474801 0 0 2552 2552 active+undersized+degraded+remapped+backfilling 2023-01-18T23:08:22.515880+0000 10682783'7970909 10682783:17076347 [49,161,119] 49 [119,161] 119 10676589'7955714 2023-01-01T10:06:25.077020+0000 10676452'7955711 2022-12-31T09:54:07.920177+0000 0 21 queued for deep scrub 137979 0
119.a2 65720 0 0 65720 0 118804757552 0 0 7590 7590 active+remapped+backfilling 2023-01-02T16:08:24.588335+0000 10682782'3613140 10682782:9884637 [44,117,63,18,30,77] 44 [41,117,63,18,30,77] 41 10676728'3602120 2023-01-01T16:44:11.572865+0000 10676728'3602120 2023-01-01T16:44:11.572865+0000 0 1969 queued for deep scrub 60092 0
0.ee 2700 0 0 4900 0 9716980648 0 0 2711 2711 active+remapped+backfilling 2023-01-02T16:08:26.105168+0000 10675306'5989142 10682782:17648413 [90,71,160] 90 [41,71,51] 41 10675306'5989142 2023-01-02T03:57:15.992196+0000 10675306'5989142 2023-01-02T03:57:15.992196+0000 0 203 queued for deep scrub 2700 0
119.13e 65851 0 0 65573 0 118382459569 0 0 7567 7567 active+remapped+backfilling 2023-01-02T16:08:14.310947+0000 10682783'3599482 10682783:11087707 [44,61,123,12,48,28] 44 [41,61,123,12,48,28] 41 10676278'3588377 2022-12-31T04:50:40.590459+0000 10676237'3588314 2022-12-29T21:44:43.925143+0000 0 11 queued for deep scrub 60305 0
0.310 2534 0 0 2534 0 9257925379 0 0 2680 2680 active+remapped+backfilling 2023-01-02T16:08:14.354476+0000 10675347'6226589 10682782:17549076 [24,76,155] 24 [41,76,155] 41 10675347'6226589 2023-01-02T02:36:57.124264+0000 10675347'6226589 2022-12-29T13:17:02.127406+0000 0 1 queued for deep scrub 2534 0
119.31a 65464 0 0 65464 0 117938540621 0 0 7589 7589 active+remapped+backfilling 2023-01-02T16:08:14.354365+0000 10682783'3605046 10682783:11453191 [43,160,5,89,16,72] 43 [41,160,5,89,16,72] 41 10676772'3594695 2023-01-01T23:23:55.938149+0000 10676278'3594272 2022-12-31T21:12:35.209263+0000 0 11 queued for deep scrub 60198 0
dumped all
119.304 65530 0 0 65530 0 119050260451 0 0 7634 7634 active+remapped+backfilling 2023-01-02T16:08:20.108832+0000 10682783'3618838 10682783:10395163 [26,21,68,3,7,84] 26 [41,21,68,3,7,84] 41 10676768'3607766 2023-01-01T22:12:15.129272+0000 10675906'3603297 2022-12-28T06:31:23.333029+0000 0 11 queued for deep scrub 60027 0
0.3c8 2614 0 0 2354 0 9288911888 0 0 2667 2667 active+remapped+backfilling 2023-01-02T16:08:14.400845+0000 10675405'6245090 10682782:18518475 [42,61,90] 42 [41,61,90] 41 10675405'6245090 2023-01-01T03:27:31.597761+0000 10675405'6245090 2022-12-31T01:35:47.818437+0000 0 1 queued for deep scrub
Note that all but 2 have 41 as the primary. Earlier in the day, there were another 16 with 52 as the primary; marking osd.52 down resolved those stuck pgs. At the time, `ceph pg <pgid> query` hung for the pgs with 52 as the primary. Queries against the ones with 41 as the primary, however, do not hang. Attached is a file with the pg query results for those pgs.
Related issues
Related to Bug #58529: osd: very slow recovery due to delayed push reply messages
History
#1 Updated by Samuel Just about 1 year ago
- File osd_128.log.xz added
One pg, the one with osd.128 as the primary, continued to hang after marking the osd down. I enabled logging on the primary and marked it down again. Here's the log from that period.
#2 Updated by Samuel Just about 1 year ago
- File osd_128_from_file_114_d6.log.xz added
#3 Updated by Samuel Just about 1 year ago
- File deleted (osd_128.log.xz)
#4 Updated by Samuel Just about 1 year ago
- File deleted (osd_128_from_file_114_d6.log.xz)
#5 Updated by Samuel Just about 1 year ago
- File osd_128_from_file_attempt_2.log.xz added
Attached are logs from osd.128 spanning from before marking it down until after backfill appears to stall. (I deleted the earlier attached logs; they came from journald, which had dropped the important log lines.)
#6 Updated by Samuel Just about 1 year ago
Re-downed osd.128 with the op tracker enabled:
sjust@reesi006:~/2023-01-18-lrc-investigation/reesi006$ sudo ceph daemon /var/run/ceph/28f7427e-5558-4ffd-ae1a-51ec3042759a/ceph-osd.128.asok dump_ops_in_flight
{
    "ops": [
        {
            "description": "MOSDPGPushReply(114.d6 10682941/10682940 [PushReplyOp(114:6b6fbf02:::10000182a50.00000000:head),PushReplyOp(114:6b6fbf03:::100283d58f9.00000000:head),PushReplyOp(114:6b6fbf54:::10026df4607.00000000:head),PushReplyOp(114:6b6fbf5d:::10025cf49f0.00000000:head),PushReplyOp(114:6b6fbf63:::10028320adc.00000000:head)])",
            "initiated_at": "2023-01-19T21:17:59.924880+0000",
            "age": 1.9357628600000001,
            "duration": 1.935820858,
            "type_data": {
                "flag_point": "queued for pg",
                "events": [
                    {
                        "event": "initiated",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 0
                    },
                    {
                        "event": "throttled",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 0
                    },
                    {
                        "event": "header_read",
                        "time": "2023-01-19T21:17:59.924880+0000",
                        "duration": 4294967295.9999995
                    },
                    {
                        "event": "all_read",
                        "time": "2023-01-19T21:17:59.924889+0000",
                        "duration": 9.5200000000000003e-06
                    },
                    {
                        "event": "dispatched",
                        "time": "2023-01-19T21:17:59.924890+0000",
                        "duration": 4.7700000000000005e-07
                    },
                    {
                        "event": "queued_for_pg",
                        "time": "2023-01-19T21:17:59.924894+0000",
                        "duration": 4.5070000000000002e-06
                    }
                ]
            }
        }
    ],
    "num_ops": 1
}
Each recovery reply is being delayed about 5s.
#7 Updated by Samuel Just about 1 year ago
mclock really shouldn't be delaying replies.
#8 Updated by Sridhar Seshasayee about 1 year ago
I took a look at the logs for pg 114.d6 attached to this tracker. The cost for the push replies is calculated to be over 80_M! I think this is calculated by compute_cost(). The base cost is 8_M, so a total this large indicates multiple replies being sent as part of a single MOSDPGPushReply. The corresponding qos_cost is determined to be 436, as shown below for a PGRecoveryMsg. This is quite high, and in general, the higher the cost, the longer an item stays in the mclock queue. Each item stays in the mclock queue for a few seconds (4-6 secs) based on the osd capacity and the allocations set for recovery ops.
Shown below is one such log for a push reply item:
2023-01-19T20:59:49.909+0000 7f2f370cb700 20 osd.128 op_wq(4) _process OpSchedulerItem(114.d6 PGRecoveryMsg(op=MOSDPGPushReply(114.d6 10682931/10682929 [PushReplyOp(114:6b6e3b2f:::1001e1b90fb.00000000:head),PushReplyOp(114:6b6e3bff:::10011ed22d6.00000000:head),PushReplyOp(114:6b6e3d00:::10028d57803.00000000:head),PushReplyOp(114:6b6e3d4c:::10028375b85.00000000:head),PushReplyOp(114:6b6e3e50:::10028a7159b.00000000:head)]) v3) class_id 0 qos_cost 436 cost 83891080 e10682931) queued
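To give a feel for how a qos_cost of 436 becomes a multi-second delay, here is a rough dmClock-style back-of-the-envelope model. This is not Ceph source; the recovery reservation rate is an assumed number chosen purely for illustration.

// Rough dmClock-style model, NOT Ceph source. The recovery reservation
// rate below is an assumption chosen only to illustrate how a qos_cost
// of 436 can turn into a multi-second stay in the queue.
#include <iostream>

int main() {
  const double qos_cost = 436.0;      // from the log line above
  const double recovery_rate = 90.0;  // ASSUMED cost units/sec reserved for recovery

  // In a dmClock-style scheduler, dequeuing an item advances the
  // reservation tag by cost / rate seconds, so the next recovery item
  // only becomes eligible after roughly that long.
  std::cout << "approx inter-item delay: "
            << qos_cost / recovery_rate << " s\n";  // ~4.8 s, within the observed 4-6 s
  return 0;
}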
The push reply object maintains a vector of PushReplyOp (replies). This vector is used to compute the overall cost which
adds up to over 80_M.
void compute_cost(CephContext *cct) {
  cost = 0;
  for (auto i = replies.begin(); i != replies.end(); ++i) {
    cost += i->cost(cct);
  }
}
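To make the arithmetic concrete, here is a minimal stand-alone sketch (not Ceph source) of that accumulation. The per-reply cost of 8_M mirrors the base cost mentioned above, and the reply count of 10 is a hypothetical value chosen to land near the logged total.

// Stand-alone illustration, NOT Ceph source: the aggregate message cost
// is the sum of per-reply costs, following the same pattern as
// compute_cost() above.
#include <cstdint>
#include <iostream>
#include <vector>

struct PushReplyOpStub {
  uint64_t cost() const { return 8ULL << 20; }  // ASSUMED ~8_M per reply
};

int main() {
  std::vector<PushReplyOpStub> replies(10);  // hypothetical reply count

  uint64_t cost = 0;
  for (const auto &r : replies)
    cost += r.cost();  // same accumulation as compute_cost()

  std::cout << "aggregate cost: " << cost << "\n";  // 83886080, close to the logged 83891080
  return 0;
}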
Looking further, I also see:
2023-01-19T21:00:38.358+0000 7f2f370cb700 10 osd.128 10682935 _maybe_queue_recovery starting 5, recovery_ops_reserved 0 -> 5
The above indicates that `osd_recovery_max_single_start` is set to 5 (default: 1), which is probably what inflates the cost of the push reply ops, since that many PushReplyOp costs are aggregated into a single MOSDPGPushReply item.
As an immediate mitigation, `osd_recovery_max_single_start` can be set back to 1; this should help the push reply ops get scheduled faster.
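For example, assuming the option was changed through the monitor config database rather than a per-daemon override, it could be reset with:

ceph config set osd osd_recovery_max_single_start 1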
Longer term we have a few alternatives:
- Prevent modification of `osd_recovery_max_single_start` similar to the way we prevent modification of the various sleep options with mclock enabled.
- Audit the costs of the various background ops and modify them to work well with mclock, while of course maintaining backward compatibility with the 'wpq' scheduler.
- Treat reply ops as higher priority and put them into a higher-priority queue (currently being implemented; a minimal sketch of the idea follows below). The higher-priority queue is not managed by mclock, and therefore these ops can be completed faster.
The above can be discussed and the best approach adopted.
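For the third alternative, here is a minimal sketch of the idea, assuming a plain FIFO for the high-priority replies and a stand-in deque for the mClock queue. This is not the in-progress Ceph implementation, just an illustration of the dequeue ordering.

// Minimal sketch of the "higher priority queue" idea, NOT the actual
// in-progress implementation: reply ops bypass the cost-based mClock
// queue entirely and are always drained first.
#include <deque>
#include <optional>
#include <utility>

template <typename Item>
class TwoLevelQueue {
  std::deque<Item> immediate_;  // reply ops: plain FIFO, no cost tags
  std::deque<Item> mclock_;     // stand-in for the mClock-managed queue

public:
  void enqueue_reply(Item i)  { immediate_.push_back(std::move(i)); }
  void enqueue_mclock(Item i) { mclock_.push_back(std::move(i)); }

  // Replies dequeue ahead of cost-scheduled work, so a large aggregate
  // message cost can no longer delay them by several seconds.
  std::optional<Item> dequeue() {
    auto &q = !immediate_.empty() ? immediate_ : mclock_;
    if (q.empty()) return std::nullopt;
    Item i = std::move(q.front());
    q.pop_front();
    return i;
  }
};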
#9 Updated by Samuel Just about 1 year ago
- Related to Bug #58529: osd: very slow recovery due to delayed push reply messages added
#10 Updated by Samuel Just about 1 year ago
I opened https://tracker.ceph.com/issues/58529 to track the slow push reply behavior since it's probably unrelated to the original hang. See my comment there.