Bug #64869
rados/thrash: slow reservation response from 1 (115547ms) in cluster log
Description
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587833
The cluster log warning was raised because the scrub reservation response exceeded the default timeout. I didn't look into the specifics further. Should this be added to the ignorelist?
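For context, expected warnings are typically silenced in teuthology runs via a log-ignorelist entry in the suite yaml. A minimal sketch of such an entry; the exact file and placement within qa/suites are assumptions, not part of this report:

```yaml
# Hypothetical ignorelist fragment; the regex below is an assumption
# matching the warning text seen in this failure.
overrides:
  ceph:
    log-ignorelist:
      - slow reservation response
```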
Logs:
2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process 4.c to_process <OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270)> waiting <> waiting_peering {}
2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270) pg 0x5635ab21c000
2024-03-10T02:31:32.736+0000 7f70cdae0640 10 osd.6 330 dequeue_op MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 prio 127 cost 0 latency 0.402707 MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 pg pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ]
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 330 maybe_share_map: con v1:172.21.15.67:6809/4149191103 our osdmap epoch of 330 is not newer than session's projected_epoch of 330
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] _handle_message: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] scrubber<Session/ReservingReplicas>: handle_scrub_reserve_msgs: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 20 scrubberFSM event: --vvvv---- ReplicaGrant
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] scrubber<Session/ReservingReplicas>: FSM: ReservingReplicas::react(const ReplicaGrant&)
2024-03-10T02:31:32.740+0000 7f70cdae0640 0 log_channel(cluster) log [WRN] : slow reservation response from 1 (115547ms)
Updated by Sridhar Seshasayee about 2 months ago
- ceph-qa-suite rados added
- Component(RADOS) OSD added
Updated by Ronen Friedman about 2 months ago
- Status changed from New to In Progress
The problem is how to differentiate between two cases: instances where one of the scrub reservation messages is queued as 'waiting for pg' (usually because there was a map update, and possibly a map CRC error, or a backlog of slow ops on that PG), and instances where there is a real communication delay between two OSDs in a production cluster. The first scenario, which is usually encountered during testing, should not produce these warnings. But the second one is real, and is why we have this warning in the first place.
There is also probably a bug that allowed the timer to be set for queued scrub resource requests, where it shouldn't be.
Looking into it.
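The distinction above suggests arming the slow-response timer only once the reservation request is actually sent, rather than while it sits in a queue. A minimal sketch of that idea in Python; all names here are hypothetical and do not reflect Ceph's actual scrubber code:

```python
import time


class ReservationTimer:
    """Hypothetical sketch: measure reservation latency from the moment the
    request is sent on the wire, so local queueing delays (e.g. an op held
    as 'waiting for pg' after a map update) are not counted against it."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.sent_at = None  # timer stays unarmed while the request is queued

    def on_request_sent(self):
        # Arm the timer only at send time, not at enqueue time.
        self.sent_at = time.monotonic()

    def on_grant_received(self):
        """Return the elapsed time if it exceeded the timeout, else None."""
        if self.sent_at is None:
            return None  # grant arrived before the request was armed
        elapsed = time.monotonic() - self.sent_at
        self.sent_at = None
        return elapsed if elapsed > self.timeout_s else None
```

In this sketch, a grant that arrives while the timer is unarmed never raises a warning, which mirrors the intent of not timing queued requests.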
Updated by Ronen Friedman about 2 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56337
Updated by Radoslaw Zarzynski about 1 month ago
- Backport set to squid
Updating the backport field per today's core-sync.
Updated by Ronen Friedman about 1 month ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 month ago
- Copied to Backport #65072: squid: rados/thrash: slow reservation response from 1 (115547ms) in cluster log added