Bug #64869
rados/thrash: slow reservation response from 1 (115547ms) in cluster log
Description
/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587833
The cluster log warning was raised because the scrub reservation response exceeded the default timeout. I didn't look into the specifics further. Should this be added to the ignorelist?
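For context, expected warnings are typically silenced in teuthology runs via a log-ignorelist entry in the suite yaml. A minimal sketch of such an entry; the exact file and placement within qa/suites are assumptions, not part of this report:

```yaml
# Hypothetical ignorelist fragment; the regex below is an assumption
# matching the warning text seen in this failure.
overrides:
  ceph:
    log-ignorelist:
      - slow reservation response
```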
Logs:
2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process 4.c to_process <OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270)> waiting <> waiting_peering {}
2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270) pg 0x5635ab21c000
2024-03-10T02:31:32.736+0000 7f70cdae0640 10 osd.6 330 dequeue_op MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 prio 127 cost 0 latency 0.402707 MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 pg pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ]
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 330 maybe_share_map: con v1:172.21.15.67:6809/4149191103 our osdmap epoch of 330 is not newer than session's projected_epoch of 330
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] _handle_message: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] scrubber<Session/ReservingReplicas>: handle_scrub_reserve_msgs: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 20 scrubberFSM event: --vvvv---- ReplicaGrant
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c: ] ] scrubber<Session/ReservingReplicas>: FSM: ReservingReplicas::react(const ReplicaGrant&)
2024-03-10T02:31:32.740+0000 7f70cdae0640 0 log_channel(cluster) log [WRN] : slow reservation response from 1 (115547ms)
Updated by Sridhar Seshasayee about 2 months ago
- ceph-qa-suite rados added
- Component(RADOS) OSD added
Updated by Ronen Friedman about 2 months ago
- Status changed from New to In Progress
The problem is how to differentiate between two cases: instances where one of the scrub reservation messages is queued as 'waiting for pg' (usually because there was a map update, and possibly a map CRC error, or a backlog of slow ops on that PG), and instances where there is a real communication delay between two OSDs in a production cluster. The first scenario, which is usually encountered during testing, should not produce these warnings. But the second one is real, and is why we have this warning in the first place.
There is also probably a bug that allowed the timer to be set for queued scrub resource requests, where it shouldn't be.
Looking into it.
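The distinction above suggests arming the slow-response timer only once the reservation request is actually sent, rather than while it sits in a queue. A minimal sketch of that idea in Python; all names here are hypothetical and do not reflect Ceph's actual scrubber code:

```python
import time


class ReservationTimer:
    """Hypothetical sketch: measure reservation latency from the moment the
    request is sent on the wire, so local queueing delays (e.g. an op held
    as 'waiting for pg' after a map update) are not counted against it."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.sent_at = None  # timer stays unarmed while the request is queued

    def on_request_sent(self):
        # Arm the timer only at send time, not at enqueue time.
        self.sent_at = time.monotonic()

    def on_grant_received(self):
        """Return the elapsed time if it exceeded the timeout, else None."""
        if self.sent_at is None:
            return None  # grant arrived before the request was armed
        elapsed = time.monotonic() - self.sent_at
        self.sent_at = None
        return elapsed if elapsed > self.timeout_s else None
```

In this sketch, a grant that arrives while the timer is unarmed never raises a warning, which mirrors the intent of not timing queued requests.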
Updated by Ronen Friedman about 2 months ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 56337
Updated by Radoslaw Zarzynski about 1 month ago
- Backport set to squid
Updating the backport field per today's core-sync.
Updated by Ronen Friedman about 1 month ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 month ago
- Copied to Backport #65072: squid: rados/thrash: slow reservation response from 1 (115547ms) in cluster log added