Bug #64869

rados/thrash: slow reservation response from 1 (115547ms) in cluster log

Added by Sridhar Seshasayee about 2 months ago. Updated about 1 month ago.

Status:
Pending Backport
Priority:
Normal
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
backport_processed
Backport:
squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Component(RADOS):
OSD
Pull request ID:
56337
Crash signature (v1):
Crash signature (v2):

Description

/a/yuriw-2024-03-08_16:20:46-rados-wip-yuri4-testing-2024-03-05-0854-distro-default-smithi/7587833

The cluster log warning is raised because the scrub reservation exceeded the default timeout. I didn't
look into the specifics further. Should this warning be added to the ignorelist?
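
If the decision is to silence this in the thrash suites, a minimal sketch of the teuthology override is below. This assumes the usual log-ignorelist mechanism; the exact suite yaml file to carry it (and the final regex) would still need to be chosen:

    overrides:
      ceph:
        log-ignorelist:
          - slow reservation response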

Logs:

2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process 4.c to_process <OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270)> waiting <> waiting_peering {}
2024-03-10T02:31:32.736+0000 7f70cdae0640 20 osd.6 op_wq(4) _process OpSchedulerItem(4.c PGOpItem(op=MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3) class_id 2 prio 127 cost 0 e270) pg 0x5635ab21c000
2024-03-10T02:31:32.736+0000 7f70cdae0640 10 osd.6 330 dequeue_op MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 prio 127 cost 0 latency 0.402707 MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3 pg pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c:  ] ]
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 330 maybe_share_map: con v1:172.21.15.67:6809/4149191103 our osdmap epoch of 330 is not newer than session's projected_epoch of 330
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c:  ] ] _handle_message: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c:  ] ] scrubber<Session/ReservingReplicas>: handle_scrub_reserve_msgs: MOSDScrubReserve(4.c,GRANT,e:270,from:1,reservation_nonce:0) v3
2024-03-10T02:31:32.740+0000 7f70cdae0640 20  scrubberFSM  event: --vvvv---- ReplicaGrant
2024-03-10T02:31:32.740+0000 7f70cdae0640 10 osd.6 pg_epoch: 330 pg[4.c( v 143'128 (136'125,143'128] local-lis/les=259/260 n=4 ec=24/24 lis/c=259/259 les/c/f=260/260/0 sis=259) [6,3,1] r=0 lpr=259 crt=143'128 lcod 143'127 mlcod 0'0 active+clean+scrubbing [ 4.c:  ] ] scrubber<Session/ReservingReplicas>: FSM: ReservingReplicas::react(const ReplicaGrant&)
2024-03-10T02:31:32.740+0000 7f70cdae0640  0 log_channel(cluster) log [WRN] : slow reservation response from 1 (115547ms)

Related issues (1 open, 0 closed)

Copied to RADOS - Backport #65072: squid: rados/thrash: slow reservation response from 1 (115547ms) in cluster log (status: New, assignee: Ronen Friedman)
#1

Updated by Sridhar Seshasayee about 2 months ago

  • ceph-qa-suite rados added
  • Component(RADOS) OSD added
#2

Updated by Radoslaw Zarzynski about 1 month ago

  • Assignee set to Ronen Friedman
#3

Updated by Ronen Friedman about 1 month ago

  • Status changed from New to In Progress

The problem is how to differentiate between instances where one of the scrub reservation messages is delayed inside the OSD, either queued as 'waiting for pg' (usually because of a map update, possibly following a map CRC error) or stuck behind a backlog of slow ops on that PG, and instances in which there is a genuine communication delay between two OSDs in a real client cluster. The first scenario, which is what we usually encounter during testing, should not produce these warnings. The second one is real, and is why we have this warning in the first place.

There is also probably a bug that allows the timer to be set for queued scrub resource requests, where it should not be.

Looking into it.
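
To illustrate the direction (a sketch only, with hypothetical names rather than the actual scrubber code): the reservation clock would start only once the request is actually dispatched to the replica, so time spent queued behind 'waiting for pg' or a backlog of slow ops does not count toward the warning.

    // Illustrative sketch only: hypothetical types and names, not the real scrubber API.
    #include <chrono>
    #include <optional>

    struct ReplicaReservation {
      using clock = std::chrono::steady_clock;

      // unset while the request is still queued (e.g. 'waiting for pg')
      std::optional<clock::time_point> request_sent_at;

      // called when the reservation request is actually put on the wire
      void mark_request_sent() { request_sent_at = clock::now(); }

      // delay to report when the GRANT arrives; nullopt means the request
      // never left the queue, so no 'slow reservation response' warning
      std::optional<std::chrono::milliseconds> grant_delay() const {
        if (!request_sent_at) {
          return std::nullopt;
        }
        return std::chrono::duration_cast<std::chrono::milliseconds>(
            clock::now() - *request_sent_at);
      }
    };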

#4

Updated by Ronen Friedman about 1 month ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 56337
#5

Updated by Radoslaw Zarzynski about 1 month ago

  • Backport set to squid

Updating the backport field per today's core-sync.

#6

Updated by Ronen Friedman about 1 month ago

  • Status changed from Fix Under Review to Pending Backport
#7

Updated by Backport Bot about 1 month ago

  • Copied to Backport #65072: squid: rados/thrash: slow reservation response from 1 (115547ms) in cluster log added
#8

Updated by Backport Bot about 1 month ago

  • Tags set to backport_processed