Bug #52568: RadosGW's hang when OSD's are in slow OPS state - rgw - Ceph

Bug #52568

open

Added by ronnie laptop over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee:
Target version:
% Done:
Source: Community (user)
Tags: rgw
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions: Ceph - v15.2.13
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During OSD recovery periods we observe RGW hangs in our production environment. Based on the replication behaviour we see, we consider one of these issues to be a bug in RGW.

Our cluster layout:

37 hosts in total
8 hosts with NVMe OSDs for journals
4 hosts for MGR/MON/RGW/MDS
25 hosts with spinning-disk OSDs
The network between hosts is a 50 Gb/s active/active bond per node.
Some OSDs are still on 15.2.7 because we need to migrate them to an LVM layout (they are currently XFS), which is progressing slowly.
All other containers run 15.2.13.

When one or more spinning disks are replaced and Ceph enters a degraded state, replication starts, and we see this reflected in the I/O of the spinning disks, which is normal. With normal recovery settings we observe about 15 MiB/s and 50 objects/s of recovery, which is quite slow.

However, when we try to speed up the recovery with the following settings,

ceph config set osd osd_max_backfills 6
ceph config set osd osd_recovery_max_active 18

traffic between all the OSDs increases, and some disks can become so busy that 'slow OPS for OSD xxx' is reported. This in itself should be OK: client I/O could be impacted, but that would be acceptable.
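
For reference, a minimal sketch of how the two overrides above could be inspected and removed again once recovery has finished. It assumes they were set in the central config database with `ceph config set`, as shown. The `run` wrapper, the `DRY_RUN` variable, and the `restore_recovery_defaults` function are illustrative helpers, not part of Ceph; by default the script only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Show the stored value of each recovery override, then remove it so the
# built-in default applies again.
restore_recovery_defaults() {
    for opt in osd_max_backfills osd_recovery_max_active; do
        run ceph config get osd "$opt"
        run ceph config rm osd "$opt"
    done
}

restore_recovery_defaults
```

Running it with `DRY_RUN=0` executes the commands for real, so it is worth reviewing the dry-run output first.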

However, at some point all the RGW processes freeze, and they only recover when we restart one or more of the OSDs that report slow OPS. During these freezes, all client I/O stops.

We cannot really figure out what the problem is, other than that we think RGW should recover from these states by itself.
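
To help narrow this down, here is a sketch of the state that could be captured during a freeze, before any OSD is restarted. It assumes the admin sockets of the affected OSD and of the RGW are reachable from where the script runs (in a containerized deployment that likely means inside the respective container). The OSD id and the socket path are placeholders, and `run`, `DRY_RUN`, and `capture_freeze_state` are illustrative helpers, not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper: print each command unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

capture_freeze_state() {
    osd_id="$1"     # id of an OSD reported with slow ops
    rgw_asok="$2"   # path to the RGW admin socket (deployment-specific)

    # Which daemons currently report slow ops.
    run ceph health detail
    # Operations the OSD is still holding (must run where its socket lives).
    run ceph daemon "osd.${osd_id}" dump_ops_in_flight
    # Requests the RGW's RADOS client is still waiting on, and for which OSDs.
    run ceph daemon "$rgw_asok" objecter_requests
}

# Placeholder arguments; substitute a real OSD id and socket path.
capture_freeze_state 0 /var/run/ceph/ceph-client.rgw.example.asok
```

If the RGW's outstanding requests all point at the OSDs listed with slow ops, that would at least suggest the gateway is waiting on those OSDs rather than being stuck internally.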

Related issues 1 (1 open — 0 closed)