Bug #52568

open

RadosGW hangs when OSDs are in slow OPS state

Added by ronnie laptop over 2 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: rgw
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During OSD recovery periods we observe RGW hangs in our PROD environment, and from observing the replication behaviour we consider one of the issues to be a bug in RGW.

With a cluster layout of:
  • 37 hosts in total
  • 8 hosts with NVMe OSDs for journals
  • 4 hosts for MGR/MON/RGW/MDS
  • 25 hosts with spinning-disk OSDs
  • the network between hosts is a 50 Gb/s active/active bond per node
  • some OSDs are still on 15.2.7, as we need to migrate these OSDs to an LVM structure (now XFS), which is progressing slowly
  • all other containers run 15.2.13

When one or more spinning disks are replaced and Ceph enters a degraded state, replication starts, and we see this reflected in the IO of the spinning disks, which is normal. With normal recovery settings we observe ~15 MiB/s and 50 objects/s of recovery, which is quite slow.

However, when we try to speed up the recovery with the following settings,
  • ceph config set osd osd_max_backfills 6
  • ceph config set osd osd_recovery_max_active 18

traffic between all the OSDs increases, and some disks can become so busy that 'slow OPS for OSD xxx' is reported. This in itself should be OK: it could impact client IO, but that would be acceptable.
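
For reference, a minimal sketch of how the two overrides listed above could be applied and later rolled back while reproducing this. It assumes a POSIX shell on a node with admin access to the cluster. The function names and the CEPH variable are hypothetical helpers for illustration only and are not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ceph.
# CEPH can be overridden (e.g. CEPH="echo ceph") to dry-run the commands.
CEPH=${CEPH:-ceph}

# Raise the recovery limits to the values used in this report.
raise_recovery_limits() {
    $CEPH config set osd osd_max_backfills 6
    $CEPH config set osd osd_recovery_max_active 18
}

# Remove both overrides so the OSDs fall back to their default values.
reset_recovery_limits() {
    $CEPH config rm osd osd_max_backfills
    $CEPH config rm osd osd_recovery_max_active
}
```

Calling raise_recovery_limits before a disk replacement and reset_recovery_limits once slow OPS warnings appear would make it easier to correlate the RGW freezes with the raised limits.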

However, at some point all the RGW processes freeze, and they only recover when we restart one or more of the OSDs that report the slow OPS. During these freezes, all client IO is stopped.
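
To help narrow down such a freeze before restarting anything, the operations stuck on the affected OSDs could be captured first. The following is only a sketch under stated assumptions: the function names, the CEPH variable, and the OSD ids in the usage note are hypothetical, and the `ceph daemon` calls talk to the local admin socket, so they would have to run on the host (or inside the container) of the OSD in question.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ceph.
# CEPH can be overridden (e.g. CEPH="echo ceph") to dry-run the commands.
CEPH=${CEPH:-ceph}

# Show which daemons the cluster currently flags in its health report.
show_health() {
    $CEPH health detail
}

# Dump in-flight and recently recorded slow operations
# for every OSD id passed as an argument.
dump_slow_ops() {
    for id in "$@"; do
        $CEPH daemon "osd.$id" dump_ops_in_flight
        $CEPH daemon "osd.$id" dump_historic_slow_ops
    done
}
```

For example, `dump_slow_ops 12 37` (ids chosen arbitrarily) would collect the pending operations of osd.12 and osd.37, which could then be attached to this report.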

We cannot really figure out what the problem is, other than that we think RGW should recover from these states by itself.


Related issues 1 (1 open, 0 closed)

Related to rgw - Feature #51655: No automatic alarm / recovery for unresponsive RGW (New)

Actions #1

Updated by Casey Bodley over 2 years ago

  • Related to Feature #51655: No automatic alarm / recovery for unresponsive RGW added
Actions #2

Updated by Loïc Dachary over 2 years ago

  • Target version deleted (v15.2.15)