
Bug #56733 (open): Since Pacific upgrade, sporadic latencies plateau on random OSD/disks

Added by Gilles Mocellin almost 2 years ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello,

Since our upgrade to Pacific, we have suffered from sporadic latencies on disks, and not always the same ones.
The cluster backs an OpenStack cloud, and VM workloads are heavily impacted during these latency episodes.

As we have SSDs for RocksDB+WAL, I found #56488, and increasing bluestore_prefer_deferred_size_hdd as mentioned there has helped a bit, but we continue to see high-latency periods (30 min to 1 h) of around 900 ms on the HDDs. Once latency reaches 1 s, everything halts and we get slow ops.
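
For reference, the workaround is applied cluster-wide with something like the following (the value 65537, just above 64 KiB, is the one suggested in #56488; treat it as illustrative, not a verified fix):

    # Raise the HDD deferred-write threshold (illustrative value from #56488)
    ceph config set osd bluestore_prefer_deferred_size_hdd 65537
    # Check the value seen by a given OSD
    ceph config get osd.0 bluestore_prefer_deferred_size_hdd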

It seems (though I don't have enough occurrences to be sure) that setting noscrub+nodeep-scrub during an event stops it, as shown below.
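
For the record, the flags are set with the standard CLI, and unset again once the episode is over:

    # Temporarily stop new scrubs during a latency episode
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # Re-enable scrubbing afterwards
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub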

What's also unexpected is that the latency is on reads, impacting writes, not the reverse. During the latency plateau there are not many IOPS (~10) nor much throughput (~20 MB/s) on the impacted drive.
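
Those numbers come from watching the device during an episode, along these lines (the device name is an example; compare r_await/w_await against r/s and w/s):

    # Extended per-device statistics at a 1-second interval
    iostat -x /dev/sdb 1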

The drive looks healthy in its SMART data (see the checks below).
I've removed the first drive/OSD I found with this problem, several times over, but the problem then appeared on other OSDs/drives.
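
The SMART check was done with the usual tools, roughly as follows (device path and device ID are examples):

    # Raw SMART data for the suspect drive
    smartctl -a /dev/sdb
    # Ceph's own device health view (requires device monitoring)
    ceph device ls
    ceph device get-health-metrics <devid>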


Files

Capture d’écran du 2022-08-01 17-01-25.png (225 KB): Dashboard graphs for the impacted OSD/drive. Gilles Mocellin, 08/01/2022 03:01 PM
logs.tar.gz (21.4 KB): original-osd.0.perf, during-osd.0.perf, iostat-osd.0.out, ceph-osd.0.log, ceph-mon.log. Gilles Mocellin, 08/05/2022 01:25 PM
logs.tar.gz (101 KB): Gilles Mocellin, 08/09/2022 12:35 PM