Bug #52523


Latency spikes causing timeouts after upgrade to pacific (16.2.5)

Added by Roland Sommer over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After having run pacific in our low-volume staging system for 2 months, yesterday we upgraded our production cluster from octopus to pacific. When the first OSD node was upgraded, we saw timeouts in several applications accessing the ceph cluster via radosgw. At first we thought this was caused by rebalancing/recovery, but the problem persisted after the cluster was green again. Every 10 minutes we experience a rise in latency. While looking for a possible cause, we could see in our metrics system that disk IO has gone up since the restart of the OSD nodes. Searching further, we saw that the interval ingest volume in the rocksdb statistics has also gone up by a factor of around 200 (see attached log excerpt).

The IO has increased on the blockdb devices (which are SSDs); IO on the data devices (ordinary HDDs) has not increased. The general setup is one SSD with multiple LVM volumes acting as blockdb devices for the HDDs (everything on bluestore).
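
For anyone trying to confirm the same pattern, a minimal sketch follows that polls an OSD's bluestore deferred-write counters via the admin socket (the mechanism the related Bug #52089 implicates). It assumes it runs on the OSD host, that "ceph daemon osd.<id> perf dump" is available, and that the counters "deferred_write_ops"/"deferred_write_bytes" exist under the "bluestore" section; counter names may differ between releases.

    #!/usr/bin/env python3
    # Sketch: watch how many writes an OSD is deferring (i.e. writing to the
    # blockdb/WAL device first) by diffing bluestore perf counters over time.
    # Assumptions: run on the OSD host; counter names may vary by release.
    import json
    import subprocess
    import sys
    import time

    osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"

    def deferred_counters(osd):
        # "ceph daemon osd.<id> perf dump" returns JSON from the admin socket.
        out = subprocess.check_output(["ceph", "daemon", f"osd.{osd}", "perf", "dump"])
        bluestore = json.loads(out).get("bluestore", {})
        return (bluestore.get("deferred_write_ops", 0),
                bluestore.get("deferred_write_bytes", 0))

    prev_ops, prev_bytes = deferred_counters(osd_id)
    while True:
        time.sleep(60)
        ops, nbytes = deferred_counters(osd_id)
        print(f"osd.{osd_id}: {ops - prev_ops} deferred ops, "
              f"{nbytes - prev_bytes} deferred bytes in the last 60s")
        prev_ops, prev_bytes = ops, nbytes

A sharp, sustained rise in these counters after the upgrade would point at the deferred-write behaviour rather than rebalancing/recovery.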


Files

ingest-logs.txt (2.29 KB) - Roland Sommer, 09/07/2021 10:00 AM
disk-io.png (252 KB) - Roland Sommer, 09/07/2021 10:00 AM
osd-latency.png (414 KB) - Roland Sommer, 09/07/2021 10:00 AM
disk-read-write-data.png (259 KB) - Roland Sommer, 09/07/2021 01:22 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #52089: Deferred writes are unexpectedly applied to large writes on spinners (Resolved)
