Bug #52523 (closed)

Latency spikes causing timeouts after upgrade to pacific (16.2.5)

Added by Roland Sommer over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

After running pacific in our low-volume staging system for two months, yesterday we upgraded our production cluster from octopus to pacific. When the first OSD node was upgraded, we saw timeouts in several applications accessing the ceph cluster via radosgw. At first we thought this was caused by rebalancing/recovery, but the problem persisted after the cluster was green again. Every 10 minutes we experience a rise in latency. While searching for a possible cause, we could see in our metrics system that disk IO has gone up since the restart of the OSD nodes. Digging further, we saw that the interval ingest volume in the rocksdb statistics has also gone up by a factor of around 200 (see attached log excerpt).

The IO has increased on the blockdb devices (which are SSDs); IO on the data devices (normal HDDs) has not increased. The general setup is one SSD with multiple LVM volumes acting as blockdb devices for the HDDs (everything on bluestore).
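For anyone trying to reproduce this observation, the extra write volume on the db SSD can be measured from an OSD's BlueStore perf counters via the admin socket. A minimal sketch, not from this ticket (osd.0 and the 600 s window are placeholders; run it on an OSD node with admin-socket access):

```shell
# Snapshot the deferred-write counters (deferred writes land in the
# RocksDB WAL on the blockdb SSD first, so they are written twice).
ceph daemon osd.0 perf dump bluestore | grep -E 'deferred_write_(ops|bytes)'

# Reset the counters, wait through one latency-spike interval, and dump
# again to see how much data was deferred in that window.
ceph daemon osd.0 perf reset all
sleep 600
ceph daemon osd.0 perf dump bluestore | grep -E 'deferred_write_(ops|bytes)'
```

Comparing these numbers against the client write volume for the same window gives a rough write-amplification figure to set beside the rocksdb "interval ingest" statistic.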


Files

ingest-logs.txt (2.29 KB) ingest-logs.txt Roland Sommer, 09/07/2021 10:00 AM
disk-io.png (252 KB) disk-io.png Roland Sommer, 09/07/2021 10:00 AM
osd-latency.png (414 KB) osd-latency.png Roland Sommer, 09/07/2021 10:00 AM
disk-read-write-data.png (259 KB) disk-read-write-data.png Roland Sommer, 09/07/2021 01:22 PM

Related issues: 1 (0 open, 1 closed)

Is duplicate of bluestore - Bug #52089: Deferred writes are unexpectedly applied to large writes on spinners (Resolved)

Actions #1

Updated by Roland Sommer over 2 years ago

I attached another graph showing the increased amount of written data.

Actions #2

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
Actions #3

Updated by Igor Fedotov over 2 years ago

This could be related to https://tracker.ceph.com/issues/52089
@Roland, could you please upgrade to 16.2.6 and update the issue status?
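For context, #52089 concerns BlueStore deferring writes on HDD-backed OSDs that should go straight to the data disk, which inflates write volume on the shared db SSD. The threshold governing that behavior can be inspected from the cluster configuration; a sketch, assuming a reachable monitor (not a command from this ticket):

```shell
# Writes at or below this size on HDD-backed BlueStore OSDs take the
# deferred path (WAL on the db device first, final location later).
ceph config get osd bluestore_prefer_deferred_size_hdd
```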

Actions #4

Updated by Roland Sommer over 2 years ago

We started rolling out 16.2.5-522-gde2ff323-1bionic from the dev repos on the OSD nodes, as there is no v16.2.6 release tag yet. The dev builds contain the mentioned commit. So far, this seems to have fixed the problem.
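A quick way to confirm that a staged rollout like this has actually reached the daemons (a generic sketch, assuming admin access to the cluster):

```shell
# Summarize which release each daemon type is running; multiple entries
# under "osd" mean the rollout is still in progress on some nodes.
ceph versions
```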

Actions #5

Updated by Roland Sommer over 2 years ago

The cluster has been running without any problems since we rolled out the latest dev release from the pacific branch to all nodes. Regarding the issue status: I am not allowed to change any of the fields, they stay read-only while editing.

Actions #6

Updated by Igor Fedotov over 2 years ago

Roland Sommer wrote:

The cluster has been running without any problems since we rolled out the latest dev release from the pacific branch to all nodes. Regarding the issue status: I am not allowed to change any of the fields, they stay read-only while editing.

So I'm going to mark this ticket as a duplicate and close it. Any objections?

Actions #7

Updated by Roland Sommer over 2 years ago

The ticket can be closed from our side. It may well be a duplicate, though I can't say that for sure; in any case, I have no objections to marking it as one. Thanks for the quick help.

Actions #8

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to Duplicate
Actions #9

Updated by Igor Fedotov over 2 years ago

  • Is duplicate of Bug #52089: Deferred writes are unexpectedly applied to large writes on spinners added