Documentation #44958

Ceph v12.2.13 causes extreme high number of blocked operations

Added by Chris Jones about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee: Casey Bodley
Target version: -
% Done: 0%
Tags: gc
Backport: nautilus octopus
Reviewed:
Affected Versions:
Pull request ID: 34952

Description

Ceph v12.2.13 yields an extremely high number of blocked requests.

We are using Ceph v12.2.12 on some of our clusters with relatively few issues. We attempted an upgrade to v12.2.13 on several clusters and immediately started seeing an extremely high number of blocked requests under load. This in turn caused OSD suicides and a major overall reduction in cluster performance. For a cluster used to handling thousands of requests per minute, with extremely rare blocked requests on v12.2.12, this was very concerning. Blocked requests would shoot up into the thousands across 10 to 100 OSDs at a time.

The condition was so severe that, while there was no data loss, the cluster became impractical to use. We reverted to v12.2.12 and the problem went away.

The condition was not consistent; it appeared sporadically as the load on the cluster changed. We suspect the blocked operations may have been related to leveldb compaction triggering a cascading effect, but we are not certain.

This cluster was freshly installed as Jewel v10.2.11 approximately 1-2 years ago and has been in operation since. We first upgraded to v12.2.12 when it was released, and then to v12.2.13.

There were no issues with v10.2.11 or v12.2.12, but v12.2.13 immediately yielded performance problems, primarily in the form of blocked OSD requests and OSD suicides due to the suicide timeout. Increasing the suicide timeout only exacerbated the issue by allowing blocked OSD requests to block for longer periods of time.

All v12.2.13 clusters have since been downgraded to v12.2.12, so I do not have an active cluster on which to debug, but I am very willing to provide any additional detail you might need to diagnose or explain the issue, or to replicate it.

Cluster size is approx 2 PB, with 9 cluster nodes of approx 60x 6 TB HDD spinning disks per node (540 total disks in the cluster), using 6/3 erasure coding. Some of the upgraded clusters have SSD journals, while some do not. We are using FileStore and XFS.

Our main concern is that v12.2.12 runs perfectly well, while v12.2.13 does not.

Please let us know if this is a known issue, and how we can help resolve this.


Related issues 2 (0 open, 2 closed)

Copied to rgw - Backport #45479: octopus: Ceph v12.2.13 causes extreme high number of blocked operations (Resolved, Nathan Cutler)
Copied to rgw - Backport #45480: nautilus: Ceph v12.2.13 causes extreme high number of blocked operations (Resolved, Nathan Cutler)

#1

Updated by Chris Jones about 4 years ago

It appears that the increased efficiency of garbage collection in v12.2.13 versus v12.2.12 is the root cause of the blocked/slow requests. In v12.2.12 we had very aggressive garbage collection settings in order to keep up with garbage collection. In v12.2.13, those same settings caused extremely high numbers of garbage items to be removed in a short period of time. This in turn led to high rates of leveldb compaction, which was causing the slow requests and eventually the OSD suicides.

By reverting our garbage collection configuration to a much more conservative rate, we have resolved the situation.
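
For illustration, the kind of conservative tuning involved looks like this in ceph.conf; the section name and values below are only examples of the standard RGW garbage collection options, not our exact configuration:

    [client.rgw.gateway1]              # example RGW instance section
    rgw_gc_max_objs = 32               # number of shard objects backing the GC queue
    rgw_gc_obj_min_wait = 7200         # seconds a deleted object waits before it is GC-eligible
    rgw_gc_processor_period = 3600     # seconds between GC processing cycles
    rgw_gc_processor_max_time = 3600   # maximum seconds a single GC cycle may run

Longer wait and period values spread the deletes, and the resulting leveldb compaction, over a wider window.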

#2

Updated by Dan Hill about 4 years ago

Yeah, there were several improvements to GC processing in 12.2.13:
  • rgw: gc use aio (issue#24592, pr#28784, Yehuda Sadeh, Zhang Shaowen, Yao Zongyou, Jesse Williamson)
  • rgw: resolve bugs and clean up garbage collection code (issue#38454, pr#31664, Dan Hill, J. Eric Ivancich)

Perhaps the AIO GC change should be mentioned more prominently in the release notes?

#3

Updated by Chris Jones about 4 years ago

There is an undocumented setting, rgw gc max concurrent io, that appears to have been introduced in this release. I did not find it in any official Ceph documentation, but I did see it in Red Hat's documentation. It would be good to document this config value and note its effect.

I have been experimenting with different settings of garbage collection with varying degrees of success.
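
For reference, the newer concurrency-related options sit alongside the classic GC settings in ceph.conf. The section name below is an example and the values are purely illustrative, not a recommendation:

    [client.rgw.gateway1]            # example RGW instance section
    rgw_gc_max_concurrent_io = 10    # maximum concurrent IO operations during GC processing
    rgw_gc_max_trim_chunk = 16       # maximum GC log entries removed per trim operation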

#4

Updated by Casey Bodley almost 4 years ago

  • Tracker changed from Bug to Documentation
  • Assignee set to Casey Bodley
  • Tags set to gc

#5

Updated by Casey Bodley almost 4 years ago

  • Backport set to nautilus octopus
  • Pull request ID set to 34952

#6

Updated by Casey Bodley almost 4 years ago

  • Status changed from New to Pending Backport

#7

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #45479: octopus: Ceph v12.2.13 causes extreme high number of blocked operations added

#8

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #45480: nautilus: Ceph v12.2.13 causes extreme high number of blocked operations added

#9

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
