Documentation #44958


Ceph v12.2.13 causes extreme high number of blocked operations

Added by Chris Jones about 4 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Tags: gc
Backport: nautilus, octopus
Reviewed:
Affected Versions:
Pull request ID:

Description

Ceph v12.2.13 yields an extremely high number of blocked requests.

We are using Ceph v12.2.12 on some of our clusters with relatively few issues. We attempted an upgrade to v12.2.13 on several clusters and immediately started seeing an extremely high number of blocked requests under load. This in turn caused OSD suicides and a major overall reduction in cluster performance. For a cluster accustomed to handling thousands of requests per minute, with extremely rare blocked requests on v12.2.12, this was very concerning. Blocked requests would shoot up into the thousands across 10 to 100 OSDs at a time.
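For reference, this kind of backlog is visible with the standard tooling; the commands below are a general sketch of how the symptom can be observed (the OSD id is a placeholder), not an exact transcript of what we ran:

    ceph health detail                      # shows the slow/blocked request warnings and the OSDs involved
    ceph daemon osd.12 dump_blocked_ops     # on the OSD host: per-op detail for currently blocked operations
    ceph daemon osd.12 dump_ops_in_flight   # all in-flight ops, including how long each has been queued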

The condition was so severe that, while there was no data loss, the cluster became impractical to use. We reverted to v12.2.12 and the problem went away.

The condition was not constant; it appeared sporadically as the load on the cluster changed. We suspect the blocked operations may have been related to leveldb compaction triggering a cascading effect, but we are not certain.
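As an illustration only (we no longer have an affected cluster to confirm against), compaction activity for a filestore OSD's leveldb omap store is normally recorded in leveldb's own LOG file under the OSD data directory; the path below assumes the default layout and a placeholder OSD id:

    # leveldb writes compaction messages to its LOG inside the filestore omap directory
    grep -i compact /var/lib/ceph/osd/ceph-12/current/omap/LOG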

This cluster was freshly installed with Jewel v10.2.11 approximately 1-2 years ago and has been in operation since. We first upgraded to v12.2.12 when it was released, and then to v12.2.13.

There were no issues with v10.2.11 or v12.2.12, but v12.2.13 immediately produced performance problems, primarily in the form of blocked OSD requests and OSD suicides triggered by the suicide timeout. Increasing the suicide timeout (see the sketch below) only exacerbated the issue by allowing blocked OSD requests to block for longer periods of time.
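For context, "increasing the suicide timeout" refers to raising the OSD op thread suicide timeout; a minimal sketch of the kind of change meant here, with a value of 300 seconds chosen purely for illustration:

    # raise the timeout at runtime on all OSDs
    ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 300'

    # or persist it in ceph.conf on the OSD hosts
    [osd]
        osd op thread suicide timeout = 300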

All v12.2.13 clusters have since been downgraded to v12.2.12, so I do not have an active cluster on which to debug, but I am very willing to provide any additional detail you might need to diagnose, explain, or replicate the issue.

Cluster size is approx 2PB, with 9 cluster nodes of approx 60 x 6TB spinning HDDs per node (540 disks total in the cluster), using 6/3 erasure coding. Some of the upgraded clusters have SSD journals, while some do not. We are using filestore on XFS.
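For completeness, "6/3 erasure coding" means a k=6, m=3 profile; a hypothetical example of how such a profile and pool are defined (the profile name, pool name, PG count, and failure domain are placeholders, not our exact settings):

    ceph osd erasure-code-profile set ec-6-3 k=6 m=3 crush-failure-domain=host
    ceph osd pool create mypool 2048 2048 erasure ec-6-3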

Our main concern is that v12.2.12 runs perfectly well, while v12.2.13 does not.

Please let us know if this is a known issue, and how we can help resolve this.


Related issues (2 total: 0 open, 2 closed)

Copied to rgw - Backport #45479: octopus: Ceph v12.2.13 causes extreme high number of blocked operations (Resolved, Nathan Cutler)
Copied to rgw - Backport #45480: nautilus: Ceph v12.2.13 causes extreme high number of blocked operations (Resolved, Nathan Cutler)
