Bug #37289

Issue with overfilled OSD for cache-tier pools

Added by Oleksandr Mykhalskyi over 5 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Tiering
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a bad issue in our Ceph cluster.

Centos 7.5 (3.10.0-862.3.2.el7.x86_64)
Luminous 12.2.5, bluestore OSDs, using cache-tier feature
Openstack Pike (qemu-kvm-ev-2.10.0, libvirt-daemon-3.9.0-14)
Affected clients in the cloud have various OS (Centos 7.2, Centos 7.5, Redhat 6.7)

When one of the OSDs (device class ssd, where our cache-tier pools are located) reached 95% utilization, all cache-tier pools became blocked, as expected. I added more OSDs to resolve the overflow and expected that the clients would unfreeze and continue to work, as happens after an overflow of a regular replicated pool (or after reaching a quota on a replicated pool).
But they did not…

The clients stayed in a hung state and we had to reboot them. After reboot there were errors like:
[ 9.551419] blk_update_request: I/O error, dev vdb, sector 20973600
[ 9.555494] Buffer I/O error on device vdb2, logical block 4
[ 9.559532] lost page write due to I/O error on vdb2

We fixed it by running “rbd object-map rebuild” on the affected volumes.
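For reference, the rebuild is done per image; the pool and image names below are placeholders, not values from this cluster:

```shell
# Rebuild the object map of one affected RBD image
# (replace <pool>/<image> with the actual pool and image name).
rbd object-map rebuild <pool>/<image>

# Afterwards, check that the object-map flag is no longer marked invalid:
rbd info <pool>/<image>
```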

From ceph documentation:
“When a pool quota is reached, librados operations now block indefinitely, the same way they do when the cluster fills up. (Previously they would return -ENOSPC.)
By default, a full cluster or pool will now block. If your librados application can handle ENOSPC or EDQUOT errors gracefully,
you can get error returns instead by using the new librados OPERATION_FULL_TRY flag”

It seems that this behavior does not work for cache-tier pools?
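To illustrate what "handling ENOSPC or EDQUOT gracefully" means on the client side, here is a minimal sketch of the error-handling pattern a librados application would need once the FULL_TRY flag is in effect. The function name and the simulated writer are hypothetical (they stand in for a real `Ioctx` write call); no cluster is involved:

```python
import errno

def write_with_full_handling(write_fn, oid, data):
    """Attempt a write and translate full-cluster/full-pool errors.

    write_fn is any callable that raises OSError with errno set to
    ENOSPC (cluster or pool full) or EDQUOT (pool quota exceeded) --
    the errors librados can return instead of blocking when the
    FULL_TRY flag is set.
    """
    try:
        write_fn(oid, data)
        return "ok"
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return "full"    # back off, free space, retry later
        if e.errno == errno.EDQUOT:
            return "quota"   # pool quota reached
        raise                # unrelated error: propagate

# Simulated writer standing in for a real librados write on a full pool:
def full_writer(oid, data):
    raise OSError(errno.ENOSPC, "No space left on device")

print(write_with_full_handling(full_writer, "obj1", b"data"))  # full
```

The point of the report is that with cache tiering the client apparently never gets a chance to apply this pattern: the I/O hangs and the guest has to be rebooted instead.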

Details of the test ceph cluster I created to reproduce the issue are in the attachment.

P.S. I tried this case on Luminous 12.2.9, with the same results.

Thank you

details_ceph_cluster.txt View (5.67 KB) Oleksandr Mykhalskyi, 11/16/2018 12:43 PM

History

#1 Updated by Greg Farnum over 5 years ago

  • Project changed from Ceph to RADOS
  • Category changed from common to Tiering

#2 Updated by Sage Weil over 5 years ago

  • Status changed from New to 12

I think the first question to answer is whether this can be reproduced without cache tiering. It's not immediately clear to me that it's tiering-related; it might just be a problem with our ENOSPC handling.

#3 Updated by Oleksandr Mykhalskyi over 5 years ago

Without cache tiering everything is fine.

After an OSD backing my replicated pool (without a cache tier) reaches 95% utilization, I see "freezing" of I/O activity on the guests. After the OSD utilization problem is resolved, the guests continue to work without any issues. The behaviour is the same after reaching a quota on such a pool.

We have problems only with cache tiering.

#4 Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
