Bug #52277

[pwl] IO hang when the single IO size * io_depth > cache size

Added by CONGMIN YIN over 2 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This bug is a boundary case that is hard to hit: it occurs only when single IO size * io_depth > cache size, and only in persist_on_flush mode.
For example, with cache size = 1 GB, bs = 400M, and iodepth = 4, the actual usable cache capacity is 751619276 bytes, so the cache can hold only a single in-flight IO. First, one write request is written to the cache while the other three sit in the deferred queue. A sync point and a flush request are then generated, but because the iodepth is 4, the flush request lands at the tail of the deferred queue. Writing data back from the cache must wait for the sync point to be written to the cache, while the flush request waits for the deferred queue to be dispatched. If the data in the cache cannot be written back, no space is freed; without free space, neither the sync point nor the write requests ahead of it in the deferred queue can be written. The result is an IO hang.
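The capacity arithmetic above can be checked directly (numbers taken from the report; the usable-capacity figure of 751619276 bytes is the value observed in this run, not a documented formula):

```python
cache_size = 1073741824   # rbd_persistent_cache_size = 1 GiB
usable_cap = 751619276    # actual cache cap observed for this run
bs = 400 * 1024 * 1024    # single IO size: 400 MiB
iodepth = 4

# Only one in-flight 400 MiB write fits in the usable cache at a time,
# so the remaining three requests must wait in the deferred queue.
print(usable_cap // bs)            # -> 1

# Total outstanding data exceeds even the nominal cache size,
# which is exactly the boundary condition that triggers the hang.
print(bs * iodepth > cache_size)   # -> True
```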

How to reproduce:

# cat /etc/ceph/ceph.conf
[client]
    rbd_cache = false
    debug rbd_pwl = 5
    log_file = /var/log/ceph/rbd.log
    rbd_persistent_cache_mode = rwl
    rbd_plugins = pwl_cache
    rbd_persistent_cache_size = 1073741824
    rbd_persistent_cache_path = /mnt/pmem/cache/

root@cephs:~/test-fio/ceph# cat tmp.conf 
[global]

ioengine=rbd
clientname=admin
rw=write
#bs=10m
bs=400M
time_based=1
runtime=10s
iodepth=4
group_reporting

[volumes]
pool=test
rbdname=image1

Wrong solution: dispatch the flush request in advance. That change is too invasive and leads to unexpected problems; for example, https://github.com/ceph/ceph/pull/40208 would need to be reverted.
Expected solution: force writeback when space is insufficient and free space cannot be obtained through retries.
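The expected solution can be sketched as a retry-then-force-writeback allocation path. This is a toy model, not the actual PWL code; the class and function names are illustrative assumptions:

```python
class ToyCache:
    """Toy model of the PWL cache: tracks reserved and dirty bytes."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.dirty = 0

    def try_reserve(self, nbytes):
        """Reserve space for a write if it fits; mark it dirty."""
        if self.used + nbytes <= self.capacity:
            self.used += nbytes
            self.dirty += nbytes
            return True
        return False

    def force_writeback(self):
        """Write dirty entries back to the image and free their space."""
        self.used -= self.dirty
        self.dirty = 0

def allocate_with_forced_writeback(cache, nbytes, max_retries=3):
    """Try to reserve cache space; once retries are exhausted,
    force writeback instead of waiting forever for a sync point
    that itself cannot be written for lack of space."""
    for _ in range(max_retries):
        if cache.try_reserve(nbytes):
            return True
    cache.force_writeback()
    return cache.try_reserve(nbytes)
```

With the numbers from this report, a second 400 MiB write fails its retries, triggers the forced writeback, and then succeeds instead of hanging.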


Related issues

Related to rbd - Bug #52599: [pwl] flush requests are dispatched in advance Resolved

History

#1 Updated by CONGMIN YIN over 2 years ago

  • Status changed from New to Resolved
  • Pull request ID set to 40208

https://github.com/ceph/ceph/pull/40208 turned out not to be a problem; that solution is still adopted.

#2 Updated by Ilya Dryomov over 2 years ago

  • Status changed from Resolved to Closed
  • Pull request ID deleted (40208)

Changing to Closed as there is no code change associated with this ticket. https://github.com/ceph/ceph/pull/40208 was backported to pacific immediately after merge five months ago.

#3 Updated by CONGMIN YIN over 2 years ago

  • Related to Bug #52599: [pwl] flush requests are dispatched in advance added

#4 Updated by CONGMIN YIN over 2 years ago

  • Status changed from Closed to New

#5 Updated by CONGMIN YIN over 2 years ago

With the solution from https://tracker.ceph.com/issues/52599 applied, the current issue no longer appears: internal flush requests (from the sync point) are dispatched bypassing the deferred queue, so IO won't hang. Closing this issue.

#6 Updated by CONGMIN YIN over 2 years ago

  • Status changed from New to Closed
