Bug #57534: trash purge stuck and remove images hang when the pool quota is full

Added by kevin huang over 1 year ago. Updated about 1 year ago.

Status: Need More Info
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: No
Severity: 1 - critical
Reviewed: 09/14/2022
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hi all,

I tried to run "rbd trash purge" and remove unused images while one pool (xxxx032) was full.

However, the trash purge on that pool got stuck, and removing the unused images hung as well.

I have two Ceph clusters, one running Pacific (stable) 16.2.10 and the other running Octopus 15.2.17.

The issue can be reproduced on both versions.
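
The failing commands are presumably of this form (the pool name is anonymized above, and the image name here is just a placeholder):

rbd trash purge xxxx032            # gets stuck
rbd rm xxxx032/<unused-image>      # hangs as well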

#1

Updated by kevin huang over 1 year ago

The reproduction steps are as below:

[root@ceph-node01 ~]# ceph osd pool create test1 16
pool 'test1' created
[root@ceph-node01 ~]# rbd pool init test1
[root@ceph-node01 ~]# ceph osd pool set-quota test1 max_bytes $((1 * 1024 * 1024 * 1024))
set-quota max_bytes = 1073741824 for pool test1
[root@ceph-node01 ~]# rbd create --size 600 test1/img1g001 --thick-provision
Thick provisioning: 100% complete...done.
[root@ceph-node01 ~]# rbd create --size 600 test1/img1g002 --thick-provision
Thick provisioning: 100% complete...done.
[root@ceph-node01 ~]# rbd list test1 -l
NAME      SIZE     PARENT  FMT  PROT  LOCK
img1g001  600 MiB          2
img1g002  600 MiB          2
[root@ceph-node01 ~]# ceph health detail
HEALTH_WARN 1 pool(s) full
[WRN] POOL_FULL: 1 pool(s) full
pool 'test1' is full (running out of quota)
[root@ceph-node01 ~]# rbd rm test1/img1g001
CTRL+C
The rm command hangs ...
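
The removal blocks because deleting an image itself issues writes to the pool, and those writes stall while the quota is exceeded. A minimal workaround sketch, assuming the quota can be lifted temporarily (max_bytes 0 means unlimited):

# Lift the quota, remove the image, then put the quota back.
ceph osd pool set-quota test1 max_bytes 0
rbd rm test1/img1g001
ceph osd pool set-quota test1 max_bytes $((1 * 1024 * 1024 * 1024))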

#2

Updated by Ilya Dryomov over 1 year ago

Hi Kevin,

What version of Ceph is installed on the client side, i.e. on the node where you are running this test? What is the output of "rbd --version"?

The reason I ask is that this was fixed in 16.2.8 and later releases, with the caveat that the "problematic" remove must not be the first remove in that pool; see https://tracker.ceph.com/issues/52734. Your test case passes for me with that slight modification:

$ ceph osd pool create test1 16
pool 'test1' created
$ rbd pool init test1
$ ceph osd pool set-quota test1 max_bytes $((1 * 1024 * 1024 * 1024))
set-quota max_bytes = 1073741824 for pool test1
$ rbd create --size 1 test1/dummy                                           <------
$ rbd rm test1/dummy                                                        <------
Removing image: 100% complete...done.
$ rbd create --size 600 test1/img1g001 --thick-provision
Thick provisioning: 100% complete...done.
$ rbd create --size 600 test1/img1g002 --thick-provision
Thick provisioning: 100% complete...done.
$ ceph health detail
HEALTH_WARN 1 pool(s) full
[WRN] POOL_FULL: 1 pool(s) full
    pool 'test1' is full (running out of quota)
$ rbd rm test1/img1g001
Removing image: 100% complete...done.
$ rbd rm test1/img1g002
Removing image: 100% complete...done.
$ ceph health detail
HEALTH_OK

The need for the "dummy" remove is just an oversight -- it is only needed in the "reached quota" case, not when the pool actually becomes full. In practice, people tend to run into ENOSPC ("No space left on device") far more often than they run into EDQUOT ("Disk quota exceeded"), possibly because pool quota is not a widely used feature. Nevertheless, I'm going to address it ASAP.
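
A minimal sketch of that priming step, using the names from the transcript above -- run once per pool, right after "rbd pool init", so that a later removal succeeds even if the quota is hit first:

# Prime the pool with a throwaway create/remove cycle.
rbd create --size 1 test1/dummy
rbd rm test1/dummy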

#3

Updated by Ilya Dryomov over 1 year ago

  • Tags deleted (stuck pool quota full)
#4

Updated by Ilya Dryomov over 1 year ago

kevin huang wrote:

[root@ceph-node01 ~]# ceph osd pool set-quota test1 max_bytes $((1 * 1024 * 1024 * 1024))
set-quota max_bytes = 1073741824 for pool test1
[root@ceph-node01 ~]# rbd create --size 600 test1/img1g001 --thick-provision
Thick provisioning: 100% complete...done.
[root@ceph-node01 ~]# rbd create --size 600 test1/img1g002 --thick-provision
Thick provisioning: 100% complete...done.

Also note that you may need to CTRL+C this command as well. Because pool quota enforcement is not precise -- it lags behind by a few seconds -- sometimes you will be able to write 600M + 600M = 1.2G into a pool with max_bytes set to 1G, and sometimes not.
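
A rough way to observe that lag, assuming jq is available and that "ceph df -f json" exposes per-pool usage as .pools[].stats.bytes_used (the exact field name can vary between releases):

# Poll test1's usage against its 1 GiB quota while the writes are in flight.
while sleep 1; do
    ceph df -f json |
        jq -r '.pools[] | select(.name == "test1") | .stats.bytes_used'
done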

#5

Updated by Ilya Dryomov over 1 year ago

  • Status changed from New to Need More Info
#6

Updated by Ilya Dryomov about 1 year ago

  • Target version deleted (v16.2.11)