Bug #12659

Can't delete cache pool

Added by Paul Emmerich over 8 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
Tiering
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

this is a follow-up to issue #10138, where we ran into some strange problems with cache tiers.

One of the pools affected by these issues is still up and running, and we want to get rid of its cache tier.
However, we can't delete the cache tier because it seems to be stuck in a really weird state.

Situation:

$ ceph -v
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
$ ceph osd pool ls detail
pool 5 'data' replicated size 3 min_size 2 crush_ruleset 3 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 25444 lfor 901 flags hashpspool tiers 7 read_tier 7 write_tier 7 min_read_recency_for_promote 1 stripe_width 0
removed_snaps [1~b0,be~4,c3~a2,166~130,297~4,29c~c,2a9~10,2ba~93,34e~4a,39b~1,3af~44,408~1,41e~1,42f~1,445~1,45b~1,45d~1,45f~1,461~1,463~1,465~2]
pool 7 'ssd_cache' replicated size 3 min_size 2 crush_ruleset 4 object_hash rjenkins pg_num 64 pgp_num 64 last_change 27177 flags hashpspool,incomplete_clones tier_of 5 cache_mode forward hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 min_read_recency_for_promote 1 stripe_width 0

data is the backing pool, ssd_cache the cache tier. As you can see, we aren't really using the cache.
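
For reference, what we are ultimately trying to do is the documented removal procedure for a cache tier, roughly the following (assuming the cache can actually be emptied first):

$ ceph osd tier cache-mode ssd_cache forward                # stop new writes from landing in the cache
$ rados -p ssd_cache cache-flush-evict-all                  # flush and evict everything to the backing pool
$ ceph osd tier remove-overlay data                         # stop directing client traffic to the cache
$ ceph osd tier remove data ssd_cache                       # detach the cache tier
$ ceph osd pool delete ssd_cache ssd_cache --yes-i-really-really-mean-it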

We are trying to delete it like this:

$ ceph osd tier cache-mode ssd_cache forward
set cache-mode for pool 'ssd_cache' to forward
$ rados -p ssd_cache cache-flush-evict-all
    rbd_data.576562ae8944a.000000000000028b    
failed to evict /rbd_data.576562ae8944a.000000000000028b: (16) Device or resource busy
    rbd_header.1cc57b238e1f29    
failed to evict /rbd_header.1cc57b238e1f29: (16) Device or resource busy
    rbd_header.2019cd238e1f29    
... etc
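
In case it is relevant: as far as I understand, an evict fails with EBUSY when something still holds a reference to the object, e.g. an open RBD image keeps a watch on its rbd_header object. Checking one of the headers from the output above for watchers would be something like:

$ rados -p ssd_cache listwatchers rbd_header.1cc57b238e1f29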

All the objects it complains about are also visible when listing the pool:

$ rados -p ssd_cache ls
rbd_data.576562ae8944a.000000000000028b
rbd_header.1cc57b238e1f29
rbd_header.2019cd238e1f29
...

Since some of the RBD images in question don't contain any valuable data, we tried deleting one of the objects in the cache pool just to see what happens. However, deleting doesn't really seem to work on this cache pool:

$ rados -p ssd_cache rm rbd_header.784aca238e1f29
$ rados -p ssd_cache ls |grep rbd_header.784aca238e1f29
rbd_header.784aca238e1f29 # it's still there? huh?
$ rados -p ssd_cache rm rbd_header.784aca238e1f29
error removing ssd_cache>rbd_header.784aca238e1f29: (2) No such file or directory # or isn't it?

I'm going to assume this weird behavior is somehow expected, since this is a cache pool and we didn't delete the object from the backing pool; presumably the delete is recorded in the cache tier as a whiteout that still has to be flushed back.

However, the object we just deleted still shows up as one of the objects that it can't evict:

$ rados -p ssd_cache cache-flush-evict-all 2>&1 |grep 784aca238e1f29
    rbd_header.784aca238e1f29
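
Flushing and evicting that single object by hand might at least give a more specific error than the bulk run; if I read the rados help output correctly, that would be:

$ rados -p ssd_cache cache-flush rbd_header.784aca238e1f29
$ rados -p ssd_cache cache-evict rbd_header.784aca238e1f29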

Our current best option is to migrate the data from the broken pool into a new pool, but that would take quite some time.
Maybe someone has a better idea?
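
For completeness, the migration we have in mind is roughly the following (the pool and image names below are just placeholders, rbd cp does not preserve snapshots, and the images would have to be closed while copying):

$ ceph osd pool create data_new 1024 1024 replicated
$ rbd cp data/some_image data_new/some_image    # repeat for every image in the pool
# then point the clients at the new pool and delete the old one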

Paul
