
Bug #13894

RBD over cache tier over EC pool: rbd rm doesn't remove objects

Added by Laurent GUERBY about 3 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
11/27/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Pull request ID:

Description

When using rbd rm on an image stored in an EC 4+1 pool with a replicated size-3 cache tier, the command exits successfully but no objects are reclaimed at all.

Steps to reproduce : http://tracker.ceph.com/issues/13894#note-4

cmds (865 Bytes) Igor Fedotov, 12/03/2015 06:23 PM

more_ec_log.diff View (720 Bytes) Igor Fedotov, 12/03/2015 06:23 PM

osd.1.log.part (14.5 KB) Igor Fedotov, 12/03/2015 06:23 PM

osd.2.log.part (12.7 KB) Igor Fedotov, 12/03/2015 06:23 PM

osd.1.log.part (113 KB) Igor Fedotov, 12/03/2015 06:25 PM

osd.2.log.part (112 KB) Igor Fedotov, 12/03/2015 06:25 PM

History

#1 Updated by Laurent GUERBY about 3 years ago

We ran:

root@g1:~# date;time rbd rm ec4p1/volume-bd767663-ca3a-4b88-82bd-30817f20f762;echo $?;date
Wed Nov 25 16:23:25 CET 2015
Removing image: 100% complete...done.

real 2279m21.274s
user 4m20.364s
sys 1m27.716s
0
Fri Nov 27 06:22:46 CET 2015

38 hours for a 4 TB image.

Before and after the rbd rm, the ec4p1 line in rados df doesn't change, and more importantly no space is reclaimed on the OSD disks:

root@g1:~# rados df
pool name KB objects clones degraded unfound rd rd KB wr wr KB
disks 6740979131 1680227 2703 0 0 1451792717 54327494085 4881140923 41628432627
ec4p1 19224897390 4764040 0 0 0 20580225 673621965 9433745 31624769240
ec4p1c 41297 2672953 0 0 0 661388717 43980201490 182101291 29607440438
ec8p2 20925860279 5130376 0 0 0 1271950 69476880 5405703 21846811927
ec8p2c 155750819 39791 0 0 0 17251650 707945812 10780396 6777113360
ssds 529073550 132527 27 0 0 911181397 11712297123 19929864890 123650540907
total used 73286019476 14419914
total avail 18734471448
total space 92020916908

#2 Updated by Laurent GUERBY about 3 years ago

The same issue was reported one year ago, with no answer on the list:

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-September/043206.html

#3 Updated by Laurent GUERBY about 3 years ago

We have leftover objects showing up in rados ls, but rados get fails on them while the shard files are present on disk.

root@g1:~# rados -p ec4p1 ls|grep rbd_data.172aba92ff8cf52.00000000000000aa
rbd_data.172aba92ff8cf52.00000000000000aa
root@g1:~# ceph osd map ec4p1 rbd_data.172aba92ff8cf52.00000000000000aa
osdmap e180913 pool 'ec4p1' (80) object 'rbd_data.172aba92ff8cf52.00000000000000aa' -> pg 80.7a88602b (80.2b) -> up ([28,35,16,33,39], p28) acting ([28,35,16,33,39], p28)
root@g1:~# ceph pg dump|grep "80.2b"|tail -1
pg_stat objects mip     degr    misp    unf     bytes   log     disklog state   state_stamp     v       reported        up      up_primary      acting  acting_primary  last_scrub      scrub_stamp     last_deep_scrub deep_scrub_stamp
80.2b        1122        0        0        0        0        4624171008        2335        2335        active+clean        2015-11-19 06:58:22.633854        89234'2335        180913:7904        [28,35,16,33,39]        28        [28,35,16,33,39]        28        89234'2335        2015-11-19 06:58:22.633514        89234'2335        2015-11-19 06:58:22.633514
root@g1:~# rados -p ec4p1 get rbd_data.172aba92ff8cf52.00000000000000aa out1
error getting ec4p1/rbd_data.172aba92ff8cf52.00000000000000aa: (2) No such file or directory
root@g1:~# rados -p ec4p1 listxattr rbd_data.172aba92ff8cf52.00000000000000aa
error getting xattr set ec4p1/rbd_data.172aba92ff8cf52.00000000000000aa: (2) No such file or directory
root@g1:~# rados -p ec4p1 listomapvals rbd_data.172aba92ff8cf52.00000000000000aa
error getting omap keys ec4p1/rbd_data.172aba92ff8cf52.00000000000000aa: (2) No such file or directory
root@g6:/var/lib/ceph/osd/ceph-28/current/80.2bs0_head# ls -l $(find . -name '*172aba92ff8cf52.00000000000000aa*')
-rw-r--r-- 1 root root 1048576 Oct 29 13:42 ./DIR_B/DIR_2/DIR_0/DIR_6/rbd\udata.172aba92ff8cf52.00000000000000aa__head_7A88602B__50_ffffffffffffffff_0
root@g9:/var/lib/ceph/osd/ceph-35/current/80.2bs1_head# ls -l $(find . -name '*172aba92ff8cf52.00000000000000aa*')
-rw-r--r-- 1 root root 1048576 Sep 30 07:26 ./DIR_B/DIR_2/DIR_0/DIR_6/rbd\udata.172aba92ff8cf52.00000000000000aa__head_7A88602B__50_ffffffffffffffff_1
root@g3:/var/lib/ceph/osd/ceph-16/current/80.2bs2_head# ls -l $(find . -name '*172aba92ff8cf52.00000000000000aa*')
-rw-r--r-- 1 root root 1048576 Jul  7 14:15 ./DIR_B/DIR_2/DIR_0/DIR_6/rbd\udata.172aba92ff8cf52.00000000000000aa__head_7A88602B__50_ffffffffffffffff_2
root@g9:/var/lib/ceph/osd/ceph-33/current/80.2bs3_head# ls -l $(find . -name '*172aba92ff8cf52.00000000000000aa*')
-rw-r--r-- 1 root root 1048576 Sep 30 07:26 ./DIR_B/DIR_2/DIR_0/DIR_6/rbd\udata.172aba92ff8cf52.00000000000000aa__head_7A88602B__50_ffffffffffffffff_3
root@g10:/var/lib/ceph/osd/ceph-39/current/80.2bs4_head# ls -l $(find . -name '*172aba92ff8cf52.00000000000000aa*')
-rw-r--r-- 1 root root 1048576 Sep 30 07:26 ./DIR_B/DIR_2/DIR_0/DIR_6/rbd\udata.172aba92ff8cf52.00000000000000aa__head_7A88602B__50_ffffffffffffffff_4

#4 Updated by Mehdi Abaakouk about 3 years ago

I set up a cluster with a single OSD node to reproduce this and hit the issue easily. I did not use erasure coding, so the problem seems to be located in the cache tier code.

Steps:

$ sudo ceph osd crush rule dump
[ { "rule_id": 0,
"rule_name": "replicated_ruleset",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [ { "op": "take",
"item": -1,
"item_name": "default"}, { "op": "choose_firstn",
"num": 0,
"type": "osd"}, { "op": "emit"}]}]

$ sudo ceph osd pool create data 8 8 replicated replicated_ruleset
$ sudo ceph osd pool create cache 8 8 replicated replicated_ruleset
$ sudo ceph osd tier add data cache
$ sudo ceph osd tier cache-mode cache writeback
$ sudo ceph osd tier set-overlay data cache
$ sudo ceph osd pool set data size 1
$ sudo ceph osd pool set data min_size 1
$ sudo ceph osd pool set cache size 1
$ sudo ceph osd pool set cache min_size 1
$ sudo ceph osd pool set cache cache_target_dirty_ratio 0.4
$ sudo ceph osd pool set cache cache_target_full_ratio 0.8
$ sudo ceph osd pool set cache target_max_bytes 100000000000
$ sudo ceph osd pool set cache target_max_objects 100000000000
$ sudo ceph osd pool set cache hit_set_type bloom
$ sudo ceph osd pool set cache hit_set_count 1
$ sudo ceph osd pool set cache hit_set_period 3600
$ sudo ceph osd dump | grep ^pool
pool 7 'data' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 30 lfor 28 flags hashpspool tiers 8 read_tier 8 write_tier 8 stripe_width 0
pool 8 'cache' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 39 flags hashpspool,incomplete_clones tier_of 7 cache_mode writeback target_bytes 100000000000 target_objects 100000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0

$ dd if=/dev/urandom of=file bs=1M count=1000

$ find /var/lib/ceph | grep rbd.*data -c
0

$ sudo rbd -p data import --image-format 2 file file
$ find /var/lib/ceph | grep rbd.*data -c
250

$ sudo rbd -p data rm file
Removing image: 100% complete...done.
$ find /var/lib/ceph | grep rbd.*data -c
250

$ sudo ceph -s
cluster fa8db191-30ea-4dd4-be0b-fe75bf2f090b
health HEALTH_WARN too few pgs per osd (16 < min 20)
monmap e1: 1 mons at {master=192.168.3.51:6789/0}, election epoch 1, quorum 0 master
osdmap e39: 1 osds: 1 up, 1 in
pgmap v82: 16 pgs, 2 pools, 0 bytes data, 255 objects
144 MB used, 319 GB / 319 GB avail
16 active+clean

#5 Updated by Mehdi Abaakouk about 3 years ago

$ sudo rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
cache - 0 255 0 0 0 37 28 761 1024001
data - 0 0 0 0 0 0 0 0 0
total used 141288 255
total avail 335239192
total space 335380480

#6 Updated by Mehdi Abaakouk about 3 years ago

We can also see that neither the data nor the cache objects are deleted:

$ sudo rbd -p data import --image-format 2 file file
Importing image: 100% complete...done.
$ find /var/lib/ceph | grep rbd.*data -c
250

$ sudo rados -p cache cache-flush-evict-all
$ find /var/lib/ceph | grep -c rbd.*data
250

$ sudo rbd -p data rm file
Removing image: 100% complete...done.

$ find /var/lib/ceph | grep -c rbd.*data
500

$ sudo rados -p data ls | wc -l
253

$ sudo rados -p data ls | grep rbd_data.372b74b0dc51.0000000000000037
rbd_data.372b74b0dc51.0000000000000037

$ sudo rados -p data get rbd_data.372b74b0dc51.0000000000000037 -
error getting data/rbd_data.372b74b0dc51.0000000000000037: (2) No such file or directory

$ find /var/lib/ceph -path '*rbd*data.372b74b0dc51.0000000000000037*'
/var/lib/ceph/osd/ceph-0/current/1.7_head/rbd\udata.372b74b0dc51.0000000000000037__head_AD2CFB1F__1
/var/lib/ceph/osd/ceph-0/current/2.7_head/rbd\udata.372b74b0dc51.0000000000000037__head_AD2CFB1F__2

$ sudo rados df
pool name KB objects clones degraded unfound rd rd KB wr wr KB
cache 0 257 0 0 0 36 26 1015 1024001
data 1024001 253 0 0 0 0 0 253 1024001
total used 1168592 510
total avail 334211888
total space 335380480

#7 Updated by Mehdi Abaakouk about 3 years ago

Then I put the cache in forward mode:

$ sudo ceph osd tier cache-mode cache forward
$ sudo rados -p cache cache-flush-evict-all
$ sudo rados -p cache ls
# cache is empty
$ sudo rados -p data ls
# data is empty
$ find /var/lib/ceph | grep -c rbd.*data
0

All files have been deleted.

#8 Updated by Mehdi Abaakouk about 3 years ago

I'm just guessing, but it seems that when we delete an rbd image, the cache tier, for each RADOS object, loads it into the cache, empties the object in the cache only, marks it as deleted, and then does the same for the next one.

The real deletion is perhaps done during the next eviction cycle, if the object matches the eviction criteria. But since the object is emptied in the cache, the "target_max_bytes" threshold is hard to reach, so the eviction process never starts.

Is my intuition correct?

To me it looks weird to load the whole data content from the storage pool into the cache pool just to delete it, but anyway, that would explain why deleting an rbd image is so slow when a cache tier is used.

In our case, we have deleted around ~24 TB of rbd images, but if my guess is right the cache still holds the list of all the RADOS objects pending deletion.

We finally launched the eviction manually on our cluster, and we can now see the used space on our OSDs slowly going down.

Our first approximation is that we now have to wait 17 days for the data to really be deleted, after having already waited 15 days for all the 'rbd rm' commands to complete.

Perhaps there is room for an optimization in the case of RADOS object deletion through a cache tier, so that the whole object is not loaded just to be emptied (if my guess is true)?
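The guessed accounting above reduces to a toy calculation (our assumption about the mechanism, not confirmed Ceph internals): if every deleted object survives as a 0-byte whiteout in the cache tier, the byte count compared against target_max_bytes never grows, so byte-based eviction never triggers no matter how many whiteouts accumulate.

```shell
# Toy model of the guessed whiteout accounting (assumed, not confirmed):
target_max_bytes=100000000000      # the value set in the reproduction above
pending_whiteouts=2672953          # leftover objects after our rbd rm runs
bytes_per_whiteout=0               # deleted objects are emptied in the cache
bytes_counted=$((pending_whiteouts * bytes_per_whiteout))
echo "counted toward target_max_bytes: $bytes_counted of $target_max_bytes"
```

However large pending_whiteouts gets, bytes_counted stays at zero, which would match the behaviour we observe.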

#9 Updated by Loic Dachary about 3 years ago

  • Description updated (diff)

#10 Updated by Samuel Just about 3 years ago

  • Status changed from New to Rejected

In current master, we'll actually proxy the delete operation to the base pool rather than promote, so this should work better. Even in this version though, it's simply that the whiteout objects haven't been flushed from the cache tier to the base pool. If you change the configs to force faster flushing, you should see the objects be deleted.
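For reference, a hedged sketch of what "force faster flushing" could look like, using only knobs that already appear in the reproduction steps above; the option names come from this ticket, but the values below are illustrative assumptions, not tuned recommendations.

```shell
# Illustrative values only: lower the flush/evict thresholds on the cache
# pool so dirty objects (and whiteouts) are pushed to the base pool sooner.
ceph osd pool set cache cache_target_dirty_ratio 0.0   # start flushing immediately
ceph osd pool set cache cache_target_full_ratio 0.1    # evict clean objects early
ceph osd pool set cache target_max_objects 1000        # a low object cap also triggers eviction
```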

#11 Updated by Laurent GUERBY about 3 years ago

How do you force faster flushing in hammer?

Also if it's not fixed in the hammer branch I think it should be documented.

#12 Updated by Igor Fedotov about 3 years ago

I made some investigations on the topic in current master. Here are my observations.

I added some additional logging to EC decoding (see the attached diff), put a 32 MB RADOS object into the cache pool, waited for the cache flush, and removed it.
See the attached 'cmds' file for the corresponding pool creation and object modification operations.

Then I analyzed the logs for the time period when the object removal took place [ 20:14:47 - 20:14:54 ], and I can see copy-get requests and EC decoding artifacts during that window, e.g. the following lines:

2015-12-03 20:14:54.759351 7fc22828d700 0 finish:8388608
...
2015-12-03 20:14:54.760020 7fc22828d700 1 -- 127.0.0.1:6804/38644 --> 127.0.0.1:0/38863 -- osd_op_reply(5 test [copy-get max 8388608] v0'0 uv9 ondisk = 0) v6 -- ?+0 0x7fc249fdf440 con 0x7fc24a388180
...
2015-12-03 20:14:54.811327 7fc225a88700 0 finish:8388608
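A quick way to pull these artifacts out of the attached OSD logs is a simple grep; here it is run against an inline copy of the two lines quoted above, so the filter itself can be checked without the attachments:

```shell
# Filter promotion evidence (copy-get ops and EC decode "finish" lines)
# from an OSD log; sample.log holds the two log lines quoted above.
cat > sample.log <<'EOF'
2015-12-03 20:14:54.759351 7fc22828d700 0 finish:8388608
2015-12-03 20:14:54.760020 7fc22828d700 1 -- 127.0.0.1:6804/38644 --> 127.0.0.1:0/38863 -- osd_op_reply(5 test [copy-get max 8388608] v0'0 uv9 ondisk = 0) v6 -- ?+0 0x7fc249fdf440 con 0x7fc24a388180
EOF
grep -cE 'copy-get|finish:' sample.log
```

Run on the full osd.*.log.part files, the same pattern should surface any promotion activity during the removal window.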

#13 Updated by Igor Fedotov about 3 years ago

Please find appropriate logs attached too.

IMO this shows that object promotion is still present in the master branch.

#14 Updated by Igor Fedotov about 3 years ago

More detailed logs with the following debug settings enabled:
debug osd = 25
debug filestore = 20
debug ms = 1

#15 Updated by Laurent GUERBY about 3 years ago

The issue we're facing is that the size reported by all the ceph tools is completely wrong; from my ticket:

pool name KB objects
ec4p1c 41297 2672953

2.6 million objects taking 41 megabytes according to ceph, but about 10 terabytes on disk.

As most ceph documentation suggests setting target_max_bytes, there will be no eviction at all on rbd rm until it's too late, as we found out (OSDs will die from full disks, ...).

The only way to prevent users from running into this issue is either to fix the byte-counting bug for cache-promoted "rm"ed objects, or to tell users for years to come to use exclusively target_max_objects instead of target_max_bytes to control caching, with target_max_objects based on their estimate of the available disk space and average object size.

Not fixing this issue will cause endless trouble for users for years to come.
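The target_max_objects sizing suggested above can be sketched as follows; all numbers here are assumed for illustration (10 TiB usable cache space, default 4 MiB rbd objects), not measurements from this cluster.

```shell
# Assumed example: derive target_max_objects from estimated cache capacity
# and average object size, instead of relying on the broken byte counting.
avail_bytes=$((10 * 1024 * 1024 * 1024 * 1024))   # assume ~10 TiB usable in the cache tier
avg_object_bytes=$((4 * 1024 * 1024))             # default 4 MiB rbd object size
target_max_objects=$((avail_bytes * 8 / 10 / avg_object_bytes))  # keep ~20% headroom
echo "$target_max_objects"
# then, e.g.: ceph osd pool set <cachepool> target_max_objects "$target_max_objects"
```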

#16 Updated by Samuel Just about 3 years ago

Igor: do you have hitsets configured? It looks like it always promotes if hitsets are disabled.

#17 Updated by Samuel Just about 3 years ago

  • Status changed from Rejected to Need More Info

Laurent, I agree that the size bug is a problem, can you open a new ticket specifically for that? This ticket is about failing to delete objects at all and proxying deletes.

#18 Updated by Igor Fedotov about 3 years ago

Samuel,
I have only the following hitset setting:
./ceph -c ceph.conf osd pool set cachepool hit_set_type bloom

Please find the complete set of options in the 'cmds' file attached above.

#19 Updated by Igor Fedotov almost 3 years ago

PR for delete proxying is at
https://github.com/ceph/ceph/pull/7537

#20 Updated by Sage Weil almost 3 years ago

  • Status changed from Need More Info to Testing

#21 Updated by Erwan Velu almost 3 years ago

It sounds like 10.1.0 features a fix for this ticket.

Laurent, Mehdi, do you feel like testing it to validate that it solves your issue?

#22 Updated by Sage Weil almost 2 years ago

  • Status changed from Testing to Resolved
