Bug #37734
Librgw doesn't GC deleted object correctly
Description
Hi, I have recently been working on NFS and found a bug in the Librgw GC process. Steps to reproduce:
1. Set rgw_num_rados_handles = 10 (keep this value as small as possible so the problem is easy to see). Also set the rgw_gc parameters to small values, so the gc list can be checked soon after deletion.
2. Create a bucket and export it with NFS-Ganesha.
3. Mount the export locally.
4. Copy 6 local files (1GB each) to the mounted path; the files should then be uploaded to the bucket.
5. Once all of the 1GB files have been written successfully, remove them.
6. Check cluster usage with 'rados df'. Some data still exists in the pool (default.rgw.buckets.data), even though all of the files have been deleted and gc list = [].
Here is some trace info:
rados df before uploading
[root@node1 ~]# rados df
POOL_NAME                   USED    OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD      WR_OPS     WR
.rgw.root                   7216    21       0       63      0                   0        0         664914     432M    107        62464
default.rgw.buckets.data    21028M  6304     0       18912   0                   0        0         31198      160M    66520      100803M
default.rgw.buckets.index   0       3584     0       10752   0                   0        0         5742379    5608M   48338      0
default.rgw.buckets.non-ec  0       6        0       18      0                   0        0         133        86016   137        0
default.rgw.control         0       512      0       1536    0                   0        0         0          0       0          0
default.rgw.log             71273   1737     0       5211    0                   0        0         220572996  210G    147050225  1306k
default.rgw.meta            6477    38       0       114     0                   0        0         504132     411M    4873       1278k
fs_data                     4157M   532      0       1596    0                   0        0         18119      72473k  1244384    653G
fs_metadata                 29189k  29       0       87      0                   0        0         47         60416   906        29804k
total_objects 12763
total_used 83599M
total_avail 6438G
total_space 6519G
rados df after uploading
[root@node1 ~]# rados df
POOL_NAME                   USED    OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD      WR_OPS     WR
.rgw.root                   7216    21       0       63      0                   0        0         665100     433M    107        62464
default.rgw.buckets.data    27172M  7841     0       23523   0                   0        0         31282      160M    68253      104G
default.rgw.buckets.index   0       3584     0       10752   0                   0        0         5791282    5656M   48450      0
default.rgw.buckets.non-ec  0       6        0       18      0                   0        0         133        86016   137        0
default.rgw.control         0       512      0       1536    0                   0        0         0          0       0          0
default.rgw.log             71273   1737     0       5211    0                   0        0         220602291  210G    147069518  1306k
default.rgw.meta            6477    38       0       114     0                   0        0         506865     413M    4885       1278k
fs_data                     4157M   532      0       1596    0                   0        0         18119      72473k  1244384    653G
fs_metadata                 29189k  29       0       87      0                   0        0         47         60416   906        29804k
total_objects 14300
total_used 102027M
total_avail 6420G
total_space 6519G
rados df after deleting
[root@node1 ~]# rados df
POOL_NAME                   USED    OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD      WR_OPS     WR
.rgw.root                   7216    21       0       63      0                   0        0         665367     433M    107        62464
default.rgw.buckets.data    22048M  6560     0       19680   0                   0        0         32649      160M    69534      104G
default.rgw.buckets.index   0       3584     0       10752   0                   0        0         5792813    5658M   48474      0
default.rgw.buckets.non-ec  0       6        0       18      0                   0        0         133        86016   137        0
default.rgw.control         0       512      0       1536    0                   0        0         0          0       0          0
default.rgw.log             71273   1737     0       5211    0                   0        0         220614002  210G    147074013  1306k
default.rgw.meta            6477    38       0       114     0                   0        0         507762     414M    4885       1278k
fs_data                     4157M   532      0       1596    0                   0        0         18119      72473k  1244384    653G
fs_metadata                 29189k  29       0       87      0                   0        0         47         60416   906        29804k
total_objects 13019
total_used 86703M
total_avail 6435G
total_space 6519G
As you can see, only 5GB of data was deleted; 1GB still remains.
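The arithmetic behind that observation can be checked with a minimal sketch. The numbers are copied from the default.rgw.buckets.data rows of the three 'rados df' outputs above; the variable names and the delta helper are only illustrative.

```python
# USED (in MB) and OBJECTS for default.rgw.buckets.data, copied from
# the three `rados df` outputs in this report.
before = {"used_mb": 21028, "objects": 6304}
after_upload = {"used_mb": 27172, "objects": 7841}
after_delete = {"used_mb": 22048, "objects": 6560}

def delta(a, b):
    """Per-field difference b - a (illustrative helper)."""
    return {key: b[key] - a[key] for key in a}

uploaded = delta(before, after_upload)   # what the 6 files added
leftover = delta(before, after_delete)   # what is still there after deletion

print("uploaded:", uploaded)
print("leftover:", leftover)
```

The upload added 6144 MB (6 x 1GB), and 1020 MB in 256 objects is still present after the delete, which is roughly one file's worth of data.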
I also printed each object's unique tag:
obj1
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502217.0
RGW GC adding chain
obj2
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502208.0
RGW GC adding chain
obj3
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502219.0
RGW GC adding chain
obj4
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502220.0
RGW GC adding chain
obj5
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502214.0
RGW GC adding chain
obj6
RGW GC chain size: 255, with tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502214.0
RGW GC adding chain
obj5 and obj6 have the same tail tag: b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502214.0
This tag can be divided into 3 parts: zone_param_id.rgw_rados_handle_id.RGW_Request_id
obj5 and obj6 seem to share the same rgw_rados_handle, and the RGW request ID is always 0. I think this is the main reason the RGW GC thread gets confused.
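To make the collision explicit, here is a small sketch that splits the logged tail tags into the three parts described above and groups objects by tag. The tags are copied from the log output in this report; split_tag and the field names are hypothetical and only follow the description given here, not any RGW API.

```python
from collections import defaultdict

# Tail tags copied from the "RGW GC chain size" log lines in this report.
tail_tags = {
    "obj1": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502217.0",
    "obj2": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502208.0",
    "obj3": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502219.0",
    "obj4": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502220.0",
    "obj5": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502214.0",
    "obj6": "b2b41852-866c-4bab-9160-3e1a1b5d7f81.6502214.0",
}

def split_tag(tag):
    """Split a tag into (zone id, rados handle id, request id).

    Hypothetical helper: the field meaning is taken from the report's
    description. The last two dot-separated fields are numeric.
    """
    zone_id, handle_id, request_id = tag.rsplit(".", 2)
    return zone_id, int(handle_id), int(request_id)

# Group object names by their full tail tag.
by_tag = defaultdict(list)
for name, tag in tail_tags.items():
    by_tag[tag].append(name)

# Tags shared by more than one object.
collisions = {tag: names for tag, names in by_tag.items() if len(names) > 1}
# Distinct request ids seen across all tags.
request_ids = {split_tag(tag)[2] for tag in tail_tags.values()}

print("colliding tags:", collisions)
print("distinct request ids:", request_ids)
```

With these six tags, the only collision is the one shared by obj5 and obj6, and the request ID field is 0 in every tag, so the tag is unique only as long as each object goes through a different handle.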
Updated by Tao CHEN over 5 years ago
Here is the patch:
https://github.com/ceph/ceph/pull/25664
Updated by Matt Benjamin almost 5 years ago
OK, I think I get this. That said, using more than one rados handle is not at all recommended; still, the fix looks acceptable.
Updated by Casey Bodley almost 5 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 28108
Updated by Matt Benjamin almost 5 years ago
- Backport set to nautilus, mimic, luminous
Updated by Matt Benjamin almost 5 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler almost 5 years ago
- Copied to Backport #40106: mimic: Librgw doesn't GC deleted object correctly added
Updated by Nathan Cutler almost 5 years ago
- Copied to Backport #40107: nautilus: Librgw doesn't GC deleted object correctly added
Updated by Nathan Cutler almost 5 years ago
- Copied to Backport #40108: luminous: Librgw doesn't GC deleted object correctly added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".