Bug #64418
closed
RGW garbage collection stuck and growing
Description
We have a gc list with over 5 million entries, and it no longer shrinks.
Context:
- Quincy, 2700 OSDs on over 100 nodes
- EC8+3 HDD data pool
- all other pools (log, control, meta, buckets.index) on NVMe
- 9x RGW nodes behind haproxy with rgw_enable_gc_threads = false
- 3x RGW ('RGC') nodes not accessible to users with rgw_enable_gc_threads = true
Stripping this down to 1 RGC node, lsof shows over 2 entries, which drops to a few thousand with the daemon stopped.
With debug_rgw at 20/20 we get logs like:
2024-02-13T22:28:11.938+0800 7f5e327fc700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:c0a34c69-bce9-4053-9ead-8ee81faae4c1.146573.13__shadow_1338829496_731829_ms.tar.2~aqFdnrOgYMEBSY6vl64PQ49HiMQCNAH.2036_1
...
20 rgw reshard worker thread: processing logshard = reshard.0000000000
20 rgw reshard worker thread: finish processing logshard = reshard.0000000000
...
2024-02-13T22:33:41.511+0800 7f5e31ffb700 20 rgw object expirer Worker thread: processing shard = obj_delete_at_hint.0000000119
A few thousand 'process removing' lines shortly after restarting the daemon, and then no more.
A few dozen pairs of 'reshard worker thread' after that.
A few dozen pairs of 'object expirer' after that.
Then, silence, except for the periodic "RGWDataChangesLog::ChangesRenewThread: start".
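To see exactly when the 'process removing' lines stop, the log can be tallied per minute. This is only a rough sketch: the function name gc_removals_per_minute is made up for illustration, and it assumes every log line starts with an ISO timestamp as in the excerpt above.

```shell
# Count 'RGWGC::process removing' lines per minute in an RGW log.
# Assumes each line begins with a timestamp like 2024-02-13T22:28:11.938+0800.
gc_removals_per_minute() {
    # $1: path to the RGW log file (supplied by the caller)
    awk '/RGWGC::process removing/ { n[substr($1, 1, 16)]++ }  # key: YYYY-MM-DDTHH:MM
         END { for (m in n) print m, n[m] }' "$1" | sort
}
```

Calling it with the path of the RGC node's log prints one "minute count" pair per line. Minutes with no removals do not appear at all, so the last line printed marks where gc activity ended.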
I will also post the whole log file.
How can we get a handle on our gc and get it processed? Or, extract more information about why it is not being processed?
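For reference, one way to track the size of the gc list over time is to count the entries that radosgw-admin reports. The sketch below assumes the JSON output is pretty-printed with one "tag" key per entry on its own line; count_gc_entries is a made-up helper name, and listing several million entries may itself be slow.

```shell
# Count GC entries in `radosgw-admin gc list` output read from stdin.
# Assumes pretty-printed JSON with one "tag" key per entry.
count_gc_entries() {
    grep -c '"tag"'
}

# Usage on a node with admin access (not run here):
#   radosgw-admin gc list --include-all | count_gc_entries
#   radosgw-admin gc process --include-all   # attempt a manual gc pass
```

Comparing the count before and after a manual gc pass shows whether any entries are being processed at all.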
Updated by Gregory Orange 3 months ago
correction:
-over 2 entries
+over 2 million entries
Updated by Gregory Orange 3 months ago
Here is the last 1MB of the log file at 20/20.
It's also worth noting that the RGW nodes sometimes return 504 gateway timeout errors when reading from or writing to the cluster.
Updated by Gregory Orange 3 months ago
Against every PG we ran:
rados --pgid $id ls
We restarted that OSD, and now we have:
- gc list processed from 5.5 million entries to 0 over about 12 hours
- RGW 504 gateway timeouts have disappeared
- reads and writes all complete without failure
- bucket listing is successful too
So the problem is not with RGW at all, but with OSD health checking. This can be closed.
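The per-PG listing above could be scripted roughly as follows. This is a sketch under assumptions, not a record of what was actually run: it treats a listing that fails or does not finish within a time limit as the sign of a problem PG, and the names find_slow_pgs and RADOS and the 60 second default are made up for illustration.

```shell
# Read PG ids on stdin (one per line) and print those whose object listing
# fails or does not finish within the time limit.
find_slow_pgs() {
    limit=${1:-60}                     # seconds allowed per PG (assumed value)
    while read -r id; do
        [ -n "$id" ] || continue       # skip blank lines
        # </dev/null keeps the child from consuming the PG list on stdin
        if ! timeout "$limit" "${RADOS:-rados}" --pgid "$id" ls \
                >/dev/null 2>&1 </dev/null; then
            echo "$id"
        fi
    done
}

# Usage on an admin node (not run here; the pool name is taken from the log above):
#   ceph pg ls-by-pool default.rgw.buckets.data |
#       awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $1 }' | find_slow_pgs 60
#   ceph pg map <pgid>    # shows which OSDs serve a reported PG
```

Any PG id printed is a candidate for closer inspection of its acting OSDs. Running the listings one at a time keeps the extra load on the cluster small, at the cost of a long total run time on a large pool.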
Updated by Casey Bodley 3 months ago
- Related to Bug #64527: Radosgw 504 timeouts & Garbage collection is frozen added