Bug #64418 (closed): RGW garbage collection stuck and growing

Added by Gregory Orange 3 months ago. Updated 3 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: gc
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a GC list of over 5 million entries, and it no longer shrinks.
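
For reference, a rough count can be taken with the standard admin command (the grep is just one way to count entries; exact JSON field names may differ by release):

# list all pending GC entries, including not-yet-expired ones, and count them
radosgw-admin gc list --include-all | grep -c '"tag"'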

Context:
  • Quincy, 2700 OSDs across over 100 nodes
  • EC 8+3 HDD data pool
  • all other pools (log, control, meta, buckets.index) on NVMe
  • 9x RGW nodes behind haproxy with rgw_enable_gc_threads = false
  • 3x RGW ('RGC') nodes, not accessible to users, with rgw_enable_gc_threads = true (see the config sketch below)
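
A minimal sketch of that split, assuming daemon ids like client.rgw.web1 and client.rgw.gc1 (ours differ):

# user-facing gateways: GC threads off
ceph config set client.rgw.web1 rgw_enable_gc_threads false
# dedicated RGC gateways: GC threads on
ceph config set client.rgw.gc1 rgw_enable_gc_threads true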

With this stripped down to 1 RGC node, lsof shows over 2 entries (see the correction in comment #1: over 2 million), which drops to a few thousand when the daemon is stopped.
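
For reference, the count can be taken with something along these lines (assuming the daemon shows up as radosgw on the node):

# count entries lsof reports for the running radosgw process
lsof -p "$(pidof radosgw)" | wc -l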

With debug_rgw at 20/20 we get logs like:

2024-02-13T22:28:11.938+0800 7f5e327fc700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:c0a34c69-bce9-4053-9ead-8ee81faae4c1.146573.13__shadow_1338829496_731829_ms.tar.2~aqFdnrOgYMEBSY6vl64PQ49HiMQCNAH.2036_1
...
20 rgw reshard worker thread: processing logshard = reshard.0000000000
20 rgw reshard worker thread: finish processing logshard = reshard.0000000000
...
2024-02-13T22:33:41.511+0800 7f5e31ffb700 20 rgw object expirer Worker thread: processing shard = obj_delete_at_hint.0000000119

We see a few thousand 'process removing' lines shortly after restarting the daemon, then no more.
A few dozen pairs of 'reshard worker thread' lines follow, then a few dozen pairs of 'object expirer' lines.
After that, silence, except for the periodic "RGWDataChangesLog::ChangesRenewThread: start".
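
For reference, the debug level can be raised on a running daemon roughly like this (the admin socket path and the <id> are placeholders, not our actual names):

# raise RGW logging on the live daemon via its admin socket
ceph daemon /var/run/ceph/ceph-client.rgw.<id>.asok config set debug_rgw 20/20
# or persistently through the monitors
ceph config set client.rgw.<id> debug_rgw 20/20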

I will also post the whole log file.

How can we get a handle on our GC and get it processed? Or how can we extract more information about why it is not being processed?
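
For reference, the queue can also be driven manually with the standard admin command, e.g.:

# run garbage collection in the foreground, including entries whose
# deferral time has not yet expired
radosgw-admin gc process --include-all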



Related issues (1 open, 0 closed)

Related to rgw - Bug #64527: Radosgw 504 timeouts & Garbage collection is frozen (New)

Actions #1

Updated by Gregory Orange 3 months ago

correction:
-over 2 entries
+over 2 million entries

Actions #2

Updated by Gregory Orange 3 months ago

Here is the last 1MB of the log file at 20/20.

It's also worth noting that the RGW nodes are sometimes giving 504 gateway timeout errors when reading or writing to the cluster.

Actions #3

Updated by Casey Bodley 3 months ago

  • Tags set to gc
Actions #4

Updated by Gregory Orange 3 months ago

Against every PG we ran:
rados --pgid $id ls

We found some that didn't respond. ceph pg map on those showed a common OSD as the first in the list.
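
A sketch of such a probing loop, assuming jq is available and that ceph pg ls -f json exposes pg_stats[].pgid (field names may differ by release); the 30-second timeout is arbitrary:

# probe every PG; flag any whose object listing hangs, then map it to its OSDs
for pg in $(ceph pg ls -f json | jq -r '.pg_stats[].pgid'); do
    timeout 30 rados --pgid "$pg" ls > /dev/null 2>&1 \
        || { echo "unresponsive: $pg"; ceph pg map "$pg"; }
done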
We restarted that OSD, and now we have:
  • the GC queue drained from 5.5 million to 0 over about 12 hours
  • RGW 504 gateway timeouts have disappeared
  • reads and writes all complete without failure
  • bucket listing is successful too

So the problem is not RGW at all, but an OSD health issue that the cluster's health checks did not flag. This can be closed.

Actions #5

Updated by Casey Bodley 3 months ago

  • Status changed from New to Can't reproduce
Actions #6

Updated by Casey Bodley 3 months ago

  • Related to Bug #64527: Radosgw 504 timeouts & Garbage collection is frozen added