Bug #64418 (closed): RGW garbage collection stuck and growing

Added by Gregory Orange 3 months ago. Updated 3 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags: gc
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a GC list of over 5 million entries, and it no longer shrinks.
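
For reference, a rough count can be taken with the standard admin command (the grep is just one way to count entries; exact JSON field names may differ by release):

# list all pending GC entries, including not-yet-expired ones, and count them
radosgw-admin gc list --include-all | grep -c '"tag"'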

Context:
  • Quincy, 2700 OSDs across over 100 nodes
  • EC 8+3 HDD data pool
  • all other pools (log, control, meta, buckets.index) on NVMe
  • 9x RGW nodes behind haproxy with rgw_enable_gc_threads = false
  • 3x RGW ('RGC') nodes, not accessible to users, with rgw_enable_gc_threads = true (see the config sketch below)
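
A minimal sketch of that split, assuming daemon ids like client.rgw.web1 and client.rgw.gc1 (ours differ):

# user-facing gateways: GC threads off
ceph config set client.rgw.web1 rgw_enable_gc_threads false
# dedicated RGC gateways: GC threads on
ceph config set client.rgw.gc1 rgw_enable_gc_threads true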

With this stripped down to 1 RGC node, lsof shows over 2 entries (see the correction in comment #1: over 2 million), which drops to a few thousand when the daemon is stopped.
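
For reference, the count can be taken with something along these lines (assuming the daemon shows up as radosgw on the node):

# count entries lsof reports for the running radosgw process
lsof -p "$(pidof radosgw)" | wc -l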

With debug_rgw at 20/20 we get logs like:

2024-02-13T22:28:11.938+0800 7f5e327fc700 5 garbage collection: RGWGC::process removing default.rgw.buckets.data:c0a34c69-bce9-4053-9ead-8ee81faae4c1.146573.13__shadow_1338829496_731829_ms.tar.2~aqFdnrOgYMEBSY6vl64PQ49HiMQCNAH.2036_1
...
20 rgw reshard worker thread: processing logshard = reshard.0000000000
20 rgw reshard worker thread: finish processing logshard = reshard.0000000000
...
2024-02-13T22:33:41.511+0800 7f5e31ffb700 20 rgw object expirer Worker thread: processing shard = obj_delete_at_hint.0000000119

We see a few thousand 'process removing' lines shortly after restarting the daemon, then no more.
A few dozen pairs of 'reshard worker thread' lines follow, then a few dozen pairs of 'object expirer' lines.
After that, silence, except for the periodic "RGWDataChangesLog::ChangesRenewThread: start".
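
For reference, the debug level can be raised on a running daemon roughly like this (the admin socket path and the <id> are placeholders, not our actual names):

# raise RGW logging on the live daemon via its admin socket
ceph daemon /var/run/ceph/ceph-client.rgw.<id>.asok config set debug_rgw 20/20
# or persistently through the monitors
ceph config set client.rgw.<id> debug_rgw 20/20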

I will also post the whole log file.

How can we get a handle on our GC and get it processed? Or how can we extract more information about why it is not being processed?
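
For reference, the queue can also be driven manually with the standard admin command, e.g.:

# run garbage collection in the foreground, including entries whose
# deferral time has not yet expired
radosgw-admin gc process --include-all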



Related issues (1 open, 0 closed)

Related to rgw - Bug #64527: Radosgw 504 timeouts & Garbage collection is frozen (New)

Actions #1

Updated by Gregory Orange 3 months ago

correction:
-over 2 entries
+over 2 million entries

Actions #2

Updated by Gregory Orange 3 months ago

Here is the last 1MB of the log file at 20/20.

It's also worth noting that the RGW nodes are sometimes giving 504 gateway timeout errors when reading or writing to the cluster.

Actions #3

Updated by Casey Bodley 3 months ago

  • Tags set to gc
Actions #4

Updated by Gregory Orange 3 months ago

Against every PG we ran:
rados --pgid $id ls

We found some that didn't respond. ceph pg map on those showed a common OSD as the first in the list.
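
A sketch of such a probing loop, assuming jq is available and that ceph pg ls -f json exposes pg_stats[].pgid (field names may differ by release); the 30-second timeout is arbitrary:

# probe every PG; flag any whose object listing hangs, then map it to its OSDs
for pg in $(ceph pg ls -f json | jq -r '.pg_stats[].pgid'); do
    timeout 30 rados --pgid "$pg" ls > /dev/null 2>&1 \
        || { echo "unresponsive: $pg"; ceph pg map "$pg"; }
done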
We restarted that OSD, and now we have:
  • the GC queue drained from 5.5 million to 0 over about 12 hours
  • RGW 504 gateway timeouts have disappeared
  • reads and writes all complete without failure
  • bucket listing is successful too

So the problem is not RGW at all, but an OSD health issue that the cluster's health checks did not flag. This can be closed.

Actions #5

Updated by Casey Bodley 3 months ago

  • Status changed from New to Can't reproduce
Actions #6

Updated by Casey Bodley 3 months ago

  • Related to Bug #64527: Radosgw 504 timeouts & Garbage collection is frozen added