Bug #48103

Automatic bilog trimming fails (multisite, Ceph Nautilus 14.2.9)

Added by David Piper over 3 years ago. Updated almost 3 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Issue
=====
We're seeing a problem in our multisite Ceph deployment, where automatic bilog trimming has stopped across our buckets.

We're running two RGW zones in the same zonegroup, on Ceph Nautilus 14.2.9, with the services running in containers. In each zone we have 3 hosts, with 1 OSD on each host.

Expectation
===========

Our understanding is that trimming should run automatically every 20 minutes, on up to 16 randomly selected buckets, trimming a maximum of 1000 log entries per shard on each pass. We've convinced ourselves that the activity on our buckets should not be causing bilogs to grow faster than this process can trim them, and initially the bilogs are held at a steady size. Our expectation is that this continues.
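For completeness, these knobs can also be checked on a live RGW daemon (it's the RGW daemons that do the trimming); a sketch, with the container name and admin socket path left as placeholders for our deployment:

$ sudo docker exec <rgw-container> ceph daemon <path-to-rgw-admin-socket> config show | grep rgw_sync_log_trim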

Observed behaviour
==================

At some point, several days after deploying the cluster, bilog trimming appears to stop, and our bilog indexes slowly start to grow. Bilogs accumulate over time, eventually triggering large OMAP object warnings for the index objects of these buckets.
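The warnings surface through the normal cluster health checks, e.g. (output elided):

$ ceph health detail | grep -i 'large omap'

and the OSD/cluster logs should also record a "Large omap object found." line naming the offending index object.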

In every case, Ceph reports that the bucket is in sync and the data is consistent across both sites.

We've discovered that we can manually run 'radosgw-admin bilog autotrim' to bring the bilog counts down, on all of the buckets, by 1000 per shard. This is the process we thought should happen automatically. One option for us is to set up a cron job to run this command regularly; however, the implications of doing so aren't clear.
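If we did adopt the cron approach, it would be something like the sketch below (the schedule, container name and log path are placeholders; we haven't validated that this is safe, hence the question further down):

# /etc/cron.d/rgw-bilog-autotrim (sketch only)
*/20 * * * * root docker exec <rgw-container> radosgw-admin bilog autotrim >> /var/log/rgw-bilog-autotrim.log 2>&1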

Are there additional diagnostics we can collect to confirm whether this is a bug, or to explain why auto trimming has stopped?

Is it safe to manually run 'radosgw-admin bilog autotrim' on a regular basis?
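To make the first question concrete, this is the kind of thing we could collect if it would help; a sketch, with the socket and log paths left as placeholders:

$ sudo docker exec <rgw-container> ceph daemon <path-to-rgw-admin-socket> config set debug_rgw 20
$ sudo docker exec <rgw-container> grep -i trim /var/log/ceph/<rgw-log-file>
$ radosgw-admin sync error list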

Config
======

Checking the running config on the mon service, we're running with the following config:

"rgw_sync_log_trim_concurrent_buckets": "4",
"rgw_sync_log_trim_interval": "1200",
"rgw_sync_log_trim_max_buckets": "16",
"rgw_sync_log_trim_min_cold_buckets": "4",

I've included an example below for one such affected bucket, showing its current state. Zone details (as per 'radosgw-admin zonegroup get') are at the bottom.

$ radosgw-admin bucket sync status --bucket=edin2z6-sharedconfig
          realm b7f31089-0879-4fa2-9cbc-cfdf5f866a35 (geored_realm)
      zonegroup 5d74eb0e-5d99-481f-ae33-43483f6cebc0 (geored_zg)
           zone c48f33ad-6d79-4b9f-a22f-78589f67526e (siteA)
         bucket edin2z6-sharedconfig[033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1]

    source zone 0a3c29b7-1a2c-432d-979b-d324a05cc831 (siteApubsub)
                full sync: 0/1 shards
                incremental sync: 0/1 shards
                bucket is caught up with source
    source zone 9f5fba56-4a32-46a6-8695-89253be81614 (siteB)
                full sync: 0/1 shards
                incremental sync: 1/1 shards
                bucket is caught up with source
    source zone c72b3aa8-a051-4665-9421-909510702412 (siteBpubsub)
                full sync: 0/1 shards
                incremental sync: 0/1 shards
                bucket is caught up with source

$ radosgw-admin bilog list --bucket edin2z6-sharedconfig --max-entries 600000000 | grep op_id | wc -l
1299392
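That's ~1.3 M bilog entries on a single-shard bucket. Back-of-envelope, using our reading of the settings above: trimming removes at most 1000 entries per shard per pass, with one pass every 1200 s, so even if this bucket were selected on every pass, draining the backlog would take roughly 1,300,000 / 1,000 = 1,300 passes, i.e. about 18 days. The size of the backlog alone suggests trimming has stopped rather than merely fallen behind.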

$ rados -p siteA.rgw.buckets.index listomapkeys .dir.033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1 | wc -l
1299083
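(Our understanding, for context: bilog entries are stored in the omap of the same .dir.* index object as the bucket index itself, which is why the omap key count tracks the bilog entry count and why the large OMAP warnings fire on the index pool.)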

$ radosgw-admin bucket stats --bucket=edin2z6-sharedconfig
{
    "bucket": "edin2z6-sharedconfig",
    "num_shards": 0,
    "tenant": "",
    "zonegroup": "5d74eb0e-5d99-481f-ae33-43483f6cebc0",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": ""
    },
    "id": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1",
    "marker": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1",
    "index_type": "Normal",
    "owner": "edin2z6",
    "ver": "0#1622676",
    "master_ver": "0#0",
    "mtime": "2020-01-14 14:30:18.606142Z",
    "max_marker": "0#00001622675.2115836.5",
    "usage": {
        "rgw.main": {
            "size": 15209,
            "size_actual": 40960,
            "size_utilized": 15209,
            "size_kb": 15,
            "size_kb_actual": 40,
            "size_kb_utilized": 15,
            "num_objects": 7
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}

$ radosgw-admin bucket limit check
...
    {
        "bucket": "edin2z6-sharedconfig",
        "tenant": "",
        "num_objects": 7,
        "num_shards": 0,
        "objects_per_shard": 7,
        "fill_status": "OK"
    },
...

$ radosgw-admin zonegroup get
+ sudo docker ps --filter name=ceph-rgw-.*rgw -q
+ sudo docker exec d2c999b1f3f8 radosgw-admin zonegroup get
{
    "id": "5d74eb0e-5d99-481f-ae33-43483f6cebc0",
    "name": "geored_zg",
    "api_name": "geored_zg",
    "is_master": "true",
    "endpoints": [
        "https://10.254.2.93:7480"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "c48f33ad-6d79-4b9f-a22f-78589f67526e",
    "zones": [
        {
            "id": "0a3c29b7-1a2c-432d-979b-d324a05cc831",
            "name": "siteApubsub",
            "endpoints": [
                "https://10.254.2.93:7481",
                "https://10.254.2.94:7481",
                "https://10.254.2.95:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteA"
            ],
            "redirect_zone": ""
        },
        {
            "id": "9f5fba56-4a32-46a6-8695-89253be81614",
            "name": "siteB",
            "endpoints": [
                "https://10.254.2.224:7480",
                "https://10.254.2.225:7480",
                "https://10.254.2.226:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "c48f33ad-6d79-4b9f-a22f-78589f67526e",
            "name": "siteA",
            "endpoints": [
                "https://10.254.2.93:7480",
                "https://10.254.2.94:7480",
                "https://10.254.2.95:7480"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": [],
            "redirect_zone": ""
        },
        {
            "id": "c72b3aa8-a051-4665-9421-909510702412",
            "name": "siteBpubsub",
            "endpoints": [
                "https://10.254.2.224:7481",
                "https://10.254.2.225:7481",
                "https://10.254.2.226:7481"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "pubsub",
            "sync_from_all": "false",
            "sync_from": [
                "siteB"
            ],
            "redirect_zone": ""
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": [],
            "storage_classes": [
                "STANDARD"
            ]
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "b7f31089-0879-4fa2-9cbc-cfdf5f866a35"
}

#1

Updated by Greg Farnum almost 3 years ago

  • Project changed from Ceph to rgw
