Support #23839
RGW GC Stuck
Description
We are currently using Jewel 10.2.7 and have recently been experiencing issues with objects being deleted by the GC. After a bucket was unsuccessfully deleted with --purge-objects (the first error discussed below occurred), all of the RGWs occasionally become unresponsive and require a restart of the processes before they will accept requests again. On investigating the garbage collection, it has an enormous list whose length we are struggling to count, and it seems stuck on a particular object which is not updating, as shown in the logs below:
2018-04-23 15:16:04.101660 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.290071.4_XXXXXXX/XXXX/XX/XX/XXXXXXX.ZIP
2018-04-23 15:16:04.104231 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_1
2018-04-23 15:16:04.105541 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_2
2018-04-23 15:16:04.176235 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_3
2018-04-23 15:16:04.178435 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_4
2018-04-23 15:16:04.250883 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_5
2018-04-23 15:16:04.297912 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_6
2018-04-23 15:16:04.298803 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_7
2018-04-23 15:16:04.320202 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_8
2018-04-23 15:16:04.340124 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_9
2018-04-23 15:16:04.383924 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_10
2018-04-23 15:16:04.386865 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_11
2018-04-23 15:16:04.389067 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_12
2018-04-23 15:16:04.413938 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_13
2018-04-23 15:16:04.487977 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.bxz6tqhZzqZozTFkxPVspHfIhhVxaj5_14
2018-04-23 15:16:04.544235 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_1
2018-04-23 15:16:04.546978 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_2
2018-04-23 15:16:04.598644 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_3
2018-04-23 15:16:04.629519 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_4
2018-04-23 15:16:04.700492 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_5
2018-04-23 15:16:04.765798 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_6
2018-04-23 15:16:04.772774 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_7
2018-04-23 15:16:04.846379 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_8
2018-04-23 15:16:04.935023 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_9
2018-04-23 15:16:04.937229 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_10
2018-04-23 15:16:04.968289 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_11
2018-04-23 15:16:05.005194 7f1fdcc29a00 0 gc::process: removing .rgw.buckets:default.175209462.16__shadow_.06ry24pXQW8yH8EJpoqjEtZF6M6tiUv_12
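To size the backlog rather than eyeballing the log, the JSON from `radosgw-admin gc list --include-all` can be counted programmatically. This is only a sketch: the sample below assumes a Jewel-era output shape (an array of tagged entries, each with an "objs" list), which should be checked against your own cluster's output.

```python
import json

# Hypothetical sample of `radosgw-admin gc list --include-all` output;
# the exact JSON shape is an assumption, check it on your cluster first.
sample = '''
[
  {
    "tag": "default.175209462.16_tag1",
    "objs": [
      {"pool": ".rgw.buckets", "oid": "default.175209462.16__shadow_x_1"},
      {"pool": ".rgw.buckets", "oid": "default.175209462.16__shadow_x_2"}
    ]
  },
  {
    "tag": "default.290071.4_tag2",
    "objs": [
      {"pool": ".rgw.buckets", "oid": "default.290071.4_some_object"}
    ]
  }
]
'''

def count_gc_backlog(gc_list_json):
    """Count GC entries (tags) and the total RADOS objects awaiting deletion."""
    entries = json.loads(gc_list_json)
    total_objs = sum(len(e.get("objs", [])) for e in entries)
    return len(entries), total_objs

tags, objs = count_gc_backlog(sample)
print(f"{tags} gc entries, {objs} objects pending deletion")
```

Piping the real command's output into a script like this avoids holding the whole multi-gigabyte listing in a shell pipeline.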
We seem completely unable to get this deleted, and nothing else of immediate concern is flagging up as a potential cause of all RGWs becoming unresponsive at the same time. On the bucket containing this object (the one we originally tried to purge), I attempted a further purge with the --bypass-gc parameter, but this also resulted in all RGWs becoming unresponsive within 30 minutes, so I terminated the operation and restarted them again.
The bucket we attempted to remove has no shards (it is an old bucket; sharding is now active for new buckets) and I have attached its details below. To our knowledge, 90% of the bucket's contents have already been successfully removed.
root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket stats --bucket=xxxxxxxxxxxx
{
    "bucket": "xxxxxxxxxxxx",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.290071.4",
    "marker": "default.290071.4",
    "owner": "yyyyyyyyyy",
    "ver": "0#107938549",
    "master_ver": "0#0",
    "mtime": "2014-10-24 14:58:48.955805",
    "max_marker": "0#",
    "usage": {
        "rgw.none": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 0
        },
        "rgw.main": {
            "size_kb": 186685939,
            "size_kb_actual": 189914068,
            "num_objects": 1419528
        },
        "rgw.multimeta": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 24
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}
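The stats above can be summarized programmatically to show why this bucket is painful for an unsharded index. The summarize helper and its 100,000-objects-per-shard threshold are illustrative assumptions (a commonly cited rule of thumb, not an official limit); the numbers are taken directly from the output above.

```python
# Bucket stats from the radosgw-admin output above (placeholders kept as-is).
stats = {
    "bucket": "xxxxxxxxxxxx",
    "id": "default.290071.4",
    "usage": {
        "rgw.none": {"size_kb": 0, "num_objects": 0},
        "rgw.main": {"size_kb": 186685939, "num_objects": 1419528},
        "rgw.multimeta": {"size_kb": 0, "num_objects": 24},
    },
}

def summarize(stats, per_shard_threshold=100_000):
    """Sum object counts and size across usage categories, and flag a bucket
    whose single index shard exceeds a (hypothetical) per-shard threshold."""
    total = sum(c["num_objects"] for c in stats["usage"].values())
    size_gib = sum(c["size_kb"] for c in stats["usage"].values()) / (1024 * 1024)
    overloaded = total > per_shard_threshold
    return total, size_gib, overloaded

total, size_gib, overloaded = summarize(stats)
print(f"{total} objects, {size_gib:.0f} GiB; index shard overloaded: {overloaded}")
```

With roughly 1.4M objects on a single index shard, both the purge and the resulting GC churn hammer one index object, which is consistent with the RGWs stalling together.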
If anyone has any thoughts, they’d be greatly appreciated!
Kind Regards,
Updated by sean redmond about 6 years ago
To update this case: the cluster was updated to 10.2.10 and an inconsistent PG was found in .rgw.buckets.index; once repaired, the GC process seems to be progressing. It appears it may have been broken for some time. In future releases, does ceph health update if the GC backlog is very high?
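For anyone hitting the same thing, the repair workflow can be scripted from `rados list-inconsistent-pg .rgw.buckets.index`, which prints a JSON array of PG IDs. A minimal sketch (the sample PG IDs are made up):

```python
import json

# Hypothetical output of `rados list-inconsistent-pg .rgw.buckets.index`:
# a JSON array of inconsistent PG IDs.
sample = '["11.2f", "11.4a"]'

def repair_commands(list_inconsistent_json):
    """Turn the JSON list of inconsistent PGs into `ceph pg repair` commands."""
    return [f"ceph pg repair {pgid}" for pgid in json.loads(list_inconsistent_json)]

for cmd in repair_commands(sample):
    print(cmd)
```

Reviewing the generated commands before running them is wise, since `pg repair` trusts the primary's copy.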
Updated by David Turner almost 6 years ago
In Luminous 12.2.2 we had a GC backlog of over 200M objects and there was no notification from the cluster that this was the case. Our GC was using 40% of our available cluster space. I think this would be a very useful thing to add to the cluster status output, or to make discoverable some other way that doesn't involve awkward greps and wc's over the output of listing the GC, which can take longer than a day if the list gets large enough.
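The kind of check being asked for could look something like the sketch below: map a GC backlog count to a health state. The thresholds are made-up examples, and this is not an existing Ceph health check, just an illustration of the proposal.

```python
def gc_health(pending_objects, warn_threshold=1_000_000):
    """Map a GC backlog size to a health state. The thresholds here are
    hypothetical examples, not actual Ceph defaults."""
    if pending_objects >= warn_threshold * 10:
        return ("HEALTH_ERR", f"gc backlog critical: {pending_objects} objects")
    if pending_objects >= warn_threshold:
        return ("HEALTH_WARN", f"gc backlog high: {pending_objects} objects")
    return ("HEALTH_OK", "")

# A 200M-object backlog like the one described above would flag loudly.
print(gc_health(200_000_000)[0])
```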