Bug #44352
pool listings are slow after deleting objects
Description
I'm seeing a weird problem on a system where the following was done:
- multi-site setup with two zones
- primary went offline (sync status was fully synced)
- secondary was made master and default
- primary was deleted
Everything's working fine, but some radosgw-admin commands are slow. radosgw startup is also slower than usual.
These commands are slow to respond:
radosgw-admin zone list
radosgw-admin zonegroup list
radosgw-admin realm list
radosgw-admin user list
They all take almost exactly one minute to complete, indicating some kind of timeout.
Apart from the slowness of these commands, everything works perfectly fine.
It doesn't affect all commands; for example, getting the default realm/zonegroup/zone with "get" instead of "list" is as fast as usual.
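For comparison, these are the "get" variants that stay as fast as usual, presumably because they read individual objects instead of iterating the whole pool:

radosgw-admin realm get
radosgw-admin zonegroup get
radosgw-admin zone get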
Here's what it looks like with debug logging enabled for rgw:
$ radosgw-admin zone list --debug-rgw=30/30
2020-02-28 12:50:03.765 7fd109ff3700  2 RGWDataChangesLog::ChangesRenewThread: start
2020-02-28 12:50:03.765 7fd1097f2700 20 reqs_thread_entry: start
2020-02-28 12:50:25.764 7fd109ff3700  2 RGWDataChangesLog::ChangesRenewThread: start
2020-02-28 12:50:47.764 7fd109ff3700  2 RGWDataChangesLog::ChangesRenewThread: start
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got zonegroup_info.a3463f9f-2a47-4afc-b794-f8fac1904bd3
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.355a1244-2ed9-430a-91a7-50ea6b0a0287.1
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zone.f779b461-23be-4400-8e80-aac8be7aebb7
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zone.35493e32-0bac-4cbd-8524-f5f266e9b70d
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got period_config.f779b461-23be-4400-8e80-aac8be7aebb7
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got zonegroups_names.XXXX
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.4
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zonegroup.35493e32-0bac-4cbd-8524-f5f266e9b70d
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got zone_info.9a745fce-1fd4-4ff5-ba70-311aa8932d98
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.latest_epoch
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got realms_names.XXXXX
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got realms.0354ca1d-3ec8-426e-a6bf-8455c53a5971
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.realm
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got period_config.0354ca1d-3ec8-426e-a6bf-8455c53a5971
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zone.0354ca1d-3ec8-426e-a6bf-8455c53a5971
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zonegroup.0354ca1d-3ec8-426e-a6bf-8455c53a5971
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.f779b461-23be-4400-8e80-aac8be7aebb7:staging
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.355a1244-2ed9-430a-91a7-50ea6b0a0287.latest_epoch
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got default.zonegroup.f779b461-23be-4400-8e80-aac8be7aebb7
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.3
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got realms.0354ca1d-3ec8-426e-a6bf-8455c53a5971.control
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.0354ca1d-3ec8-426e-a6bf-8455c53a5971:staging
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got period_config.35493e32-0bac-4cbd-8524-f5f266e9b70d
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got period_config.e716690f-2b87-4432-aed0-24e4bb24f96e
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.2
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.1
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.27ab57cf-1845-4485-bcbd-e83972cd898f.5
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got zonegroup_info.da0c964c-6029-48cc-bfe6-4a32aaff2f34
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got zone_names.XXXX
2020-02-28 12:51:06.307 7fd12a56d9c0 20 RGWRados::pool_iterate: got periods.35493e32-0bac-4cbd-8524-f5f266e9b70d:staging
2020-02-28 12:51:06.307 7fd12a56d9c0 20 rados->read ofs=0 len=0
2020-02-28 12:51:06.311 7fd12a56d9c0 20 rados_obj.operate() r=0 bl.length=46
2020-02-28 12:51:06.311 7fd12a56d9c0 20 rados->read ofs=0 len=0
{
    "default_info": "9a745fce-1fd4-4ff5-ba70-311aa8932d98",
    "zones": [
        "XXXX"
    ]
}
2020-02-28 12:51:06.311 7fd12a56d9c0 20 rados_obj.operate() r=0 bl.length=107
2020-02-28 12:51:06.311 7fd12a56d9c0 20 rados->read ofs=0 len=0
2020-02-28 12:51:06.311 7fd12a56d9c0 20 rados_obj.operate() r=0 bl.length=46
I've also encountered the following unusual warnings in the radosgw log:
Feb 27 18:36:28 XXX radosgw[189536]: 2020-02-27 18:36:28.220 7fb4c7d92700 -1 rgw realm watcher: RGWRealmWatcher::handle_error oid=realms.35493e32-0bac-4cbd-8524-f5f266e9b70d.control err=-107
Feb 27 18:36:28 XXX radosgw[189536]: 2020-02-27 18:36:28.220 7fb5eafd8700 -1 RGWWatcher::handle_error cookie 94472282110976 err (107) Transport endpoint is not connected
Feb 27 18:36:28 XXX radosgw[189536]: 2020-02-27 18:36:28.220 7fb4c7d92700 -1 rgw realm watcher: RGWRealmWatcher::handle_error oid=realms.35493e32-0bac-4cbd-8524-f5f266e9b70d.control err=-107
Feb 27 18:36:28 XXX radosgw[189536]: 2020-02-27 18:36:28.220 7fb5eafd8700 -1 RGWWatcher::handle_error cookie 94472282110976 err (107) Transport endpoint is not connected
History
#1 Updated by Abhishek Lekshmanan over 3 years ago
Around 20 seconds were spent just listing the contents of the pool, which is what the debug logs show. How long does `rados -p .rgw.root ls` take?
#2 Updated by Serg Protsun over 3 years ago
Abhishek Lekshmanan wrote:
Around 20 seconds were spent just listing the contents of the pool, which is what the debug logs show. How long does `rados -p .rgw.root ls` take?
It takes 20-30 seconds for it to complete:
$ time rados -p .rgw.root ls >/dev/null

real    0m26.534s
user    0m0.065s
sys     0m0.058s
#3 Updated by Serg Protsun over 3 years ago
For other pools, it takes only a fraction of that time to list objects with
rados -p <pool> ls
#4 Updated by Casey Bodley over 3 years ago
- Project changed from rgw to RADOS
- Subject changed from "Some radosgw-admin commands are slow after deleting a zone" to "pool listings are slow after deleting objects"
radosgw-admin commands are just listing the pool, and the performance degradation happens with 'rados ls' too - moving this to RADOS
#5 Updated by Igor Fedotov over 3 years ago
I think this is a known issue with slow [omap] listing caused by RocksDB fragmentation.
There were a bunch of improvements in Nautilus, including data prefetching during reads.
What Ceph release is the cluster in question running? Do the OSDs use a fast standalone DB device for the underlying BlueStore, if any?
Manual DB compaction can be a temporary workaround to bring DB performance back to its regular numbers.
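For reference, a rough sketch of how to check the DB device question and run a manual compaction; the OSD id 0, data path, and systemd unit name below are placeholders, not taken from this cluster:

# Check whether the OSD has a dedicated BlueFS DB device
ceph osd metadata 0 | grep bluefs

# Online compaction of one OSD's RocksDB
ceph tell osd.0 compact        # or, on the OSD host: ceph daemon osd.0 compact

# Or offline, with the OSD stopped
systemctl stop ceph-osd@0
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
systemctl start ceph-osd@0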
#6 Updated by Paul Emmerich about 3 years ago
This was on the latest Nautilus release at the time; the DB should have been on SSD, but I don't remember. Good pointer about that, though, thanks. I guess the deletion triggered a lot of changes.
#7 Updated by Serg Protsun almost 3 years ago
We compacted the OSDs from time to time, and it helped for a while. We then moved the .rgw.root pool onto SSD drives only (it had been on HDDs with NVMe for DB+WAL), which improved performance considerably.
$ time rados -p .rgw.root ls > /dev/null
real 0m0.140s
user 0m0.078s
sys 0m0.026s
$ time radosgw-admin user list > /dev/null
real 0m0.360s
user 0m0.108s
sys 0m0.080s
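For anyone hitting the same problem, a rough sketch of the kind of pool move described above, assuming the SSD OSDs carry the `ssd` CRUSH device class; the rule name `rgw-root-ssd` is just a placeholder:

# Create a replicated CRUSH rule restricted to SSD OSDs
ceph osd crush rule create-replicated rgw-root-ssd default host ssd

# Point .rgw.root at the new rule; data rebalances onto the SSDs in the background
ceph osd pool set .rgw.root crush_rule rgw-root-ssd

# Watch the rebalance until the pool has fully moved
ceph -s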