Bug #24937

[rgw] Very high cache misses with automatic bucket resharding

Added by Aleksandr Rudenko over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding had not been enabled on this cluster before.

A few days ago I enabled automatic resharding.

Since then I have been seeing:

- very high Ceph read I/O (~300 I/O before enabling resharding, ~4k now),
- very high Ceph read bandwidth (50 MB/s before enabling resharding, 250 MB/s now),
- very high RGW cache misses (400 count/s before enabling resharding, ~3.5k count/s now).

For Ceph monitoring I use the MGR Zabbix plugin and the Zabbix template from the Ceph GitHub repo.
For RGW monitoring I use RGW perf dump and my own script.
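
For readers who want to reproduce the cache-miss figure, here is a minimal sketch of how a per-second rate can be derived from two perf dump samples. It is not the reporter's script. It assumes the dump is JSON with a section named `rgw` holding a monotonically increasing `cache_miss` counter; the section name differs between deployments, so it is a parameter. The sample values are made up for illustration.

```python
def counter_rate(prev, curr, interval_s, section="rgw", counter="cache_miss"):
    """Per-second rate of a cumulative counter between two parsed perf dumps.

    `prev` and `curr` are dicts as produced by json.loads() on the dump output.
    The section and counter names are assumptions; adjust them to your dump.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    before = prev[section][counter]
    after = curr[section][counter]
    delta = after - before
    if delta < 0:
        # Counter went backwards: the daemon was probably restarted,
        # so count only what has accumulated since the restart.
        delta = after
    return delta / interval_s


# Illustrative samples taken 10 seconds apart (values are invented).
sample_a = {"rgw": {"cache_hit": 500_000, "cache_miss": 1_000}}
sample_b = {"rgw": {"cache_hit": 520_000, "cache_miss": 36_000}}
miss_rate = counter_rate(sample_a, sample_b, interval_s=10)
print(f"cache misses per second: {miss_rate:.1f}")
```

Sampling at a fixed interval and graphing the result makes a jump like the one described above easy to spot.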

RGW cache misses.png - high RGW cache misses (116 KB) Aleksandr Rudenko, 07/16/2018 09:56 AM

Ceph High IO.png - high Ceph IO (190 KB) Aleksandr Rudenko, 07/16/2018 09:56 AM


Related issues

Related to rgw - Bug #27219: lock in resharding may expires before the dynamic resharding completes Resolved 08/24/2018

History

#1 Updated by Aleksandr Rudenko over 1 year ago

I think I have this problem:
RGW Dynamic bucket index resharding keeps resharding all buckets - https://tracker.ceph.com/issues/24551?next_issue_id=24546&prev_issue_id=24562

I think RGW was resharding buckets over and over again, but in my case this reproduces on buckets with versioning disabled:

radosgw-admin reshard list
...
    {
        "time": "2018-07-17 11:08:20.336354Z",
        "tenant": "",
        "bucket_name": "bucket-name",
        "bucket_id": "default.32785769.2",
        "new_instance_id": "",
        "old_num_shards": 1,
        "new_num_shards": 161
    },
...
radosgw-admin bucket limit check
...
    {
        "bucket": "bucket-name",
        "tenant": "",
        "num_objects": 20840702,
        "num_shards": 161,
        "objects_per_shard": 129445,
        "fill_status": "OVER 100.000000%"
    },
...
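
The per-shard figure in the output above can be checked by hand. The sketch below reproduces it with integer division and compares it against a per-shard threshold. The threshold of 100000 is an assumption (believed to be the default of `rgw_max_objs_per_shard`), and the exact arithmetic `radosgw-admin` uses for `fill_status` may differ.

```python
def shard_fill(num_objects, num_shards, max_objs_per_shard=100_000):
    """Return (objects_per_shard, fill_percent) for a bucket index.

    max_objs_per_shard is an assumed threshold, not read from any cluster.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    objects_per_shard = num_objects // num_shards
    fill_percent = 100.0 * objects_per_shard / max_objs_per_shard
    return objects_per_shard, fill_percent


# Values copied from the "bucket limit check" output above.
per_shard, fill = shard_fill(num_objects=20840702, num_shards=161)
print(per_shard)             # 129445, matching "objects_per_shard"
print(fill > 100.0)          # True, consistent with "OVER 100.000000%"
```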

#2 Updated by Eric Ivancich over 1 year ago

This may be related to the problem addressed by http://tracker.ceph.com/issues/27219. The problem there was that, under high load, resharding could not complete before the resharding lock expired. The PR for that issue does a number of things to address this, including renewing the lock periodically to allow resharding to complete.

#3 Updated by Eric Ivancich over 1 year ago

  • Status changed from New to Pending Backport
  • Assignee set to Eric Ivancich

This PR (https://github.com/ceph/ceph/pull/24898) is a luminous backport of a bug fix that resolved this in both master and downstream ceph. The bug fix allows resharding to complete even when it takes a long time, by periodically renewing the reshard lock. Previously the reshard lock could be lost and another reshard job started, thereby creating the problem described.
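
To illustrate the idea of periodic lock renewal described here, below is a toy simulation. It is not the RGW implementation: `FakeClock`, `LeaseLock`, and `reshard` are hypothetical names, time is simulated, and the "renew when less than half the lease remains" policy is an assumption chosen only for the example.

```python
class FakeClock:
    """Simulated clock so the example is deterministic."""
    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def advance(self, dt=1):
        self.now += dt


class LeaseLock:
    """Toy time-limited lock that expires `duration` units after (re)acquisition."""
    def __init__(self, clock, duration):
        self.clock = clock
        self.duration = duration
        self.expires_at = clock() + duration

    def held(self):
        return self.clock() < self.expires_at

    def renew(self):
        if not self.held():
            raise RuntimeError("lock expired; another job may already hold it")
        self.expires_at = self.clock() + self.duration


def reshard(num_entries, lock, tick, renew=True):
    """Process entries one per time unit; return True if the lock was held throughout."""
    for _ in range(num_entries):
        tick()  # simulate the time spent moving one index entry
        if not lock.held():
            return False  # lock lost mid-job: a second reshard could now start
        if renew and lock.expires_at - lock.clock() < lock.duration / 2:
            lock.renew()  # assumed policy: renew once under half the lease is left
    return True


def run(renew):
    clock = FakeClock()
    lock = LeaseLock(clock, duration=10)
    return reshard(100, lock, clock.advance, renew=renew)


print(run(renew=False))  # False: a 100-unit job outlives a 10-unit lease
print(run(renew=True))   # True: the lease is extended while the job runs
```

Without renewal the job loses its lock partway through, which is the window in which a second reshard of the same bucket could begin.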

#4 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved

Backports are going via #27219

#5 Updated by Nathan Cutler over 1 year ago

  • Related to Bug #27219: lock in resharding may expires before the dynamic resharding completes added
