Bug #24937


[rgw] Very high cache misses with automatic bucket resharding

Added by Aleksandr Rudenko almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
Normal
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi, guys.

I use Luminous 12.2.5.

Automatic bucket index resharding had not been enabled in the past.

A few days ago I enabled automatic resharding.

Since then I have seen:

- very high Ceph read I/O (~300 IOPS before enabling resharding, ~4k now),
- very high Ceph read bandwidth (~50 MB/s before enabling resharding, ~250 MB/s now),
- a very high RGW cache miss rate (~400/s before enabling resharding, ~3.5k/s now).

For Ceph monitoring I use the MGR Zabbix plugin and the Zabbix template from the Ceph GitHub repo.
For RGW monitoring I use RGW perf dump and my own script.
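
For reference, a minimal way to sample those counters directly from the RGW admin socket (the daemon id client.rgw.gateway1 below is a placeholder for your gateway's name; cache_hit/cache_miss are the counter names exposed under the "rgw" section on Luminous):

# Dump all RGW perf counters and extract the cache hit/miss totals.
# Requires jq and access to the RGW admin socket on the gateway host.
ceph daemon client.rgw.gateway1 perf dump | jq '.rgw | {cache_hit, cache_miss}'

Sampling this periodically and diffing the totals yields a per-second miss rate like the one in the attached graph.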


Files

RGW cache misses.png (116 KB): high RGW cache misses. Aleksandr Rudenko, 07/16/2018 09:56 AM
Ceph High IO.png (190 KB): high Ceph IO. Aleksandr Rudenko, 07/16/2018 09:56 AM

Related issues (1): 0 open, 1 closed

Related to rgw - Bug #27219: lock in resharding may expires before the dynamic resharding completes (Resolved, J. Eric Ivancich, 08/24/2018)

Actions #1

Updated by Aleksandr Rudenko almost 6 years ago

I think I have this problem:
RGW Dynamic bucket index resharding keeps resharding all buckets - https://tracker.ceph.com/issues/24551

I think RGW was resharding buckets over and over again, but in my case this reproduced on versioning-disabled buckets:

radosgw-admin reshard list
...
    {
        "time": "2018-07-17 11:08:20.336354Z",
        "tenant": "",
        "bucket_name": "bucket-name",
        "bucket_id": "default.32785769.2",
        "new_instance_id": "",
        "old_num_shards": 1,
        "new_num_shards": 161
    },
...

radosgw-admin bucket limit check
...
    {
        "bucket": "bucket-name",
        "tenant": "",
        "num_objects": 20840702,
        "num_shards": 161,
        "objects_per_shard": 129445,
        "fill_status": "OVER 100.000000%"
    },
...
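
For anyone hitting the same loop: the pending entry can be inspected and, if needed, removed from the queue with the reshard subcommands documented for Luminous (bucket-name below is the placeholder from the listing above):

# Show the resharding state recorded for the bucket.
radosgw-admin reshard status --bucket=bucket-name

# Remove the bucket's pending entry from the resharding queue.
radosgw-admin reshard cancel --bucket=bucket-name

Note this only clears the queue entry; with the underlying bug still present, dynamic resharding can re-add the bucket on its next scan.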

Actions #2

Updated by J. Eric Ivancich over 5 years ago

This may be related to the problem addressed by http://tracker.ceph.com/issues/27219. The problem there was that, due to high load, resharding could not complete before the resharding lock expired. The fix for that issue does a number of things to address this, including renewing the lock periodically so that resharding can complete.

Actions #3

Updated by J. Eric Ivancich over 5 years ago

  • Status changed from New to Pending Backport
  • Assignee set to J. Eric Ivancich

This PR (https://github.com/ceph/ceph/pull/24898) is a Luminous backport of a bug fix that resolved this in both master and downstream Ceph. The fix allows resharding to complete even when it takes a long time, by periodically renewing the reshard lock. Previously the reshard lock could be lost and another reshard job started, thereby creating the problem described.

Actions #4

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved

Backports are being handled via #27219.

Actions #5

Updated by Nathan Cutler over 5 years ago

  • Related to Bug #27219: lock in resharding may expires before the dynamic resharding completes added