Bug #39485

Luminous: a huge bucket stuck in dynamic resharding for one week

Added by Rui Xu about 5 years ago. Updated almost 5 years ago.

Status: Need More Info
Priority: Normal
Target version: -
% Done: 0%
Source: Community (user)
Tags: rgw
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph version: Luminous 12.2.5.
The cluster has 3 monitors, 3 RGW gateways, and 436 BlueStore OSDs: 24 on NVMe disks, 72 on SSD disks, and 340 on SATA disks.
We use Ceph for RGW with S3. There are some huge buckets in this cluster; the largest bucket holds 470 million objects.

The problem is that a bucket with 100 million objects is stuck in dynamic resharding, going from 1024 to 2048 shards. It has been hanging for 7 days. Dynamic resharding in this cluster does not normally take that long, but this time it seems to be endless.
The resharding bucket can only be read, not written to. When putting any files, the RGW gateway logs "NOTICE: reshard still in progress, retrying".
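
(For reference, the reshard state recorded on the bucket can also be inspected directly; a minimal sketch, assuming the radosgw-admin reshard subcommands available in Luminous:)

# Show the resharding status recorded for the bucket's index shards
radosgw-admin reshard status --bucket mpilot-data-s3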

Checking the sharding state with radosgw-admin: the output of radosgw-admin bucket limit check shows the bucket already has 2048 shards and the fill status is OK, but radosgw-admin reshard list still shows the resharding as in progress.

The outputs are:

[root@ceph29 ceph_rgw_debug]# radosgw-admin bucket stats --bucket mpilot-data-s3 | head -n 11
{
    "bucket": "mpilot-data-s3",
    "zonegroup": "b8176099-351f-4a8a-a8aa-24a2623ead53",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": "" 
    },
    "id": "0089274c-7a8b-4e66-83dd-d45e638415d7.52478916.4",
    "marker": "0089274c-7a8b-4e66-83dd-d45e638415d7.21212512.1",
[root@ceph29 ceph_rgw_debug]# radosgw-admin reshard list
[
    {
        "time": "2019-04-18 07:20:08.878365Z",
        "tenant": "",
        "bucket_name": "mpilot-data-s3",
        "bucket_id": "0089274c-7a8b-4e66-83dd-d45e638415d7.46222245.1",
        "new_instance_id": "mpilot-data-s3:0089274c-7a8b-4e66-83dd-d45e638415d7.80468580.2",
        "old_num_shards": 1024,
        "new_num_shards": 2048
    }
]
[root@ceph29 ceph_rgw_debug]# radosgw-admin bucket limit check --uid mpilot-admin
[
    {
        "user_id": "mpilot-admin",
        "buckets": [
            {
                "bucket": "mpilot-data-s3",
                "tenant": "",
                "num_objects": 102940439,
                "num_shards": 2048,
                "objects_per_shard": 50263,
                "fill_status": "OK" 
            }
        ]
    }
]
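
(In case it helps, a possible workaround sketch, assuming the Luminous radosgw-admin reshard subcommands behave as documented; whether cancelling a reshard that is already mid-flight is safe on 12.2.5 should be confirmed first:)

# Remove the stuck entry from the reshard queue so writes are unblocked
radosgw-admin reshard cancel --bucket mpilot-data-s3

# If needed, reshard the bucket manually afterwards
radosgw-admin bucket reshard --bucket mpilot-data-s3 --num-shards 2048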

The RGW gateway log with debug_rgw=30 is attached. A summary from the log:

2019-04-19 03:36:22.894125 7f7a78607700  1 ====== starting new request req=0x7f7a78601190 =====
2019-04-19 03:36:22.895619 7f7ab7685700  0 block_while_resharding ERROR: bucket is still resharding, please retry
2019-04-19 03:36:22.895753 7f7ab7685700  0 WARNING: set_req_state_err err_no=2300 resorting to 500
2019-04-19 03:36:22.895847 7f7ab7685700  0 ERROR: RESTFUL_IO(s)->complete_header() returned err=Input/output error
2019-04-19 03:36:22.895910 7f7ab7685700  1 ====== req done req=0x7f7ab767f190 op status=-2300 http_status=500 ======
2019-04-19 03:36:22.895974 7f7ab7685700  1 civetweb: 0x5572b7a60000: 172.16.10.221 - - [19/Apr/2019:03:28:52 +0800] "PUT /mpilot-data-s3/itg_test/190417T115433_mkz1-a_mkzX2-c/raw_images/1555485845_fisheye_cam1/1555485847.753603.s1.jpg HTTP/1.1" 500 0 - aws-sdk-go/1.15.56 (go1.10.3; linux; amd64) S3Manager
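
(A quick way to gauge how many requests the gateway has been rejecting while the reshard flag is set; a sketch over the attached log file, using the message strings shown above:)

# Count rejected writes in the attached debug log
grep -c 'block_while_resharding ERROR' ceph-client.rgw.radosgw-ceph6.log
grep -c 'reshard still in progress' ceph-client.rgw.radosgw-ceph6.log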


Files

ceph-client.rgw.radosgw-ceph6.log (455 KB) - rgw resharding - Rui Xu, 04/25/2019 12:33 PM