Bug #21619

RGW Reshard error add failed to drop lock on <bucket>

Added by Yoann Moulin over 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: High
Target version:
Start date: 10/02/2017
Due date:
% Done: 0%
Source: Community (user)
Tags: radosgw,reshrad
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Hello,

I have a 4 TB bucket with 15M files. I'd like to reshard it, but I get this error:

# radosgw-admin --version
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)

First try:

# radosgw-admin --cluster luminous bucket reshard process --bucket image-net --num-shards=150
*** NOTICE: operation will not remove old bucket index objects ***
***         these will need to be removed manually             ***
tenant:
bucket name: image-net
old bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.44353.1
new bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.275887.1
total entries: 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000
[snip]
15279000 15280000 15281000 15282000 15283000 15284000 15285000 15286000 15287000 15288000 15289000 15290000 15291000 15292000 15293000 15294000 15295000 15295330
2017-10-02 13:53:34.133092 7fd553fafcc0  0 ERROR: failed to write bucket info, ret=-125
2017-10-02 13:53:34.133106 7fd553fafcc0  0 do_reshard: failed to update bucket info ret=-125
2017-10-02 13:53:34.137261 7fd553fafcc0  0 ERROR: failed to write bucket info, ret=-125
2017-10-02 13:53:34.208338 7fd553fafcc0  0 WARNING: RGWReshard::add failed to drop lock on image-net:69d2fd65-fcf9-461b-865f-3dbb053803c4.44353.1 ret=-2

Second try, with the same --num-shards=150:

# radosgw-admin --cluster luminous bucket reshard process --bucket image-net --num-shards=150
num shards is less or equal to current shards count
do you really mean it? (requires --yes-i-really-mean-it)

Third try, with --num-shards=160 (run with debug_rgw=20, see the attached files):

# radosgw-admin --cluster luminous bucket reshard process --bucket image-net --num-shards=160
*** NOTICE: operation will not remove old bucket index objects ***
***         these will need to be removed manually             ***
tenant: 
bucket name: image-net
old bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.266391.1
new bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.266445.1
total entries: 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 
[snip]
15284000 15285000 15286000 15287000 15288000 15289000 15290000 15291000 15292000 15293000 15294000 15295000 15295330
2017-10-02 15:07:27.436167 7f7c8aee2cc0  0 WARNING: RGWReshard::add failed to drop lock on image-net:69d2fd65-fcf9-461b-865f-3dbb053803c4.266391.1 ret=-2

Bucket information:

# radosgw-admin --cluster luminous bucket stats --bucket=image-net
{
    "bucket": "image-net",
    "zonegroup": "43d23097-56b9-48a6-ad52-de42341be4bd",
    "placement_rule": "default-placement",
    "explicit_placement": {
        "data_pool": "",
        "data_extra_pool": "",
        "index_pool": "" 
    },
    "id": "69d2fd65-fcf9-461b-865f-3dbb053803c4.266445.1",
    "marker": "69d2fd65-fcf9-461b-865f-3dbb053803c4.44353.1",
    "index_type": "Normal",
    "owner": "rgwadmin",
    "ver": "0#1521,1#1514,2#1521,3#1481,4#1485,5#1489,6#1481,7#1482,8#1496,9#1492,10#1481,11#1485,12#1491,13#1514,14#1488,15#1490,16#1489,17#1486,18#1498,19#1490,20#1485,21#1482,22#1490,23#1480,24#1517,25#1491,26#1485,27#1493,28#1500,29#1486,30#1488,31#1482,32#1474,33#1487,34#1488,35#1515,36#1485,37#1491,38#1498,39#1489,40#1487,41#1485,42#1490,43#1488,44#1494,45#1492,46#1520,47#1487,48#1489,49#1489,50#1487,51#1492,52#1490,53#1491,54#1489,55#1487,56#1476,57#1521,58#1486,59#1486,60#1477,61#1481,62#1489,63#1484,64#1490,65#1491,66#1486,67#1493,68#1525,69#1518,70#1514,71#1522,72#1514,73#1519,74#1518,75#1515,76#1516,77#1518,78#1523,79#1519,80#1521,81#1517,82#1526,83#1515,84#1515,85#1525,86#1522,87#1511,88#1522,89#1515,90#1520,91#1494,92#1488,93#1490,94#1520,95#1489,96#1484,97#1487,98#1487,99#1483,100#1489,101#1485,102#1486,103#1480,104#1492,105#1514,106#1479,107#1498,108#1486,109#1483,110#1483,111#1488,112#1483,113#1500,114#1482,115#1492,116#1525,117#1494,118#1480,119#1481,120#1481,121#1474,122#1484,123#1492,124#1487,125#1494,126#1484,127#1521,128#1490,129#1493,130#1497,131#1498,132#1492,133#1490,134#1488,135#1493,136#1486,137#1498,138#1518,139#1484,140#1486,141#1484,142#1490,143#1496,144#1494,145#1487,146#1485,147#1485,148#1492,149#1520,150#1500,151#1493,152#1497,153#1485,154#1489,155#1489,156#1495,157#1491,158#1484,159#1492",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0,64#0,65#0,66#0,67#0,68#0,69#0,70#0,71#0,72#0,73#0,74#0,75#0,76#0,77#0,78#0,79#0,80#0,81#0,82#0,83#0,84#0,85#0,86#0,87#0,88#0,89#0,90#0,91#0,92#0,93#0,94#0,95#0,96#0,97#0,98#0,99#0,100#0,101#0,102#0,103#0,104#0,105#0,106#0,107#0,108#0,109#0,110#0,111#0,112#0,113#0,114#0,115#0,116#0,117#0,118#0,119#0,120#0,121#0,122#0,123#0,124#0,125#0,126#0,127#0,128#0,129#0,130#0,131#0,132#0,133#0,134#0,135#0,136#0,137#0,138#0,139#0,140#0,141#0,142#0,143#0,144#0,145#0,146#0,147#0,148#0,149#0,150#0,151#0,152#0,153#0,154#0,155#0,156#0,157#0,158#0,159#0",
    "mtime": "2017-10-02 15:07:26.589474",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#,53#,54#,55#,56#,57#,58#,59#,60#,61#,62#,63#,64#,65#,66#,67#,68#,69#,70#,71#,72#,73#,74#,75#,76#,77#,78#,79#,80#,81#,82#,83#,84#,85#,86#,87#,88#,89#,90#,91#,92#,93#,94#,95#,96#,97#,98#,99#,100#,101#,102#,103#,104#,105#,106#,107#,108#,109#,110#,111#,112#,113#,114#,115#,116#,117#,118#,119#,120#,121#,122#,123#,124#,125#,126#,127#,128#,129#,130#,131#,132#,133#,134#,135#,136#,137#,138#,139#,140#,141#,142#,143#,144#,145#,146#,147#,148#,149#,150#,151#,152#,153#,154#,155#,156#,157#,158#,159#",
    "usage": {
        "rgw.main": {
            "size": 4968679552139,
            "size_actual": 5001265569792,
            "size_utilized": 0,
            "size_kb": 4852226126,
            "size_kb_actual": 4884048408,
            "size_kb_utilized": 0,
            "num_objects": 15295312
        },
        "rgw.multimeta": {
            "size": 0,
            "size_actual": 0,
            "size_utilized": 0,
            "size_kb": 0,
            "size_kb_actual": 0,
            "size_kb_utilized": 0,
            "num_objects": 18
        }
    },
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    }
}
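
The "ver" field in the stats above appears to carry one shard#version entry per index shard (0 through 159 here, which would match the 160 shards requested in the third try). A minimal sketch for counting those entries from saved bucket stats output follows; the function name is hypothetical, and the one-entry-per-shard reading is an assumption based on this output rather than on a documented guarantee.

```shell
# Count the shard entries in the "ver" field of `bucket stats` JSON read on stdin.
# Assumption: "ver" holds one "<shard>#<version>" entry per index shard.
shard_count_from_stats() {
    # '"ver":' does not match "master_ver" because of the leading quote.
    grep '"ver":' | head -n 1 | tr ',' '\n' | grep -c '#'
}
```

It could be fed directly from the command above, for example: radosgw-admin --cluster luminous bucket stats --bucket=image-net | shard_count_from_stats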

Yoann

luminous-rgw-iccluster023.log - Radosgw debug log #2 (61.9 KB) Yoann Moulin, 10/02/2017 01:32 PM

luminous-rgw-iccluster015.log - Radosgw debug log #1 (62.3 KB) Yoann Moulin, 10/02/2017 01:32 PM

luminous.conf - ceph cluster configuration (2.11 KB) Yoann Moulin, 10/02/2017 01:34 PM


Related issues

Copied to rgw - Backport #23687: luminous: RGW Reshard error add failed to drop lock on <bucket> Resolved

History

#1 Updated by Yehuda Sadeh over 1 year ago

Does this keep happening, or does it fix itself after 2 minutes?
Looks like there was a racing change that went into the bucket while doing the reshard.
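
For reference, the ret= values in the report are negated errno codes. On Linux, 125 is ECANCELED and 2 is ENOENT, so ret=-125 on the bucket info write fits the picture of a write that was cancelled because something else changed in the meantime, and ret=-2 on the lock drop means the object was not found. The sketch below decodes only these two codes; the function names are hypothetical and the numeric values are Linux-specific.

```shell
# Extract the ret= value from a radosgw-admin log line read on stdin.
ret_from_log_line() {
    sed -n 's/.*ret=\(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}

# Map a (possibly negative) return code to its Linux errno name.
# Only the two codes seen in this report are covered.
errno_name() {
    code=${1#-}    # strip a leading minus sign: -125 -> 125
    case "$code" in
        2)   echo "ENOENT" ;;      # No such file or directory
        125) echo "ECANCELED" ;;   # Operation canceled
        *)   echo "UNKNOWN($code)" ;;
    esac
}
```

For example, a log line can be piped through ret_from_log_line and the result passed to errno_name.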

#2 Updated by Yoann Moulin over 1 year ago

Yehuda Sadeh wrote:

Does this keep happening, or does it fix itself after 2 minutes?
Looks like there was a racing change that went into the bucket while doing the reshard.

I have just retried now, and I have the same error.

root@iccluster007:~# radosgw-admin --cluster luminous bucket reshard process --bucket image-net --num-shards=160 --yes-i-really-mean-it
*** NOTICE: operation will not remove old bucket index objects ***
***         these will need to be removed manually             ***
tenant: 
bucket name: image-net
old bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.266445.1
new bucket instance id: 69d2fd65-fcf9-461b-865f-3dbb053803c4.287912.1
total entries: 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000
[snip]
15280000 15281000 15282000 15283000 15284000 15285000 15286000 15287000 15288000 15289000 15290000 15291000 15292000 15293000 15294000 15295000 15295330
2017-10-05 20:36:56.100673 7f316a4d8cc0  0 WARNING: RGWReshard::add failed to drop lock on image-net:69d2fd65-fcf9-461b-865f-3dbb053803c4.266445.1 ret=-2

But it seems that it did in fact work:

root@iccluster007:~# radosgw-admin --cluster luminous metadata get bucket:image-net | grep bucket_id
            "bucket_id": "69d2fd65-fcf9-461b-865f-3dbb053803c4.287912.1",
root@iccluster007:~# rados  --cluster luminous -p default.rgw.buckets.index ls | grep 44353.1 | wc -l
1
root@iccluster007:~# rados  --cluster luminous -p default.rgw.buckets.index ls | grep 275887.1 | wc -l
150
root@iccluster007:~# rados  --cluster luminous -p default.rgw.buckets.index ls | grep 266445.1 | wc -l
160
root@iccluster007:~# rados  --cluster luminous -p default.rgw.buckets.index ls | grep 287912.1 | wc -l
160

So, if I understand correctly, the resharding completed successfully, and the message (which is only a warning) concerns a lock on the old bucket_id.
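
The grep | wc -l checks above match any object name containing the id fragment, so an id ending in .1 would also match one ending in .10. A slightly stricter sketch is shown below. It assumes index objects are named .dir.<instance_id> for an unsharded index and .dir.<instance_id>.<shard> for a sharded one; that naming, and the function name, are assumptions and not taken from this report.

```shell
# Count index objects that belong to exactly one bucket instance id.
# Reads `rados ... ls` output on stdin; $1 is the full instance id.
# Assumed names: ".dir.<id>" (unsharded) or ".dir.<id>.<shard>" (sharded).
count_index_objects() {
    awk -v id="$1" '
        BEGIN { prefix = ".dir." id; n = 0 }
        # Unsharded index: the name is exactly the prefix.
        $0 == prefix { n++; next }
        # Sharded index: the prefix, a dot, then a numeric shard number.
        index($0, prefix ".") == 1 && substr($0, length(prefix) + 2) ~ /^[0-9]+$/ { n++ }
        END { print n }
    '
}
```

Usage would look like: rados --cluster luminous -p default.rgw.buckets.index ls | count_index_objects 69d2fd65-fcf9-461b-865f-3dbb053803c4.287912.1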

Yoann

#3 Updated by Orit Wasserman over 1 year ago

Are you using multisite?

#4 Updated by Orit Wasserman over 1 year ago

  • Status changed from New to Need More Info

#5 Updated by Orit Wasserman over 1 year ago

  • Assignee set to Orit Wasserman

#6 Updated by Yoann Moulin over 1 year ago

Orit Wasserman wrote:

Are you using multisite?

No. I have created a realm, set the zone and zonegroup as default, and associated them with the realm, but multisite is not configured.
I have 3 radosgw instances behind a DNS round-robin.

#7 Updated by Yehuda Sadeh over 1 year ago

We're working on adding a radosgw-admin reshard abort command that would deal with such issues.

#8 Updated by Yehuda Sadeh over 1 year ago

  • Status changed from Need More Info to In Progress

#9 Updated by Yehuda Sadeh about 1 year ago

  • Priority changed from Normal to High

#10 Updated by Orit Wasserman 12 months ago

reshard cancel will clear the resharding flag:
https://github.com/ceph/ceph/pull/21120
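
A minimal sketch of how the status and cancel subcommands might be chained for a stuck bucket follows. The wrapper function and the DRY_RUN variable are hypothetical. By default it only prints the commands, and whether cancel actually clears the flag depends on running a release that includes the change above.

```shell
# Print (DRY_RUN=1, the default) or run the reshard status/cancel sequence.
# $1 = cluster name, $2 = bucket name.
reshard_cancel() {
    cluster=$1
    bucket=$2
    for sub in status cancel; do
        # Rebuild the argument list for this subcommand.
        set -- radosgw-admin --cluster "$cluster" reshard "$sub" --bucket "$bucket"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@" || return 1   # stop if a step fails
        fi
    done
}
```

Calling reshard_cancel luminous image-net prints the two commands for review; setting DRY_RUN=0 runs them.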

#11 Updated by Orit Wasserman 12 months ago

  • Status changed from In Progress to Need Review

#12 Updated by Orit Wasserman 12 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to luminous

#13 Updated by Nathan Cutler 12 months ago

  • Copied to Backport #23687: luminous: RGW Reshard error add failed to drop lock on <bucket> added

#14 Updated by Abhishek Lekshmanan 11 months ago

  • Status changed from Pending Backport to Resolved
