I am stuck with a very similar problem.
We recently upgraded to nautilus (two days ago), and today we noticed that all read/list/write operations were failing for one bucket, because it is stuck resharding from 256 to 512 shards.
Log entry:
RGWReshardLock::lock failed to acquire lock on reshard.0000000003 ret=-16
The cluster was originally mimic on Ubuntu 16.04, then moved to Ubuntu 18.04, and is now nautilus on Ubuntu 18.04. Everything was created with ceph-deploy. It started with 3 nodes, then grew to 5 and now 8. We have 5 mons, all active, and about 200 OSDs, all healthy.
The cluster is technically in HEALTH_WARN, but only because nautilus now complains about missed scrubbing tasks. MDS and cephfs are also fine. Buckets other than the one reported below still work.
radosgw-admin reshard list
[
{
"time": "2019-05-29 18:27:39.012365Z",
"tenant": "",
"bucket_name": "prelabeling",
"bucket_id": "4529fe07-c707-4c2d-9ccb-1ecf8154aa23.32005164.1",
"new_instance_id": "",
"old_num_shards": 256,
"new_num_shards": 512
}
]
radosgw-admin reshard status --bucket prelabeling
[
{
"reshard_status": "CLS_RGW_RESHARD_IN_PROGRESS",
"new_bucket_instance_id": "4529fe07-c707-4c2d-9ccb-1ecf8154aa23.34570350.1",
"num_shards": 512
},
... this json object repeats exactly 256 times with the exact same content ...
]
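With 256 copies of the same object it is hard to eyeball whether any entry disagrees. A small Python sketch (my own helper, not a radosgw-admin feature) that loads saved `reshard status` JSON and tallies the status values and target instances. The sample is a shortened, hypothetical copy of the output above, and I am assuming one entry per existing index shard:

```python
import json
from collections import Counter


def summarize_reshard_status(raw: str):
    """Tally entries from saved `radosgw-admin reshard status` JSON output."""
    entries = json.loads(raw)
    # How many entries report each reshard_status value.
    statuses = Counter(entry["reshard_status"] for entry in entries)
    # Distinct (new instance id, target shard count) pairs across all entries.
    targets = {(entry["new_bucket_instance_id"], entry["num_shards"]) for entry in entries}
    return len(entries), statuses, targets


# Shortened sample: the real output contains 256 copies of this object.
entry = {
    "reshard_status": "CLS_RGW_RESHARD_IN_PROGRESS",
    "new_bucket_instance_id": "4529fe07-c707-4c2d-9ccb-1ecf8154aa23.34570350.1",
    "num_shards": 512,
}
count, statuses, targets = summarize_reshard_status(json.dumps([entry] * 3))
print(count, dict(statuses), targets)
```

A single status and a single target, as in my output, means every entry agrees that the reshard to the new instance is still in progress.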
radosgw-admin bucket stats --bucket prelabeling
{
"bucket": "prelabeling",
"tenant": "",
"zonegroup": "28036fb5-da9c-4727-a1ec-0f59f846649d",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "4529fe07-c707-4c2d-9ccb-1ecf8154aa23.32005164.1",
"marker": "4529fe07-c707-4c2d-9ccb-1ecf8154aa23.774162.1",
"index_type": "Normal",
"owner": "foobar",
"ver": "0#16279,1#16543,2#16448,3#16325,4#16508,5#16383,6#16438,7#16357,8#16394,9#16373,10#16544,11#16467,12#16481,13#16500,14#16326,15#16425,16#16299,17#16464,18#16624,19#16714,20#16392,21#16772,22#16326,23#16360,24#16376,25#16531,26#16062,27#16210,28#16586,29#16323,30#16388,31#16376,32#16550,33#16378,34#16272,35#16424,36#16417,37#16535,38#16699,39#16339,40#16484,41#16342,42#16454,43#16197,44#16548,45#16545,46#16199,47#16251,48#16310,49#16408,50#16487,51#16486,52#16564,53#16367,54#16219,55#16353,56#16412,57#16624,58#16303,59#16395,60#16375,61#16269,62#16353,63#16254,64#16467,65#16292,66#16622,67#16292,68#16481,69#16291,70#16644,71#16579,72#16557,73#16153,74#16454,75#16582,76#16618,77#16183,78#16558,79#16163,80#16248,81#16339,82#16381,83#16505,84#16690,85#16438,86#16675,87#16340,88#16401,89#16393,90#16489,91#16623,92#16588,93#16413,94#16233,95#16557,96#16438,97#16428,98#16444,99#16344,100#16595,101#16279,102#16634,103#16336,104#16572,105#16419,106#16389,107#16498,108#16556,109#16127,110#16496,111#16116,112#16222,113#16327,114#16311,115#16623,116#16485,117#16280,118#16536,119#16403,120#16247,121#16362,122#16357,123#16427,124#16264,125#16452,126#16370,127#16420,128#16524,129#16307,130#16531,131#16209,132#16346,133#16645,134#16360,135#16498,136#16566,137#16401,138#16252,139#16224,140#16649,141#16246,142#16372,143#16246,144#16552,145#16373,146#16342,147#16512,148#16107,149#16379,150#16435,151#16329,152#16356,153#16423,154#16603,155#16501,156#16577,157#16377,158#16415,159#16378,160#16303,161#16606,162#16295,163#16385,164#16345,165#16544,166#16262,167#16397,168#16601,169#16291,170#16407,171#16408,172#16553,173#16387,174#16369,175#16419,176#16095,177#16553,178#16256,179#16284,180#16203,181#16354,182#16464,183#16355,184#16459,185#16324,186#16428,187#16216,188#16666,189#16546,190#16299,191#16355,192#16463,193#16611,194#16253,195#16416,196#16315,197#15741,198#15874,199#15926,200#15659,201#15820,202#16088,203#15830,204#16030,205#15950,206#15742,207#15424,208#16057,209#15942,210#15866,211#15921,212#15898,213#15858,214#15816,215#16245,216#16019,217#15488,218#16172,219#15859,220#15840,221#15832,222#15812,223#15770,224#15956,225#15818,226#15916,227#16088,228#15799,229#15878,230#15613,231#15863,232#15870,233#15698,234#15848,235#15862,236#15936,237#15756,238#16002,239#16023,240#15907,241#15625,242#15966,243#16136,244#16073,245#15843,246#16007,247#16160,248#15807,249#15973,250#15976,251#15884,252#15848,253#16040,254#15861,255#15972",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0,64#0,65#0,66#0,67#0,68#0,69#0,70#0,71#0,72#0,73#0,74#0,75#0,76#0,77#0,78#0,79#0,80#0,81#0,82#0,83#0,84#0,85#0,86#0,87#0,88#0,89#0,90#0,91#0,92#0,93#0,94#0,95#0,96#0,97#0,98#0,99#0,100#0,101#0,102#0,103#0,104#0,105#0,106#0,107#0,108#0,109#0,110#0,111#0,112#0,113#0,114#0,115#0,116#0,117#0,118#0,119#0,120#0,121#0,122#0,123#0,124#0,125#0,126#0,127#0,128#0,129#0,130#0,131#0,132#0,133#0,134#0,135#0,136#0,137#0,138#0,139#0,140#0,141#0,142#0,143#0,144#0,145#0,146#0,147#0,148#0,149#0,150#0,151#0,152#0,153#0,154#0,155#0,156#0,157#0,158#0,159#0,160#0,161#0,162#0,163#0,164#0,165#0,166#0,167#0,168#0,169#0,170#0,171#0,172#0,173#0,174#0,175#0,176#0,177#0,178#0,179#0,180#0,181#0,182#0,183#0,184#0,185#0,186#0,187#0,188#0,189#0,190#0,191#0,192#0,193#0,194#0,195#0,196#0,197#0,198#0,199#0,200#0,201#0,202#0,203#0,204#0,205#0,206#0,207#0,208#0,209#0,210#0,211#0,212#0,213#0,214#0,215#0,216#0,217#0,218#0,219#0,220#0,221#0,222#0,223#0,224#0,225#0,226#0,227#0,228#0,229#0,230#0,231#0,232#0,233#0,234#0,235#0,236#0,237#0,238#0,239#0,240#0,241#0,242#0,243#0,244#0,245#0,246#0,247#0,248#0,249#0,250#0,251#0,252#0,253#0,254#0,255#0",
"mtime": "2019-05-29 18:59:36.264916",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#,53#,54#,55#,56#,57#,58#,59#,60#,61#,62#,63#,64#,65#,66#,67#,68#,69#,70#,71#,72#,73#,74#,75#,76#,77#,78#,79#,80#,81#,82#,83#,84#,85#,86#,87#,88#,89#,90#,91#,92#,93#,94#,95#,96#,97#,98#,99#,100#,101#,102#,103#,104#,105#,106#,107#,108#,109#,110#,111#,112#,113#,114#,115#,116#,117#,118#,119#,120#,121#,122#,123#,124#,125#,126#,127#,128#,129#,130#,131#,132#,133#,134#,135#,136#,137#,138#,139#,140#,141#,142#,143#,144#,145#,146#,147#,148#,149#,150#,151#,152#,153#,154#,155#,156#,157#,158#,159#,160#,161#,162#,163#,164#,165#,166#,167#,168#,169#,170#,171#,172#,173#,174#,175#,176#,177#,178#,179#,180#,181#,182#,183#,184#,185#,186#,187#,188#,189#,190#,191#,192#,193#,194#,195#,196#,197#,198#,199#,200#,201#,202#,203#,204#,205#,206#,207#,208#,209#,210#,211#,212#,213#,214#,215#,216#,217#,218#,219#,220#,221#,222#,223#,224#,225#,226#,227#,228#,229#,230#,231#,232#,233#,234#,235#,236#,237#,238#,239#,240#,241#,242#,243#,244#,245#,246#,247#,248#,249#,250#,251#,252#,253#,254#,255#",
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 2970541
},
"rgw.main": {
"size": 29584795218033,
"size_actual": 29639674630144,
"size_utilized": 29584795218033,
"size_kb": 28891401581,
"size_kb_actual": 28944994756,
"size_kb_utilized": 28891401581,
"num_objects": 22629845
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
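The per-shard fields (`ver`, `master_ver`, `max_marker`) all use a comma-separated `shard#value` list, so they can be parsed to confirm the current shard count. A Python sketch (hypothetical helpers, not part of Ceph) that does this and divides the index entries by the shard count. Summing `rgw.none` together with `rgw.main` is my assumption about what counts toward the dynamic resharding threshold (`rgw_max_objs_per_shard`), not something I have verified in the code:

```python
def parse_shard_map(value: str) -> dict:
    """Parse an RGW per-shard string like '0#16279,1#16543' into {shard: value}."""
    shards = {}
    for item in value.split(","):
        shard, _, rest = item.partition("#")
        shards[int(shard)] = rest
    return shards


def entries_per_shard(usage: dict, num_shards: int) -> float:
    """Average index entries per shard, summing num_objects over all usage categories."""
    total = sum(category["num_objects"] for category in usage.values())
    return total / num_shards


# Values copied from the bucket stats above (usage trimmed to num_objects).
usage = {
    "rgw.none": {"num_objects": 2970541},
    "rgw.main": {"num_objects": 22629845},
}
# Rebuilds a string with the same shape as the master_ver field above.
master_ver = ",".join(f"{i}#0" for i in range(256))
num_shards = len(parse_shard_map(master_ver))
print(num_shards, entries_per_shard(usage, num_shards))
```

With these numbers it reports 256 shards and roughly 100001.5 entries per shard. As far as I know the default `rgw_max_objs_per_shard` is 100000, so if `rgw.none` entries are counted this might be what triggered the dynamic reshard (worth checking against your own config).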
Canceling the reshard fails:
radosgw-admin reshard cancel --bucket prelabeling
There is ongoing resharding, please retry after 2019-05-29 23:10:48.247 7ffbdddd2640 0 RGWReshardLock::lock failed to acquire lock on prelabeling:4529fe07-c707-4c2d-9ccb-1ecf8154aa23.32005164.1 ret=-16
360 seconds
(Yes, the output is mangled; I suppose "360 seconds" should come before the timestamp...)
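The cancel command asks to retry after 360 seconds. A Python sketch of such a retry loop, assuming a failed cancel prints the "There is ongoing resharding" message shown above (I have not checked its exit code, so the loop checks both). `cancel_with_retry` is a hypothetical helper; `run` and `sleep` are injectable so it can be tried without a cluster. It can only help if something eventually releases the lock:

```python
import re
import subprocess
import time

BUSY_MESSAGE = "There is ongoing resharding"


def cancel_with_retry(bucket, attempts=5, default_wait=360,
                      run=subprocess.run, sleep=time.sleep):
    """Retry `radosgw-admin reshard cancel` while the reshard lock is busy."""
    cmd = ["radosgw-admin", "reshard", "cancel", "--bucket", bucket]
    for attempt in range(1, attempts + 1):
        proc = run(cmd, capture_output=True, text=True)
        output = (proc.stdout or "") + (proc.stderr or "")
        if proc.returncode == 0 and BUSY_MESSAGE not in output:
            return True
        if attempt == attempts:
            break
        # Log lines can be interleaved with the message, so look for
        # "<N> seconds" anywhere in the output instead of parsing it strictly.
        match = re.search(r"(\d+)\s+seconds", output)
        sleep(int(match.group(1)) if match else default_wait)
    return False


# Dry run without a cluster: a fake runner that is busy twice, then succeeds.
class FakeResult:
    def __init__(self, returncode, stdout):
        self.returncode, self.stdout, self.stderr = returncode, stdout, ""


busy = "There is ongoing resharding, please retry after ret=-16\n360 seconds"
responses = iter([FakeResult(0, busy), FakeResult(0, busy), FakeResult(0, "")])
waits = []
ok = cancel_with_retry("prelabeling", run=lambda cmd, **kw: next(responses),
                       sleep=waits.append)
print(ok, waits)
```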
If I try to add a reshard operation manually:
radosgw-admin reshard add --bucket prelabeling --num-shards 257
radosgw-admin reshard process
ERROR: failed to process reshard logs, error=2019-05-29 23:13:34.387 7f6006e45640 0 RGWReshardLock::lock failed to acquire lock on reshard.0000000003 ret=-16(16) Device or resource busy
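The ret=-16 in these lock messages is a negated errno value, which the output above spells out as "(16) Device or resource busy", consistent with the lock being held by someone else. A small Python helper (hypothetical, just for reading logs) that maps any ret= field to its errno name using the standard `errno` and `os` modules; names and messages come from the host C library, so the text may differ outside Linux:

```python
import errno
import os
import re


def decode_ret(log_line: str):
    """Map the `ret=-N` field of an RGW log line to (N, errno name, message)."""
    match = re.search(r"ret=(-?\d+)", log_line)
    if match is None:
        raise ValueError("no ret= field in log line")
    code = abs(int(match.group(1)))  # errors are logged as negated errno values
    return code, errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


line = "RGWReshardLock::lock failed to acquire lock on reshard.0000000003 ret=-16"
print(decode_ret(line))
```

On Linux this prints (16, 'EBUSY', 'Device or resource busy').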
The resharding has been stuck for over 14 hours now.
I tried restarting all RGWs (we have 5), and have now stopped all except one, in the hope that it might be a weird race between them. Nothing changed.
Whatever I do, the main problem seems to be that it is stuck on the RGWReshardLock, and nobody is releasing it...