Bug #27219

lock in resharding may expire before the dynamic resharding completes

Added by Jeegn Chen over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Orit Wasserman <> wrote on Sunday, August 19, 2018 at 5:38 PM:

Hi,
On Thu, Aug 16, 2018 at 8:11 PM Yehuda Sadeh-Weinraub <> wrote:

Hi,

I don't remember the exact details right now, but we might be renewing it periodically as reshard happens. Orit?

Looking at the code, we do not renew the lock, and we should. Jeegn, please open a tracker issue for this.
This won't cause corruption: we complete the resharding regardless of the lock, and a new thread will use a different bucket instance, so it won't corrupt this one. It still wastes a lot of system resources for nothing, though. At the moment our default is one thread, so it could only be a different radosgw instance, but it is still possible.

Regards,
Orit
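
A minimal sketch of the periodic renewal described above, assuming an in-tree build; renew_logshard_lock, process_with_renewal, and the half-duration renew interval are hypothetical names and choices, not the actual fix, and it assumes cls_lock allows the current holder to re-issue lock_exclusive() with the same cookie to push out the expiration:

// Hypothetical sketch, not the actual fix. Assumes an in-tree build and
// that cls_lock lets the current holder re-issue lock_exclusive() with
// the same cookie to extend the expiration.
#include <atomic>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include "cls/lock/cls_lock_client.h"

// Re-arm the lock so its max_secs duration restarts from now.
static int renew_logshard_lock(rados::cls::lock::Lock& l,
                               librados::IoCtx& ioctx,
                               const std::string& logshard_oid,
                               int max_secs)
{
  l.set_duration(utime_t(max_secs, 0));
  return l.lock_exclusive(&ioctx, logshard_oid);
}

// Run a long reshard while renewing the lock at half its duration,
// so it cannot expire mid-flight.
static void process_with_renewal(rados::cls::lock::Lock& l,
                                 librados::IoCtx& ioctx,
                                 const std::string& logshard_oid,
                                 int max_secs,
                                 const std::function<void()>& do_reshard)
{
  std::atomic<bool> done{false};
  std::thread renewer([&] {
    while (!done) {
      std::this_thread::sleep_for(std::chrono::seconds(max_secs / 2));
      if (!done && renew_logshard_lock(l, ioctx, logshard_oid, max_secs) < 0)
        break;  // lost the lock; real code would tell do_reshard to abort
    }
  });
  do_reshard();  // e.g. the loop that calls RGWBucketReshard::execute()
  done = true;
  renewer.join();
}

Renewing at half the duration leaves a full renewal attempt's worth of slack before expiry; a real implementation would also need to propagate a lost lock back into the resharding loop.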

On Wed, Aug 15, 2018, 6:17 AM Jeegn Chen <> wrote:

Hi Yehuda,

I just took a look at the code related to Dynamic Resharding in
Luminous. I'm not sure whether I'm correct, but my impression is that
the Dynamic Resharding logic does not handle buckets with a large
number of objects properly, especially when there are multiple RGW
processes.

My main concern is the short lock expiration. For example, the
expiration time is just 60 seconds in
RGWReshard::process_single_logshard(). If the RGWBucketReshard::execute()
call following the lock acquisition takes a long time on a large
bucket, those 60 seconds will not be enough (for a large enough
bucket, even hours may not be enough). As a result, another Dynamic
Resharding thread in another RGW process may grab the log shard and
work on the same buckets at the same time, which could potentially
cause corruption.

Did I misunderstand something, or is my concern valid?

int RGWReshard::process_single_logshard(int logshard_num) {
  string marker;
  bool truncated = true;

  CephContext *cct = store->ctx();
  int max_entries = 1000;
  int max_secs = 60;           // <<< lock duration: only 60 seconds

  rados::cls::lock::Lock l(reshard_lock_name);

  utime_t time(max_secs, 0);   // <<<
  l.set_duration(time);        // <<< set once, never renewed

  char cookie_buf[COOKIE_LEN + 1];
  gen_rand_alphanumeric(store->ctx(), cookie_buf, sizeof(cookie_buf) - 1);
  cookie_buf[COOKIE_LEN] = '\0';

  l.set_cookie(cookie_buf);

  string logshard_oid;
  get_logshard_oid(logshard_num, &logshard_oid);

  int ret = l.lock_exclusive(&store->reshard_pool_ctx, logshard_oid);

Thanks,
Jeegn
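
To make the expiry race concrete, a minimal sketch of the timeline, assuming an in-tree build and an open IoCtx on the reshard pool; it borrows reshard_lock_name from the snippet above, while demonstrate_expiry_race and the "rgw-a"/"rgw-b" cookies are made up:

// Sketch of the race, assuming an in-tree build and an open IoCtx on
// the reshard pool. "rgw-a"/"rgw-b" stand for two radosgw instances
// with distinct (made-up) lock cookies.
#include <chrono>
#include <string>
#include <thread>
#include "cls/lock/cls_lock_client.h"

static void demonstrate_expiry_race(librados::IoCtx& ioctx,
                                    const std::string& logshard_oid)
{
  rados::cls::lock::Lock a(reshard_lock_name);
  rados::cls::lock::Lock b(reshard_lock_name);
  a.set_duration(utime_t(60, 0));
  a.set_cookie("rgw-a");
  b.set_duration(utime_t(60, 0));
  b.set_cookie("rgw-b");

  int r = a.lock_exclusive(&ioctx, logshard_oid);  // 0: rgw-a takes the lock
  r = b.lock_exclusive(&ioctx, logshard_oid);      // -EBUSY: lock still held
  std::this_thread::sleep_for(std::chrono::seconds(61));
  // rgw-a is still resharding, but its lock has expired...
  r = b.lock_exclusive(&ioctx, logshard_oid);      // 0: rgw-b now works on
                                                   // the same log shard too
}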


Related issues (4: 0 open, 4 closed)

Related to rgw - Bug #24551: RGW Dynamic bucket index resharding keeps resharding all buckets (Resolved, J. Eric Ivancich, 06/18/2018)

Related to rgw - Bug #24937: [rgw] Very high cache misses with automatic bucket resharding (Resolved, J. Eric Ivancich, 07/16/2018)

Copied to rgw - Backport #36687: mimic: lock in resharding may expire before the dynamic resharding completes (Resolved, J. Eric Ivancich)

Copied to rgw - Backport #36688: luminous: lock in resharding may expire before the dynamic resharding completes (Resolved, J. Eric Ivancich)
