Bug #27219

lock in resharding may expire before the dynamic resharding completes

Added by Jeegn Chen over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Normal
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic, luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Orit Wasserman <> wrote on Sunday, August 19, 2018 at 5:38 PM:

Hi,
On Thu, Aug 16, 2018 at 8:11 PM Yehuda Sadeh-Weinraub <> wrote:

Hi,

I don't remember the exact details right now, but we might be renewing it periodically as reshard happens. Orit?

Looking at the code, we do not renew the lock, and we should. Jeegn, please open a tracker issue for this.
This won't cause corruption: we complete the resharding regardless of the lock, and a new thread will use a different bucket instance, so it won't corrupt this one. It still wastes a lot of system resources for nothing, though. At the moment our default is one thread, so it could only be a different radosgw instance, but it is still possible.

Regards,
Orit
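
A minimal sketch of the periodic renewal described above, assuming an in-tree build; renew_logshard_lock, process_with_renewal, and the half-duration renew interval are hypothetical names and choices, not the actual fix, and it assumes cls_lock allows the current holder to re-issue lock_exclusive() with the same cookie to push out the expiration:

// Hypothetical sketch, not the actual fix. Assumes an in-tree build and
// that cls_lock lets the current holder re-issue lock_exclusive() with
// the same cookie to extend the expiration.
#include <atomic>
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include "cls/lock/cls_lock_client.h"

// Re-arm the lock so its max_secs duration restarts from now.
static int renew_logshard_lock(rados::cls::lock::Lock& l,
                               librados::IoCtx& ioctx,
                               const std::string& logshard_oid,
                               int max_secs)
{
  l.set_duration(utime_t(max_secs, 0));
  return l.lock_exclusive(&ioctx, logshard_oid);
}

// Run a long reshard while renewing the lock at half its duration,
// so it cannot expire mid-flight.
static void process_with_renewal(rados::cls::lock::Lock& l,
                                 librados::IoCtx& ioctx,
                                 const std::string& logshard_oid,
                                 int max_secs,
                                 const std::function<void()>& do_reshard)
{
  std::atomic<bool> done{false};
  std::thread renewer([&] {
    while (!done) {
      std::this_thread::sleep_for(std::chrono::seconds(max_secs / 2));
      if (!done && renew_logshard_lock(l, ioctx, logshard_oid, max_secs) < 0)
        break;  // lost the lock; real code would tell do_reshard to abort
    }
  });
  do_reshard();  // e.g. the loop that calls RGWBucketReshard::execute()
  done = true;
  renewer.join();
}

Renewing at half the duration leaves a full renewal attempt's worth of slack before expiry; a real implementation would also need to propagate a lost lock back into the resharding loop.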

On Wed, Aug 15, 2018, 6:17 AM Jeegn Chen <> wrote:

Hi Yehuda,

I just took a look at the code related to Dynamic Resharding in
Luminous. I'm not sure whether I'm correct, but my impression is that
the Dynamic Resharding logic does not handle buckets with a large
number of objects properly, especially when there are multiple RGW
processes.

My main concern is the short lock expiration. For example, the
expiration time is just 60 seconds in
RGWReshard::process_single_logshard(). If the RGWBucketReshard::execute()
call following the lock acquisition takes a long time on a large
bucket, those 60 seconds will not be enough (for a large enough
bucket, even hours may not be enough). As a result, another Dynamic
Resharding thread in another RGW process may grab the log shard and
work on the same buckets at the same time, which could potentially
cause corruption.

Did I misunderstand something, or is my concern valid?

int RGWReshard::process_single_logshard(int logshard_num) {
  string marker;
  bool truncated = true;

  CephContext *cct = store->ctx();
  int max_entries = 1000;
  int max_secs = 60;           // <<< lock duration: only 60 seconds

  rados::cls::lock::Lock l(reshard_lock_name);

  utime_t time(max_secs, 0);   // <<<
  l.set_duration(time);        // <<< set once, never renewed

  char cookie_buf[COOKIE_LEN + 1];
  gen_rand_alphanumeric(store->ctx(), cookie_buf, sizeof(cookie_buf) - 1);
  cookie_buf[COOKIE_LEN] = '\0';

  l.set_cookie(cookie_buf);

  string logshard_oid;
  get_logshard_oid(logshard_num, &logshard_oid);

  int ret = l.lock_exclusive(&store->reshard_pool_ctx, logshard_oid);

Thanks,
Jeegn
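
To make the expiry race concrete, a minimal sketch of the timeline, assuming an in-tree build and an open IoCtx on the reshard pool; it borrows reshard_lock_name from the snippet above, while demonstrate_expiry_race and the "rgw-a"/"rgw-b" cookies are made up:

// Sketch of the race, assuming an in-tree build and an open IoCtx on
// the reshard pool. "rgw-a"/"rgw-b" stand for two radosgw instances
// with distinct (made-up) lock cookies.
#include <chrono>
#include <string>
#include <thread>
#include "cls/lock/cls_lock_client.h"

static void demonstrate_expiry_race(librados::IoCtx& ioctx,
                                    const std::string& logshard_oid)
{
  rados::cls::lock::Lock a(reshard_lock_name);
  rados::cls::lock::Lock b(reshard_lock_name);
  a.set_duration(utime_t(60, 0));
  a.set_cookie("rgw-a");
  b.set_duration(utime_t(60, 0));
  b.set_cookie("rgw-b");

  int r = a.lock_exclusive(&ioctx, logshard_oid);  // 0: rgw-a takes the lock
  r = b.lock_exclusive(&ioctx, logshard_oid);      // -EBUSY: lock still held
  std::this_thread::sleep_for(std::chrono::seconds(61));
  // rgw-a is still resharding, but its lock has expired...
  r = b.lock_exclusive(&ioctx, logshard_oid);      // 0: rgw-b now works on
                                                   // the same log shard too
}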


Related issues (4: 0 open, 4 closed)

Related to rgw - Bug #24551: RGW Dynamic bucket index resharding keeps resharding all buckets (Resolved, J. Eric Ivancich, 06/18/2018)

Related to rgw - Bug #24937: [rgw] Very high cache misses with automatic bucket resharding (Resolved, J. Eric Ivancich, 07/16/2018)

Copied to rgw - Backport #36687: mimic: lock in resharding may expire before the dynamic resharding completes (Resolved, J. Eric Ivancich)

Copied to rgw - Backport #36688: luminous: lock in resharding may expire before the dynamic resharding completes (Resolved, J. Eric Ivancich)
