Bug #17574: multisite: many duplicate mdlog entries cause race to sync and result in ECANCELED - rgw - Ceph

Actions

Copy link

Bug #17574

open

multisite: many duplicate mdlog entries cause race to sync and result in ECANCELED

Added by Casey Bodley over 7 years ago. Updated about 7 years ago.

Status:

New

Priority:

Normal

Assignee:

Casey Bodley

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Seen in multiple teuthology runs:

http://qa-proxy.ceph.com/teuthology/cbodley-2016-09-16_10:00:36-rgw-wip-cbodley-testing---basic-mira/419161/teuthology.log
- osd.2 is the culprit here, but the log doesn't include 'debug cls' http://qa-proxy.ceph.com/teuthology/cbodley-2016-09-16_10:00:36-rgw-wip-cbodley-testing---basic-mira/419161/remote/mira041/log/ceph-osd.2.log.gz

http://qa-proxy.ceph.com/teuthology/owasserm-2016-09-26_16:09:01-rgw-wip-orit-testing---basic-mira/438706/teuthology.log

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Casey Bodley over 7 years ago

Related to Bug #17465: multisite: coroutine deadlock in RGWMetaSyncCR after ECANCELED errors added

Actions

Copy link

Updated by Abhishek Lekshmanan over 7 years ago

Seeing similar errors in jewel; in delete bucket operations where the OSD returns -ECANCELLED and we return a 500

Actions

Copy link

Updated by Casey Bodley over 7 years ago

I pushed a branch to ceph-qa-suite [1] that sets debug_objclass=20 and scheduled a run of the rgw/verify suite [2].

[1] https://github.com/ceph/ceph-qa-suite/commits/wip-rgw-ecanceled
[2] http://pulpito.ceph.com/cbodley-2016-10-25_11:34:59-rgw:verify-master---basic-mira/

Actions

Copy link

Updated by Casey Bodley over 7 years ago

This issue reproduced in job 494193. The osd log [1] indicates that the majority of ECANCELED errors are coming from 'cls_version: failed condition check'. The osd operations look like '.bucket.meta.test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272 [delete,create 0~0,call version.check_conds,call version.set,writefull 0~277,setxattr user.rgw.acl (185)]' and '[call version.check_conds,call version.read,read 0~524288]'.

An interesting note about the gateway issuing these requests, its log [2] shows many duplicated mdlog entries:

2016-10-26 00:00:28.103066 35c45700 20 meta sync: remote mdlog, shard_id=47 num of shard entries: 70
2016-10-26 00:00:28.103747 35c45700 20 cr:s=0x73da9930:op=0x782087a0:24RGWCloneMetaLogCoroutine: operate()
2016-10-26 00:00:28.103801 35c45700 20 meta sync: operate: shard_id=47: storing mdlog entries
2016-10-26 00:00:28.103840 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272
2016-10-26 00:00:28.103954 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272
2016-10-26 00:00:28.104042 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271
2016-10-26 00:00:28.104122 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271
2016-10-26 00:00:28.104200 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272
2016-10-26 00:00:28.104288 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272
2016-10-26 00:00:28.104376 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272
2016-10-26 00:00:28.104463 35c45700 20 meta sync: entry: name=test-client.0-ba5gfwy6dpmh9u4-271:r0z1.4136.272

... repeated 62 more times ...

The gateway goes on to spawn a bunch of RGWMetaSyncSingleEntryCRs for these, so they're all fetching the bucket instance metadata from the master and racing to store the entry via RGWMetaStoreEntryCR.

[1] http://qa-proxy.ceph.com/teuthology/cbodley-2016-10-25_11:34:59-rgw:verify-master---basic-mira/494193/remote/mira023/log/ceph-osd.2.log.gz
[2] http://qa-proxy.ceph.com/teuthology/cbodley-2016-10-25_11:34:59-rgw:verify-master---basic-mira/494193/remote/mira032/log/rgw.client.1.log.gz

Actions

Copy link

Updated by Yehuda Sadeh over 7 years ago

Priority changed from Normal to High

Actions

Copy link

Updated by Casey Bodley over 7 years ago

Assignee set to Casey Bodley

Actions

Copy link

Updated by Casey Bodley over 7 years ago

Related to Bug #17996: rgw: ECANCELED in rgw_get_system_obj() leads to infinite loop added

Actions

Copy link

Updated by Casey Bodley over 7 years ago

Subject changed from single osd starts failing cls operations with ECANCELED to multisite: many duplicate mdlog entries cause race to sync and result in ECANCELED

Actions

Copy link

Updated by Yehuda Sadeh about 7 years ago

Discussed in bug scrub. We think that what we see there is not necessarily a bug, but s3-tests doing a lot of metadata changes. We can probably have a squash mechanism in the metadata sync similar to the one that we have in the data sync.

Actions

Copy link

#10

Updated by Yehuda Sadeh about 7 years ago

Priority changed from High to Normal

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » rgw

Custom queries

Bug #17574

multisite: many duplicate mdlog entries cause race to sync and result in ECANCELED

Updated by Casey Bodley over 7 years ago

Updated by Abhishek Lekshmanan over 7 years ago

Updated by Casey Bodley over 7 years ago

Updated by Casey Bodley over 7 years ago

Updated by Yehuda Sadeh over 7 years ago

Updated by Casey Bodley over 7 years ago

Updated by Casey Bodley over 7 years ago

Updated by Casey Bodley over 7 years ago

Updated by Yehuda Sadeh about 7 years ago

Updated by Yehuda Sadeh about 7 years ago