Bug #46563

closed

Metadata synchronization failed,"metadata is behind on 1 shards" appear

Added by geng jichao almost 4 years ago. Updated 6 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
% Done:
100%

Source:
Development
Tags:
multisite backport_processed
Backport:
octopus pacific quincy
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have two clusters with multisite enabled. The master cluster is 3,7,13 and the secondary cluster is 17,18,19.
Running "radosgw-admin sync status" on the secondary cluster prints:

[root@TX-97-140-18 ceph]# radosgw-admin sync status
realm e39a7384-6dde-4fd9-aac0-a6eb58ec7f97 (realm_sync)
zonegroup 8cd3544a-6fa3-44d2-af7e-6d46da58f2a4 (zonegroup_sync)
zone a942cd1d-9152-443a-9942-a0de0b519979 (zone_sync2)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is behind on 1 shards
behind shards: [56]
oldest incremental change not applied: 2020-07-15 09:47:55.0.413192s
data sync source: bd57ecf2-e985-4446-b544-44f07fbe0a44 (zone_sync1)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

It looks like something is wrong with shard 56. Running "radosgw-admin metadata sync status" on the secondary cluster shows the following entry for that shard:

{
    "key": 56,
    "val": {
        "state": 1,
        "marker": "1_1594730086.231149_132.1",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2020-07-14 12:34:46.231149Z",
        "realm_epoch": 2
    }
},

Then I looked at the secondary RGW log; the relevant output is as follows:

2020-07-15 20:32:47.839164 7f9de911f700 1 ====== starting new request req=0x7f9de9118f90 =====
2020-07-15 20:32:48.093578 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093581 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555591c71800:18RGWMetaSyncShardCR: operate()
2020-07-15 20:32:48.093583 7f9effb4c700 20 meta sync: incremental_sync:1722: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 max_marker= sync_marker.marker= period_marker=
2020-07-15 20:32:48.093587 7f9effb4c700 20 meta sync: incremental_sync:1656: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker= period_marker=
2020-07-15 20:32:48.093589 7f9effb4c700 20 meta sync: incremental_sync:1677: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker=
2020-07-15 20:32:48.093595 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093596 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093599 7f9effb4c700 20 enqueued request req=0x555597f34540
2020-07-15 20:32:48.093600 7f9effb4c700 20 RGWWQ:
2020-07-15 20:32:48.093601 7f9effb4c700 20 req: 0x555597f34540
2020-07-15 20:32:48.093605 7f9effb4c700 20 run: stack=0x5555918ba8c0 is io blocked
2020-07-15 20:32:48.093608 7f9f0eb6a700 20 dequeued request req=0x555597f34540
2020-07-15 20:32:48.093611 7f9f0eb6a700 20 RGWWQ: empty
2020-07-15 20:32:48.093925 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093929 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093931 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093932 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093934 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555591c71800:18RGWMetaSyncShardCR: operate()
2020-07-15 20:32:48.093936 7f9effb4c700 20 meta sync: incremental_sync:1722: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 max_marker= sync_marker.marker= period_marker=
2020-07-15 20:32:48.093938 7f9effb4c700 20 meta sync: incremental_sync:1656: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker= period_marker=
2020-07-15 20:32:48.093940 7f9effb4c700 20 meta sync: incremental_sync:1677: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker=
2020-07-15 20:32:48.093945 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093946 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093949 7f9effb4c700 20 enqueued request req=0x555597f34540

From this it can be seen that the secondary cluster cannot read any more log entries from meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56.

run "rados -p zone_sync2.rgw.log listomapkeys meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the secondary cluster,it print nothing,

run "rados -p zone_sync1.rgw.log listomapkeys meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the master cluster,it print

1_1594777675.413192_15857.1
1_1594777675.417050_15858.1
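
The first numeric field of these markers looks like a Unix timestamp in seconds (an assumption about the marker format, not something confirmed in this report). A quick standalone check shows it matches the "oldest incremental change not applied: 2020-07-15 09:47:55.0.413192s" reported by sync status (01:47:55 UTC is 09:47:55 in UTC+8, and the fractional part .413192 matches as well), so these appear to be exactly the entries the secondary never applied:

#include <cstdio>
#include <ctime>

int main() {
  // First numeric field of "1_1594777675.413192_15857.1", read as seconds since the epoch.
  std::time_t t = 1594777675;
  char buf[64];
  std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S UTC", std::gmtime(&t));
  std::printf("%s\n", buf);  // prints: 2020-07-15 01:47:55 UTC
  return 0;
}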

run "rados -p zone_sync1.rgw.log listomapvals meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the master cluster,it print
1_1594777675.413192_15857.1
value (136 bytes) :
00000000 02 01 82 00 00 00 04 00 00 00 75 73 65 72 05 00 |..........user..|
00000010 00 00 75 73 65 72 35 4b 60 0e 5f c5 d1 a0 18 46 |..user5K`._....F|
00000020 00 00 00 01 01 40 00 00 00 01 01 0c 00 00 00 00 |.....@..........|
00000030 00 00 00 00 00 00 00 00 00 00 00 01 01 24 00 00 |.............$..|
00000040 00 01 00 00 00 00 00 00 00 18 00 00 00 5f 4f 74 |............._Ot|
00000050 36 32 6b 49 58 48 75 32 79 78 46 5a 65 63 45 50 |62kIXHu2yxFZecEP|
00000060 6f 52 71 78 51 01 00 00 00 1b 00 00 00 31 5f 31 |oRqxQ........1_1|
00000070 35 39 34 37 37 37 36 37 35 2e 34 31 33 31 39 32 |594777675.413192|
00000080 5f 31 35 38 35 37 2e 31 |_15857.1|
00000088

1_1594777675.417050_15858.1
value (136 bytes) :
00000000 02 01 82 00 00 00 04 00 00 00 75 73 65 72 05 00 |..........user..|
00000010 00 00 75 73 65 72 35 4b 60 0e 5f 34 b1 db 18 46 |..user5K`._4...F|
00000020 00 00 00 01 01 40 00 00 00 01 01 0c 00 00 00 00 |.....@..........|
00000030 00 00 00 00 00 00 00 00 00 00 00 01 01 24 00 00 |.............$..|
00000040 00 01 00 00 00 00 00 00 00 18 00 00 00 5f 4f 74 |............._Ot|
00000050 36 32 6b 49 58 48 75 32 79 78 46 5a 65 63 45 50 |62kIXHu2yxFZecEP|
00000060 6f 52 71 78 51 04 00 00 00 1b 00 00 00 31 5f 31 |oRqxQ........1_1|
00000070 35 39 34 37 37 37 36 37 35 2e 34 31 37 30 35 30 |594777675.417050|
00000080 5f 31 35 38 35 38 2e 31 |_15858.1|
00000088

run "radosgs-admin user list" on secondary cluster,it print
[
"user6",
"user2",
"jrsssys",
"sync-user",
"user3",
"user1",
"user4"
]


run "radosgs-admin user list" on master cluster,it print
[
"user6",
"user2",
"jrsssys",
"sync-user",
"user3",
"user1",
"user4",
"user5"
]

The secondary cluster is missing user5. The master cluster has recorded user5's log entry, but the secondary cluster does not fetch that entry from the master cluster.
Analyzing the secondary cluster's RGW log: the "meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" shard does not have any log entries, so the secondary should read more log entries from the master cluster. It does not, because max_marker is empty, so RGWCloneMetaLogCoroutine is never executed.
The code is as follows:

if (mdlog_marker <= max_marker) {
  /* we're at the tip, try to bring more entries */
  ldout(sync_env->cct, 20) << __func__ << ":" << __LINE__ << ": shard_id=" << shard_id << " syncing mdlog for shard_id=" << shard_id << dendl;
  yield call(new RGWCloneMetaLogCoroutine(sync_env, mdlog,
                                          period, shard_id,
                                          mdlog_marker, &mdlog_marker));
}

To fix this bug, the modified code is as follows:

if (mdlog_marker <= max_marker || !truncated) {
  /* we're at the tip, try to bring more entries */
  ldout(sync_env->cct, 20) << __func__ << ":" << __LINE__ << ": shard_id=" << shard_id << " syncing mdlog for shard_id=" << shard_id << dendl;
  yield call(new RGWCloneMetaLogCoroutine(sync_env, mdlog,
                                          period, shard_id,
                                          mdlog_marker, &mdlog_marker));
}

When truncated is false, it means that all local log entries have been processed, so the shard should start reading the master cluster's log.
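
For illustration, here is a minimal standalone sketch (not the Ceph code itself; the marker values are taken from the shard 56 log lines above) showing why the original condition never lets the shard fetch from the master in this state, while the patched condition does:

#include <iostream>
#include <string>

int main() {
  // Values observed in the shard 56 log lines above.
  std::string mdlog_marker = "1_1594730086.231149_132.1";  // position of the local mdlog copy
  std::string max_marker;   // empty: RGWReadMDLogEntriesCR returned nothing
  bool truncated = false;   // the local listing is exhausted

  // Original condition: false here, so RGWCloneMetaLogCoroutine is never called
  // and no new entries are ever pulled from the master zone.
  bool original_condition = (mdlog_marker <= max_marker);

  // Patched condition: also fetch from the master when the local listing is exhausted.
  bool patched_condition = (mdlog_marker <= max_marker || !truncated);

  std::cout << std::boolalpha
            << "original condition fires: " << original_condition << "\n"
            << "patched condition fires:  " << patched_condition << "\n";
  return 0;
}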


Related issues: 3 (0 open, 3 closed)

Copied to rgw - Backport #55702: pacific: Metadata synchronization failed,"metadata is behind on 1 shards" appear (Rejected, Casey Bodley)
Copied to rgw - Backport #55703: quincy: Metadata synchronization failed,"metadata is behind on 1 shards" appear (Resolved, Casey Bodley)
Copied to rgw - Backport #55704: octopus: Metadata synchronization failed,"metadata is behind on 1 shards" appear (Rejected)