Bug #46563: Metadata synchronization failed，"metadata is behind on 1 shards" appear - rgw - Ceph

Actions

Copy link

Bug #46563

closed

Metadata synchronization failed，"metadata is behind on 1 shards" appear

Added by geng jichao almost 4 years ago. Updated 5 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

Casey Bodley

Target version:

% Done:

100%

Source:

Development

Tags:

multisite backport_processed

Backport:

octopus pacific quincy

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

Ceph - v12.2.12

ceph-qa-suite:

Pull request ID:

46148

Crash signature (v1):

Crash signature (v2):

Description

I have two clusters,opened multisite.The master cluster is 3,7,13,the secondary cluster is 17,18,19.
Run "radosgs-admin sync status" on the secondary clusters, it print:

[root@TX-97-140-18 ceph]# radosgw-admin sync status
realm e39a7384-6dde-4fd9-aac0-a6eb58ec7f97 (realm_sync)
zonegroup 8cd3544a-6fa3-44d2-af7e-6d46da58f2a4 (zonegroup_sync)
zone a942cd1d-9152-443a-9942-a0de0b519979 (zone_sync2)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is behind on 1 shards
behind shards: [56]
oldest incremental change not applied: 2020-07-15 09:47:55.0.413192s
data sync source: bd57ecf2-e985-4446-b544-44f07fbe0a44 (zone_sync1)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

it looks like something wrong with shard 56, run "radosgw-admin metadata sync status" on the secondary cluster,
_ {
"key": 56,
"val": {
"state": 1,
"marker": "1_1594730086.231149_132.1",
"next_step_marker": "",
"total_entries": 0,
"pos": 0,
"timestamp": "2020-07-14 12:34:46.231149Z",
"realm_epoch": 2
}
},_

than I open the secondary rgw log, the key print is as follows:

2020-07-15 20:32:47.839164 7f9de911f700 1 ====== starting new request req=0x7f9de9118f90 =====
2020-07-15 20:32:48.093578 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093581 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555591c71800:18RGWMetaSyncShardCR: operate()
2020-07-15 20:32:48.093583 7f9effb4c700 20 meta sync: incremental_sync:1722: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 max_marker= sync_marker.marker= period_marker=
2020-07-15 20:32:48.093587 7f9effb4c700 20 meta sync: incremental_sync:1656: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker= period_marker=
2020-07-15 20:32:48.093589 7f9effb4c700 20 meta sync: incremental_sync:1677: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker=
2020-07-15 20:32:48.093595 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093596 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093599 7f9effb4c700 20 enqueued request req=0x555597f34540
2020-07-15 20:32:48.093600 7f9effb4c700 20 RGWWQ:
2020-07-15 20:32:48.093601 7f9effb4c700 20 req: 0x555597f34540
2020-07-15 20:32:48.093605 7f9effb4c700 20 run: stack=0x5555918ba8c0 is io blocked
2020-07-15 20:32:48.093608 7f9f0eb6a700 20 dequeued request req=0x555597f34540
2020-07-15 20:32:48.093611 7f9f0eb6a700 20 RGWWQ: empty
2020-07-15 20:32:48.093925 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093929 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093931 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093932 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093934 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555591c71800:18RGWMetaSyncShardCR: operate()
2020-07-15 20:32:48.093936 7f9effb4c700 20 meta sync: incremental_sync:1722: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 max_marker= sync_marker.marker= period_marker=
2020-07-15 20:32:48.093938 7f9effb4c700 20 meta sync: incremental_sync:1656: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker= period_marker=
2020-07-15 20:32:48.093940 7f9effb4c700 20 meta sync: incremental_sync:1677: shard_id=56 mdlog_marker=1_1594730086.231149_132.1 sync_marker.marker=
2020-07-15 20:32:48.093945 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093946 7f9effb4c700 20 cr:s=0x5555918ba8c0:op=0x555598d78600:21RGWReadMDLogEntriesCR: operate()
2020-07-15 20:32:48.093949 7f9effb4c700 20 enqueued request req=0x555597f34540

it can be seen from, the secondary cluster can not read more log entry from the meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56,

run "rados -p zone_sync2.rgw.log listomapkeys meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the secondary cluster,it print nothing,

run "rados -p zone_sync1.rgw.log listomapkeys meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the master cluster,it print

1_1594777675.413192_15857.1
1_1594777675.417050_15858.1

run "rados -p zone_sync1.rgw.log listomapvals meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" on the master cluster,it print
1_1594777675.413192_15857.1
value (136 bytes) :
00000000 02 01 82 00 00 00 04 00 00 00 75 73 65 72 05 00 |..........user..|
00000010 00 00 75 73 65 72 35 4b 60 0e 5f c5 d1 a0 18 46 |..user5K`.....F|
00000020 00 00 00 01 01 40 00 00 00 01 01 0c 00 00 00 00 |.....@..........|
00000030 00 00 00 00 00 00 00 00 00 00 00 01 01 24 00 00 |.............$..|
00000040 00 01 00 00 00 00 00 00 00 18 00 00 00 5f 4f 74 |............._Ot|
00000050 36 32 6b 49 58 48 75 32 79 78 46 5a 65 63 45 50 |62kIXHu2yxFZecEP|
00000060 6f 52 71 78 51 01 00 00 00 1b 00 00 00 31 5f 31 |oRqxQ........1_1|
00000070 35 39 34 37 37 37 36 37 35 2e 34 31 33 31 39 32 |594777675.413192|
00000080 5f 31 35 38 35 37 2e 31 |_15857.1|
00000088

1_1594777675.417050_15858.1
value (136 bytes) :
00000000 02 01 82 00 00 00 04 00 00 00 75 73 65 72 05 00 |..........user..|
00000010 00 00 75 73 65 72 35 4b 60 0e 5f 34 b1 db 18 46 |..user5K`._4...F|
00000020 00 00 00 01 01 40 00 00 00 01 01 0c 00 00 00 00 |.....@..........|
00000030 00 00 00 00 00 00 00 00 00 00 00 01 01 24 00 00 |.............$..|
00000040 00 01 00 00 00 00 00 00 00 18 00 00 00 5f 4f 74 |............._Ot|
00000050 36 32 6b 49 58 48 75 32 79 78 46 5a 65 63 45 50 |62kIXHu2yxFZecEP|
00000060 6f 52 71 78 51 04 00 00 00 1b 00 00 00 31 5f 31 |oRqxQ........1_1|
00000070 35 39 34 37 37 37 36 37 35 2e 34 31 37 30 35 30 |594777675.417050|
00000080 5f 31 35 38 35 38 2e 31 |_15858.1|
00000088

run "radosgs-admin user list" on secondary cluster,it print
[
"user6",
"user2",
"jrsssys",
"sync-user",
"user3",
"user1",
"user4"
]

run "radosgs-admin user list" on master cluster,it print
[
"user6",
"user2",
"jrsssys",
"sync-user",
"user3",
"user1",
"user4",
"user5"
],_

the secondary clusters miss user5, the master cluster have record the user5`s log entry,but secondary cluster donot fetch the log entry from master cluster,
analysis secondary cluster rgw log, the "meta.log.67cff782-afd4-4391-a194-d6ddbd468bf6.56" shard donot have any log entry, it should
read more log entry from the master cluster,but it donot read,because the max_marker is empty, it can't execute RGWCloneMetaLogCoroutine,
the code is as follows

_if (mdlog_marker <= max_marker) { /* we're at the tip, try to bring more entries */ ldout(sync_env->cct, 20) << __func__ << ":" << __LINE__ << ": shard_id=" << shard_id << " syncing mdlog for shard_id=" << shard_id << dendl; yield call(new RGWCloneMetaLogCoroutine(sync_env, mdlog, period, shard_id, mdlog_marker, &mdlog_marker)); }_

In order to solve this bug, the modified code is as follows
_if (mdlog_marker <= max_marker ||!truncated) { /* we're at the tip, try to bring more entries */ ldout(sync_env->cct, 20) << __func__ << ":" << __LINE__ << ": shard_id=" << shard_id << " syncing mdlog for shard_id=" << shard_id << dendl; yield call(new RGWCloneMetaLogCoroutine(sync_env, mdlog, period, shard_id, mdlog_marker, &mdlog_marker)); }_

when truncated is false, it means that all local log entry have been processed, it should start reading the master cluster log.

Related issues 3 (0 open — 3 closed)

Actions

Copy link

Updated by Casey Bodley almost 4 years ago

Assignee set to Casey Bodley

Actions

Copy link

Updated by Casey Bodley almost 4 years ago

Backport set to nautilus octopus

Actions

Copy link

Updated by Casey Bodley almost 4 years ago

Status changed from New to Fix Under Review
Pull request ID set to 36202

Actions

Copy link

Updated by Christian Rohmann over 2 years ago

I just hit this very issue on two Octopus (15.2.15) clusters running RADOSGWs in multi-site sync.

The PR #36202 is marked as stale and not yet merged.
Is there any chance this issue / the proposed fix could be picked up again?

Actions

Copy link

Updated by Christian Rohmann over 2 years ago

Maybe this is also related / a duplicate of https://tracker.ceph.com/issues/24523?

Actions

Copy link

Updated by geng jichao over 2 years ago

I submitted PR a year ago, but I don't know why no one has reviewed it. Later, it expired and the problem hasn't been closed. This code has been running in our own production environment for a long time, and it works very well。

Actions

Copy link

Updated by Christian Rohmann over 2 years ago

Thanks for getting back to me.

geng jichao wrote:

I submitted PR a year ago, but I don't know why no one has reviewed it. Later, it expired and the problem hasn't been closed. This code has been running in our own production environment for a long time, and it works very well。

Yeah, that's why I replied to the PR (https://github.com/ceph/ceph/pull/36202#issuecomment-961365607) as well.
I was able to recover from the stuck shard by stopping ALL of the RADOSGW instances on secondary and by triggering a radosgw-admin metadata init sync and then restarting them.

Before that I waited for many hours for that one shard to not be behind anymore ... but as I said, I believe I hit your issue.

Actions

Copy link

Updated by Christian Rohmann over 2 years ago

geng jichao wrote:

I submitted PR a year ago, but I don't know why no one has reviewed it. Later, it expired and the problem hasn't been closed. This code has been running in our own production environment for a long time, and it works very well。

See my comment at https://github.com/ceph/ceph/pull/36202#issuecomment-961699377 . I just triggered jenkins about a retest, let's see if anybody is keen enough to take a look. This fix looks really small, would be a shame to not pursue it further, if (still) valid.

What version of Ceph are you currently running? The ticket says 12.2.12, maybe that's why it went out of scope for review.
But I am running 15.2.xx and was still seeing the issue.

Actions

Copy link

Updated by Christian Rohmann about 2 years ago

I now observed a crash of a RADOSGW daemon:

# ceph crash info 2022-02-01T10:00:49.688998Z_d53bca83-cf40-4e07-a5f1-b26a0f0ba9f5                                                                                                                        
{                                                       
    "assert_condition": "prev_iter != pos_to_prev.end()",                                                       
    "assert_file": "/build/ceph-15.2.15/src/rgw/rgw_sync.cc",                                                   
    "assert_func": "void RGWMetaSyncShardCR::collect_children()",                                               
    "assert_line": 1514,                                
    "assert_msg": "/build/ceph-15.2.15/src/rgw/rgw_sync.cc: In function 'void RGWMetaSyncShardCR::collect_children()' thread 7f98c0ff9700 time 2022-02-01T10:00:49.676259+0000\n/build/ceph-15.2.15/src/rgw/rgw_sync.cc: 1514: F
AILED ceph_assert(prev_iter != pos_to_prev.end())\n",   
    "assert_thread_name": "meta-sync",                  
    "backtrace": [                                                                                              
        "(()+0x46210) [0x7f998f050210]",                
        "(gsignal()+0xcb) [0x7f998f05018b]",            
        "(abort()+0x12b) [0x7f998f02f859]",                                                                                                                                                                                     
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b0) [0x7f998625fc3c]",                                                                                                                        
        "(()+0x26ade9) [0x7f998625fde9]",               
        "(RGWMetaSyncShardCR::collect_children()+0x427) [0x7f998f911977]",                                      
        "(RGWMetaSyncShardCR::incremental_sync()+0x9fd) [0x7f998f914f7d]",                                      
        "(RGWMetaSyncShardCR::operate()+0x46) [0x7f998f9172c6]",                                                
        "(RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x67) [0x7f998fa11677]",                               
        "(RGWCoroutinesManager::run(std::__cxx11::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x29a) [0x7f998fa12bca]",                                                                                   
        "(RGWCoroutinesManager::run(RGWCoroutine*)+0x8b) [0x7f998fa147cb]",                                     
        "(RGWRemoteMetaLog::run_sync()+0x148b) [0x7f998f8fea8b]",                                               
        "(RGWMetaSyncProcessorThread::process()+0x11) [0x7f998faffd21]",                                        
        "(RGWRadosThread::Worker::entry()+0x1ad) [0x7f998fad780d]",                                             
        "(()+0x9609) [0x7f9985ec0609]",                 
        "(clone()+0x43) [0x7f998f12c293]"               
    ],                                                  
    "ceph_version": "15.2.15",                          
    "crash_id": "2022-02-01T10:00:49.688998Z_d53bca83-cf40-4e07-a5f1-b26a0f0ba9f5",                             
    "entity_name": "client.rgw.redacted",        
    "os_id": "ubuntu",                                  
    "os_name": "Ubuntu",                                
    "os_version": "20.04.3 LTS (Focal Fossa)",          
    "os_version_id": "20.04",                           
    "process_name": "radosgw",                          
    "stack_sig": "59ff8a76d85ddb403b1bcbd8da016c2ccef42b7f0a3f73b8326218bdd9e80787",                            
    "timestamp": "2022-02-01T10:00:49.688998Z",         
    "utsname_hostname": "REDACTED",              
    "utsname_machine": "x86_64",                        
    "utsname_release": "5.13.0-28-generic",             
    "utsname_sysname": "Linux",                         
    "utsname_version": "#31~20.04.1-Ubuntu SMP Wed Jan 19 14:08:10 UTC 2022"                                    
}

Searching for some context on the code position of the crash I found some pointer to old issues in Luminous https://tracker.ceph.com/issues/20906 and a commit trying to fix it https://github.com/ceph/ceph/pull/17024.

Actions

Copy link

#10

Updated by Christian Rohmann almost 2 years ago

There is a PR supposedly fixing this issue: https://github.com/ceph/ceph/pull/46148

Actions

Copy link

#11

Updated by Casey Bodley almost 2 years ago

Status changed from Fix Under Review to Pending Backport
Backport changed from nautilus octopus to octopus pacific quincy
Pull request ID changed from 36202 to 46148

Actions

Copy link

#12

Updated by Backport Bot almost 2 years ago

Copied to Backport #55702: pacific: Metadata synchronization failed，"metadata is behind on 1 shards" appear added

Actions

Copy link

#13

Updated by Backport Bot almost 2 years ago

Copied to Backport #55703: quincy: Metadata synchronization failed，"metadata is behind on 1 shards" appear added

Actions

Copy link

#14

Updated by Backport Bot almost 2 years ago

Copied to Backport #55704: octopus: Metadata synchronization failed，"metadata is behind on 1 shards" appear added

Actions

Copy link

#15

Updated by Christian Rohmann almost 2 years ago

Casey, are those backports happening for the next point release of e.g. Pacific by any chance? We still run into this issue on (most) of our radosgw restarts and/or node reboots.
I image more "cloud native" deployments like Rook would suffer big time from this and constantly have to init the metadata sync again.

Actions

Copy link

#16

Updated by Backport Bot over 1 year ago

Tags changed from multisite to multisite backport_processed

Actions

Copy link

#17

Updated by Jonathan Bayer about 1 year ago

Is there a plan to backport this fix (https://github.com/ceph/ceph/pull/46148/files) to Quincy? I see that it was merged to master but hasn't been in a release yet. This bug is affecting our cluster as well.

Actions

Copy link

#18