Bug #22908 - [Multisite] Synchronization works only one way (zone2->zone1)
Status: open
Description
I have noticed that synchronization has stopped working for some reason (but not fully - let me explain).
Everything was OK until 31.01.2018:
➜ ~ s3cmd -c zone1 ls s3://<bucket name>/2018/01/30/20/ | wc -l
18
➜ ~ s3cmd -c zone2 ls s3://<bucket name>/2018/01/30/20/ | wc -l
18
➜ ~
And after that:
➜ ~ s3cmd -c zone1 ls s3://<bucket name>/2018/01/30/21/ | wc -l
18
➜ ~ s3cmd -c zone2 ls s3://<bucket name>/2018/01/30/21/ | wc -l
12
➜ ~
I have the name of the DC (where the data came from) in the filename:
Zone1 - master:
2018-01-30 20:15 1757117 s3://<bucket name>/2018/01/30/21/2356202233201122-52891-v1-zone1
2018-01-30 20:16 1755338 s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31 1795243 s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16 1766473 s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31 1835095 s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16 1749582 s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47 1740989 s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31 1841696 s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:47 1713001 s3://<bucket name>/2018/01/30/21/2358276846382849-50000-v1-zone1
2018-01-30 20:31 1792212 s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 21:03 1430706 s3://<bucket name>/2018/01/30/21/2359008017818455-42080-v1-zone1
2018-01-30 20:47 1725195 s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 21:03 1443962 s3://<bucket name>/2018/01/30/21/2359211503351068-41784-v1-zone1
2018-01-30 20:47 1747334 s3://<bucket name>/2018/01/30/21/2359418738089062-52037-v1-zone1
2018-01-30 20:35 2556 s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:03 1425118 s3://<bucket name>/2018/01/30/21/2359956779752022-41868-v1-zone1
2018-01-30 21:03 1431091 s3://<bucket name>/2018/01/30/21/2360352785119795-42209-v1-zone1
2018-01-30 21:20 2564 s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2
Zone2 - secondary:
2018-01-30 20:16 1755338 s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31 1795243 s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16 1766473 s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31 1835095 s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16 1749582 s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47 1740989 s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31 1841696 s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:31 1792212 s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 20:47 1725195 s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 20:35 2556 s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:20 2564 s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2
So that means a few files from zone1 are missing in zone2. After that date I am not able to see any files from zone1 in zone2 at all:
zone2:
2018-01-30 21:38 2594 s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:11 2480 s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2
zone1:
2018-01-30 21:15 1525212 s3://<bucket name>/2018/01/30/22/2359792201857100-44183-v1-zone1
2018-01-30 21:15 1581953 s3://<bucket name>/2018/01/30/22/2359995487978309-46588-v1-zone1
2018-01-30 21:31 1459499 s3://<bucket name>/2018/01/30/22/2360726266479292-43200-v1-zone1
2018-01-30 21:15 1529054 s3://<bucket name>/2018/01/30/22/2360740758774808-45008-v1-zone1
2018-01-30 21:31 1483060 s3://<bucket name>/2018/01/30/22/2360929541234751-44088-v1-zone1
2018-01-30 21:15 1528468 s3://<bucket name>/2018/01/30/22/2361136711431588-45084-v1-zone1
2018-01-30 21:47 1322918 s3://<bucket name>/2018/01/30/22/2361661248467302-39156-v1-zone1
2018-01-30 21:31 1459381 s3://<bucket name>/2018/01/30/22/2361674440853750-43447-v1-zone1
2018-01-30 21:47 1330364 s3://<bucket name>/2018/01/30/22/2361863632474932-39708-v1-zone1
2018-01-30 21:31 1447952 s3://<bucket name>/2018/01/30/22/2362070303351222-42168-v1-zone1
2018-01-30 22:02 964967 s3://<bucket name>/2018/01/30/22/2362596483938629-29084-v1-zone1
2018-01-30 21:47 1323983 s3://<bucket name>/2018/01/30/22/2362608066604117-38788-v1-zone1
2018-01-30 22:02 1011242 s3://<bucket name>/2018/01/30/22/2362796736684101-31161-v1-zone1
2018-01-30 21:47 1312808 s3://<bucket name>/2018/01/30/22/2363003556814073-38029-v1-zone1
2018-01-30 21:38 2594 s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:03 985933 s3://<bucket name>/2018/01/30/22/2363541027813764-30649-v1-zone1
2018-01-30 22:02 1005303 s3://<bucket name>/2018/01/30/22/2363936616223993-30624-v1-zone1
2018-01-30 22:11 2480 s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2
The status is a little bit weird:
zone2 -> zone1 = OK
zone1 -> zone2 = NOT OK
If we take a look at the synchronization status:
zone1:
realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
zone 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
metadata sync no sync (zone is master)
data sync source: 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
zone2:
realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
zone 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
oldest incremental change not applied: 2018-02-03 09:37:03.0.544123s
This is my zone conf:
{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "platform",
    "api_name": "platform",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2"
}
Could someone shed some light on what can be wrong here? Based on the status information it is pretty hard to maintain this env. I have to count the files on both sites to make sure that everything is OK, because I can't believe the status information.
And another thing - what is the best way to solve it? Should I execute sync init --bucket=<bucket name>?
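(Side note on the manual counting: comparing zone listings can be sketched in a few lines. The following Python is illustrative only and not part of the ticket - `missing_keys` is a hypothetical helper, and the sample keys are made up; in practice the lists would come from capturing `s3cmd ls` output per zone.)

```python
def missing_keys(primary, secondary):
    """Return object keys present in the primary zone but absent from the secondary."""
    return sorted(set(primary) - set(secondary))


if __name__ == "__main__":
    # Hypothetical captured listings for the same prefix in each zone.
    zone1 = [
        "2018/01/30/21/aaa-v1-zone1",
        "2018/01/30/21/bbb-v1-zone1",
        "2018/01/30/21/ccc-v1-zone2",
    ]
    zone2 = [
        "2018/01/30/21/bbb-v1-zone1",
        "2018/01/30/21/ccc-v1-zone2",
    ]
    # Keys that never made it from zone1 to zone2.
    print(missing_keys(zone1, zone2))
```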
Updated by Mariusz Derela about 6 years ago
That information about being 1 shard behind - that is OK. My ingestion to S3 is quite big, and this info appears sometimes; after a few minutes it is up to date again (but only in the status... not really).
In the logs I can't see anything special. Some errors related to the mdlog (can't find directory):
2018-02-03 10:00:08.289419 7fd06a0f5700 1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
Updated by Mariusz Derela about 6 years ago
I made one mistake in the zone config (I had tried to "mask" a few fields). This is the proper config:
{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "prd",
    "api_name": "prd",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443"
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone1",
            "endpoints": [
                "https://<URL>:443"
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2"
}
Updated by Orit Wasserman about 6 years ago
Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.
Updated by Mariusz Derela about 6 years ago
Orit Wasserman wrote:
Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.
Hi, thanks for the reply.
radosgw-admin metadata sync error list | grep 'message":' | sort | uniq -c
     13     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable"
  29967     "message": "failed to sync bucket instance: (16) Device or resource busy"
     62     "message": "failed to sync bucket instance: (5) Input/output error"
   1958     "message": "failed to sync object"
radosgw-admin data sync error list | grep 'message":' | sort | uniq -c
     13     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable"
  29967     "message": "failed to sync bucket instance: (16) Device or resource busy"
     62     "message": "failed to sync bucket instance: (5) Input/output error"
   1958     "message": "failed to sync object"
That input/output error is probably from restarting my RGW. The "failed to sync object" errors were my mistake: in our previous flow (ingestion to S3) there was a small issue related to "dotfiles" (first creating dot files like ".test" and then renaming them to "test").
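(The grep/sort/uniq tally above can also be done directly on the JSON that the error list commands emit. A minimal sketch, assuming a simplified flat list of entries that each carry a "message" field - the real `radosgw-admin` output nests entries per shard, so the extraction step would differ:)

```python
import json
from collections import Counter


def tally_messages(raw_json):
    """Count occurrences of each error message in a JSON array of entries.

    Assumption: each entry is a flat object with a "message" key; real
    sync-error-list output is nested per shard and would need unwrapping.
    """
    entries = json.loads(raw_json)
    return Counter(e["message"] for e in entries)


if __name__ == "__main__":
    # Illustrative sample, not real tracker data.
    sample = json.dumps([
        {"message": "failed to sync object"},
        {"message": "failed to sync bucket instance: (16) Device or resource busy"},
        {"message": "failed to sync object"},
    ])
    for msg, count in tally_messages(sample).most_common():
        print(count, msg)
```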
Right now we have a different issue. I ran "data init" on one of the buckets, and after that we got this:
radosgw-admin metadata sync status --source-zone=zone2
{
    "sync_status": {
        "info": {
            "status": "init",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}
radosgw-admin metadata sync status --source-zone=zone1
{
    "sync_status": {
        "info": {
            "status": "building-full-sync-maps",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}
And when we start the RGW on zone2:
2018-02-18 12:21:35.894747 7fe6320a3e00 0 deferred set uid:gid to 167:167 (ceph:ceph)
2018-02-18 12:21:35.895358 7fe6320a3e00 0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 716929
2018-02-18 12:21:36.219298 7fe6320a3e00 0 starting handler: civetweb
2018-02-18 12:21:36.245297 7fe6320a3e00 1 mgrc service_daemon_register rgw.node19 metadata {arch=x86_64,ceph_version=ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),cpu=Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz,distro=rhel,distro_description=Red Hat Enterprise Linux Server 7.4 (Maipo),distro_version=7.4,frontend_config#0=civetweb port=443s ssl_certificate=/etc/pki/tls/ca.pem,frontend_type#0=civetweb,hostname=node19,kernel_description=#1 SMP Fri Oct 13 10:46:25 EDT 2017,kernel_version=3.10.0-693.5.2.el7.x86_64,mem_swap_kb=12582904,mem_total_kb=12139612,num_handles=1,os=Linux,pid=716929,zone_id=6328c6d7-31a5-4d42-8359-1e28689572da,zone_name=zone2,zonegroup_id=4134640c-d16b-4166-bbd6-987637da469d,zonegroup_name=prd}
2018-02-18 12:21:36.385918 7fe61aa68700 1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
2018-02-18 12:21:36.385952 7fe61aa68700 1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
(...)
2018-02-18 12:21:36.497936 7fe5fd988700 1 ====== starting new request req=0x7fe5fd982190 =====
2018-02-18 12:21:36.560816 7fe60623f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fe60623f700 thread_name:data-sync
 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0x20a9c1) [0x56540cc809c1]
 2: (()+0xf5e0) [0x7fe630dd25e0]
 3: (RGWListBucketIndexesCR::operate()+0xd3b) [0x56540cf19afb]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x56540cd0edae]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3eb) [0x56540cd1174b]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x56540cd12490]
 7: (RGWRemoteDataLog::run_sync(int)+0xe4) [0x56540cf010c4]
 8: (RGWDataSyncProcessorThread::process()+0x46) [0x56540cdc9d76]
 9: (RGWRadosThread::Worker::entry()+0x123) [0x56540cd630e3]
 10: (()+0x7e25) [0x7fe630dcae25]
 11: (clone()+0x6d) [0x7fe62595f34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I can only start the RGW on zone2 when I block the connection to zone1 with iptables.
Updated by Mariusz Derela about 6 years ago
One correction - the output above is the result of radosgw-admin data sync status --source-zone=zone1, not metadata.
Updated by Abhishek Lekshmanan about 6 years ago
There was a fix for this that went into 12.2.3 (https://github.com/ceph/ceph/pull/19071). Can you reproduce this on the latest Luminous?
Updated by Mariusz Derela about 6 years ago
Abhishek Lekshmanan wrote:
There was a fix for this that went into 12.2.3 (https://github.com/ceph/ceph/pull/19071). Can you reproduce this on the latest Luminous?
The problem seems to be solved (no more segmentation faults) after upgrading to 12.2.4. Synchronization is still not progressing, though - probably the cluster is trying to synchronize metadata.
Updated by Yehuda Sadeh about 6 years ago
Now that the old bug is solved, try to re-sync the missed objects by restarting sync on the specific buckets using:
$ radosgw-admin bucket sync disable --bucket=<bucket>
$ radosgw-admin bucket sync enable --bucket=<bucket>
Updated by Yehuda Sadeh about 6 years ago
- Status changed from New to Need More Info