Bug #22908

open

[Multisite] Synchronization works only one way (zone2->zone1)

Added by Mariusz Derela about 6 years ago. Updated about 6 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have noticed that synchronization has stopped working for some reason (but not completely - let me explain):

Everything was OK until 31.01.2018:

➜  ~ s3cmd -c zone1 ls  s3://<bucket name>/2018/01/30/20/ | wc -l
18
➜  ~ s3cmd -c zone2 ls  s3://<bucket name>/2018/01/30/20/ | wc -l 
18
➜  ~ 

And after that:

➜  ~ s3cmd -c zone1 ls  s3://<bucket name>/2018/01/30/21/ | wc -l 
18
➜  ~ s3cmd -c zone2 ls  s3://<bucket name>/2018/01/30/21/ | wc -l 
12
➜  ~ 
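
A quick way to list exactly which keys are missing on one side (a rough sketch, assuming the same two s3cmd configs as above and the standard sort/comm tools; bucket name and prefix are placeholders):

# dump the key names for the same prefix from both zones and compare them
s3cmd -c zone1 ls s3://<bucket name>/2018/01/30/21/ | awk '{print $4}' | sort > /tmp/zone1.txt
s3cmd -c zone2 ls s3://<bucket name>/2018/01/30/21/ | awk '{print $4}' | sort > /tmp/zone2.txt
# keys present in zone1 but missing in zone2
comm -23 /tmp/zone1.txt /tmp/zone2.txt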

I include the name of the DC in the filename (showing where the data came from):

Zone1 - master:

2018-01-30 20:15   1757117   s3://<bucket name>/2018/01/30/21/2356202233201122-52891-v1-zone1
2018-01-30 20:16   1755338   s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31   1795243   s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16   1766473   s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31   1835095   s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16   1749582   s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47   1740989   s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31   1841696   s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:47   1713001   s3://<bucket name>/2018/01/30/21/2358276846382849-50000-v1-zone1
2018-01-30 20:31   1792212   s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 21:03   1430706   s3://<bucket name>/2018/01/30/21/2359008017818455-42080-v1-zone1
2018-01-30 20:47   1725195   s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 21:03   1443962   s3://<bucket name>/2018/01/30/21/2359211503351068-41784-v1-zone1
2018-01-30 20:47   1747334   s3://<bucket name>/2018/01/30/21/2359418738089062-52037-v1-zone1
2018-01-30 20:35      2556   s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:03   1425118   s3://<bucket name>/2018/01/30/21/2359956779752022-41868-v1-zone1
2018-01-30 21:03   1431091   s3://<bucket name>/2018/01/30/21/2360352785119795-42209-v1-zone1
2018-01-30 21:20      2564   s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2

Zone2 - secondary:

2018-01-30 20:16   1755338   s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31   1795243   s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16   1766473   s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31   1835095   s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16   1749582   s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47   1740989   s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31   1841696   s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:31   1792212   s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 20:47   1725195   s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 20:35      2556   s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:20      2564   s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2

So that means a few files from zone1 are missing in zone2. After that date I am not able to see any files from zone1 in zone2 at all:

zone2:
2018-01-30 21:38      2594   s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:11      2480   s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2

zone1:
2018-01-30 21:15   1525212   s3://<bucket name>/2018/01/30/22/2359792201857100-44183-v1-zone1
2018-01-30 21:15   1581953   s3://<bucket name>/2018/01/30/22/2359995487978309-46588-v1-zone1
2018-01-30 21:31   1459499   s3://<bucket name>/2018/01/30/22/2360726266479292-43200-v1-zone1
2018-01-30 21:15   1529054   s3://<bucket name>/2018/01/30/22/2360740758774808-45008-v1-zone1
2018-01-30 21:31   1483060   s3://<bucket name>/2018/01/30/22/2360929541234751-44088-v1-zone1
2018-01-30 21:15   1528468   s3://<bucket name>/2018/01/30/22/2361136711431588-45084-v1-zone1
2018-01-30 21:47   1322918   s3://<bucket name>/2018/01/30/22/2361661248467302-39156-v1-zone1
2018-01-30 21:31   1459381   s3://<bucket name>/2018/01/30/22/2361674440853750-43447-v1-zone1
2018-01-30 21:47   1330364   s3://<bucket name>/2018/01/30/22/2361863632474932-39708-v1-zone1
2018-01-30 21:31   1447952   s3://<bucket name>/2018/01/30/22/2362070303351222-42168-v1-zone1
2018-01-30 22:02    964967   s3://<bucket name>/2018/01/30/22/2362596483938629-29084-v1-zone1
2018-01-30 21:47   1323983   s3://<bucket name>/2018/01/30/22/2362608066604117-38788-v1-zone1
2018-01-30 22:02   1011242   s3://<bucket name>/2018/01/30/22/2362796736684101-31161-v1-zone1
2018-01-30 21:47   1312808   s3://<bucket name>/2018/01/30/22/2363003556814073-38029-v1-zone1
2018-01-30 21:38      2594   s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:03    985933   s3://<bucket name>/2018/01/30/22/2363541027813764-30649-v1-zone1
2018-01-30 22:02   1005303   s3://<bucket name>/2018/01/30/22/2363936616223993-30624-v1-zone1
2018-01-30 22:11      2480   s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2

The status is a little bit weird:
zone2 -> zone1 = OK
zone1 -> zone2 = NOT OK

If we take a look at the synchronization status:
zone1:

          realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
      zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
           zone 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
  metadata sync no sync (zone is master)
      data sync source: 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards

zone2:

          realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
      zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
           zone 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2018-02-03 09:37:03.0.544123s

This is my zone conf:

{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "platform",
    "api_name": "platform",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443" 
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2" 
}

Could someone shed some light on what could be wrong here? Based on the status information it is pretty hard to maintain this environment. I have to count the files on both sites to make sure that everything is OK, because I can't trust the status information.

And another thing - what is the best way to fix it? Should I execute sync init --bucket=<bucket name>?
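
Something like this should give a per-bucket view that is easier to cross-check than the global status (a sketch - I am assuming bucket sync status with a source zone is available in this release; the bucket name is a placeholder):

# per-bucket sync status against a specific source zone, run on the secondary
radosgw-admin bucket sync status --bucket=<bucket name> --source-zone=zone1
# overall view, as shown above
radosgw-admin sync status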

Actions #1

Updated by Mariusz Derela about 6 years ago

The information about being 1 shard behind is OK on its own. My ingestion into S3 is quite heavy, so this message appears from time to time and after a few minutes the status reports everything as up to date again (but only in the status... not in reality).

In the logs I can't see anything special, only some errors related to the mdlog (can't find directory):

2018-02-03 10:00:08.289419 7fd06a0f5700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
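
To dig further into those mdlog errors, the current period and the metadata log state can be compared on both sides (a sketch - these are the subcommands as I understand them, so treat the exact invocations as an assumption):

# current realm period and epoch
radosgw-admin period get
# state of the metadata log shards; run on master and secondary and compare
radosgw-admin mdlog status
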
Actions #2

Updated by Mariusz Derela about 6 years ago

I made one mistake in the zone config above (I had tried to "mask" a few fields). This is the proper config:

{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "prd",
    "api_name": "prd",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443" 
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone1",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2" 
}
Actions #3

Updated by Orit Wasserman about 6 years ago

Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.

Actions #4

Updated by Mariusz Derela about 6 years ago

Orit Wasserman wrote:

Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.

Hi, thanks for the reply.

radosgw-admin metadata sync error list | grep 'message":' | sort | uniq -c
     13                     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable" 
  29967                     "message": "failed to sync bucket instance: (16) Device or resource busy" 
     62                     "message": "failed to sync bucket instance: (5) Input/output error" 
   1958                     "message": "failed to sync object" 

radosgw-admin data sync error list | grep 'message":' | sort | uniq -c
     13                     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable" 
  29967                     "message": "failed to sync bucket instance: (16) Device or resource busy" 
     62                     "message": "failed to sync bucket instance: (5) Input/output error" 
   1958                     "message": "failed to sync object" 

The input/output errors are probably from restarting my RGW. The "failed to sync object" errors were my mistake: in our previous flow (ingestion into S3) there was a small issue related to "dotfiles" (first creating dot files like ".test" and then renaming them to "test").
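
To see which buckets/objects those entries actually point at (and to clear the old noise once the root cause is fixed), something like this should do - a sketch, assuming the error entries carry the bucket/object name in a "name" field:

# group the failing entries by name to see which buckets/objects are affected
radosgw-admin sync error list | grep '"name":' | sort | uniq -c | sort -rn | head
# once the underlying problem is resolved, the old entries can be trimmed
radosgw-admin sync error trim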

Right now we have a different issue. I ran "data sync init" on one of the buckets and after that we got this:


 radosgw-admin metadata sync status --source-zone=zone2
{
    "sync_status": {
        "info": {
            "status": "init",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}

 radosgw-admin metadata sync status --source-zone=zone1
{
    "sync_status": {
        "info": {
            "status": "building-full-sync-maps",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}

And when we start RGW on zone2:

2018-02-18 12:21:35.894747 7fe6320a3e00  0 deferred set uid:gid to 167:167 (ceph:ceph)
2018-02-18 12:21:35.895358 7fe6320a3e00  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 716929
2018-02-18 12:21:36.219298 7fe6320a3e00  0 starting handler: civetweb
2018-02-18 12:21:36.245297 7fe6320a3e00  1 mgrc service_daemon_register rgw.node19 metadata {arch=x86_64,ceph_version=ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),cpu=Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz,distro=rhel,distro_description=Red Hat Enterprise Linux Server 7.4 (Maipo),distro_version=7.4,frontend_config#0=civetweb port=443s ssl_certificate=/etc/pki/tls/ca.pem,frontend_type#0=civetweb,hostname=node19,kernel_description=#1 SMP Fri Oct 13 10:46:25 EDT 2017,kernel_version=3.10.0-693.5.2.el7.x86_64,mem_swap_kb=12582904,mem_total_kb=12139612,num_handles=1,os=Linux,pid=716929,zone_id=6328c6d7-31a5-4d42-8359-1e28689572da,zone_name=zone2,zonegroup_id=4134640c-d16b-4166-bbd6-987637da469d,zonegroup_name=prd}
2018-02-18 12:21:36.385918 7fe61aa68700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
2018-02-18 12:21:36.385952 7fe61aa68700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
(...)
2018-02-18 12:21:36.497936 7fe5fd988700  1 ====== starting new request req=0x7fe5fd982190 =====
2018-02-18 12:21:36.560816 7fe60623f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fe60623f700 thread_name:data-sync

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0x20a9c1) [0x56540cc809c1]
 2: (()+0xf5e0) [0x7fe630dd25e0]
 3: (RGWListBucketIndexesCR::operate()+0xd3b) [0x56540cf19afb]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x56540cd0edae]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3eb) [0x56540cd1174b]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x56540cd12490]
 7: (RGWRemoteDataLog::run_sync(int)+0xe4) [0x56540cf010c4]
 8: (RGWDataSyncProcessorThread::process()+0x46) [0x56540cdc9d76]
 9: (RGWRadosThread::Worker::entry()+0x123) [0x56540cd630e3]
 10: (()+0x7e25) [0x7fe630dcae25]
 11: (clone()+0x6d) [0x7fe62595f34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000

--- begin dump of recent events ---
 -1754> 2018-02-18 12:21:35.880791 7fe6320a3e00  5 asok(0x56540e3e81c0) register_command perfcounters_dump hook 0x56540e39c060

I can only start RGW on zone2 when I block the connection to zone1 with iptables.
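
(For completeness, the temporary block is just an outbound iptables rule on the zone2 gateway - a sketch, with 192.0.2.10:443 standing in for zone1's endpoint:)

# block outgoing traffic from the zone2 gateway to the zone1 endpoint (placeholder address)
iptables -A OUTPUT -d 192.0.2.10 -p tcp --dport 443 -j REJECT
# remove the rule again afterwards
iptables -D OUTPUT -d 192.0.2.10 -p tcp --dport 443 -j REJECT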

Actions #5

Updated by Mariusz Derela about 6 years ago

One mistake - the second output above is actually the result of: radosgw-admin data sync status --source-zone=zone1, not metadata sync status.

Actions #6

Updated by Abhishek Lekshmanan about 6 years ago

A fix for this went into 12.2.3 (https://github.com/ceph/ceph/pull/19071); can you reproduce this on the latest Luminous?

Actions #7

Updated by Mariusz Derela about 6 years ago

Abhishek Lekshmanan wrote:

A fix for this went into 12.2.3 (https://github.com/ceph/ceph/pull/19071); can you reproduce this on the latest Luminous?

The problem seems to be solved (no more segmentation faults) after upgrading to 12.2.4. Synchronization is still not progressing, though - probably the cluster is trying to synchronize metadata.

Actions #8

Updated by Yehuda Sadeh about 6 years ago

Now that the old bug is solved, try to re-sync missed objects by restarting sync on the specific buckets using:

$ radosgw-admin bucket sync disable --bucket=<bucket>
$ radosgw-admin bucket sync enable --bucket=<bucket>
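
A quick way to confirm the re-sync afterwards (a sketch; <bucket> and <prefix> are placeholders, counts compared as earlier in this ticket):

$ radosgw-admin sync status
$ s3cmd -c zone1 ls s3://<bucket>/<prefix>/ | wc -l
$ s3cmd -c zone2 ls s3://<bucket>/<prefix>/ | wc -l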

Actions #9

Updated by Yehuda Sadeh about 6 years ago

  • Status changed from New to Need More Info
Actions #10

Updated by Yehuda Sadeh about 6 years ago

@mariusz did that solve it for you?
