Bug #22908

open

[Multisite] Synchronization works only one way (zone2->zone1)

Added by Mariusz Derela about 6 years ago. Updated about 6 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have noticed that synchronization has stopped working for some reason (but not completely - let me explain):

Everything was OK until 31.01.2018:

➜  ~ s3cmd -c zone1 ls  s3://<bucket name>/2018/01/30/20/ | wc -l
18
➜  ~ s3cmd -c zone2 ls  s3://<bucket name>/2018/01/30/20/ | wc -l 
18
➜  ~ 

And after that:

➜  ~ s3cmd -c zone1 ls  s3://<bucket name>/2018/01/30/21/ | wc -l 
18
➜  ~ s3cmd -c zone2 ls  s3://<bucket name>/2018/01/30/21/ | wc -l 
12
➜  ~ 
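
A quick way to list exactly which keys are missing on one side (a rough sketch, assuming the same two s3cmd configs as above and the standard sort/comm tools; bucket name and prefix are placeholders):

# dump the key names for the same prefix from both zones and compare them
s3cmd -c zone1 ls s3://<bucket name>/2018/01/30/21/ | awk '{print $4}' | sort > /tmp/zone1.txt
s3cmd -c zone2 ls s3://<bucket name>/2018/01/30/21/ | awk '{print $4}' | sort > /tmp/zone2.txt
# keys present in zone1 but missing in zone2
comm -23 /tmp/zone1.txt /tmp/zone2.txt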

I include the name of the DC in the filename (showing where the data came from):

Zone1 - master:

2018-01-30 20:15   1757117   s3://<bucket name>/2018/01/30/21/2356202233201122-52891-v1-zone1
2018-01-30 20:16   1755338   s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31   1795243   s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16   1766473   s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31   1835095   s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16   1749582   s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47   1740989   s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31   1841696   s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:47   1713001   s3://<bucket name>/2018/01/30/21/2358276846382849-50000-v1-zone1
2018-01-30 20:31   1792212   s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 21:03   1430706   s3://<bucket name>/2018/01/30/21/2359008017818455-42080-v1-zone1
2018-01-30 20:47   1725195   s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 21:03   1443962   s3://<bucket name>/2018/01/30/21/2359211503351068-41784-v1-zone1
2018-01-30 20:47   1747334   s3://<bucket name>/2018/01/30/21/2359418738089062-52037-v1-zone1
2018-01-30 20:35      2556   s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:03   1425118   s3://<bucket name>/2018/01/30/21/2359956779752022-41868-v1-zone1
2018-01-30 21:03   1431091   s3://<bucket name>/2018/01/30/21/2360352785119795-42209-v1-zone1
2018-01-30 21:20      2564   s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2

Zone2 - secondary:

2018-01-30 20:16   1755338   s3://<bucket name>/2018/01/30/21/2356407377147077-51725-v1-zone1
2018-01-30 20:31   1795243   s3://<bucket name>/2018/01/30/21/2357138004184386-52607-v1-zone1
2018-01-30 20:16   1766473   s3://<bucket name>/2018/01/30/21/2357153301329742-52479-v1-zone1
2018-01-30 20:31   1835095   s3://<bucket name>/2018/01/30/21/2357342194418114-53850-v1-zone1
2018-01-30 20:16   1749582   s3://<bucket name>/2018/01/30/21/2357549767263837-52026-v1-zone1
2018-01-30 20:47   1740989   s3://<bucket name>/2018/01/30/21/2358073001616294-51939-v1-zone1
2018-01-30 20:31   1841696   s3://<bucket name>/2018/01/30/21/2358088303417457-54688-v1-zone1
2018-01-30 20:31   1792212   s3://<bucket name>/2018/01/30/21/2358484311300704-52251-v1-zone1
2018-01-30 20:47   1725195   s3://<bucket name>/2018/01/30/21/2359022892851188-50959-v1-zone1
2018-01-30 20:35      2556   s3://<bucket name>/2018/01/30/21/2359498340525216-8-v1-zone2
2018-01-30 21:20      2564   s3://<bucket name>/2018/01/30/21/2362228740122179-3-v1-zone2

So that means a few files from zone1 are missing in zone2. After that date I am not able to see any files from zone1 in zone2 at all:

zone2:
2018-01-30 21:38      2594   s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:11      2480   s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2

zone1:
2018-01-30 21:15   1525212   s3://<bucket name>/2018/01/30/22/2359792201857100-44183-v1-zone1
2018-01-30 21:15   1581953   s3://<bucket name>/2018/01/30/22/2359995487978309-46588-v1-zone1
2018-01-30 21:31   1459499   s3://<bucket name>/2018/01/30/22/2360726266479292-43200-v1-zone1
2018-01-30 21:15   1529054   s3://<bucket name>/2018/01/30/22/2360740758774808-45008-v1-zone1
2018-01-30 21:31   1483060   s3://<bucket name>/2018/01/30/22/2360929541234751-44088-v1-zone1
2018-01-30 21:15   1528468   s3://<bucket name>/2018/01/30/22/2361136711431588-45084-v1-zone1
2018-01-30 21:47   1322918   s3://<bucket name>/2018/01/30/22/2361661248467302-39156-v1-zone1
2018-01-30 21:31   1459381   s3://<bucket name>/2018/01/30/22/2361674440853750-43447-v1-zone1
2018-01-30 21:47   1330364   s3://<bucket name>/2018/01/30/22/2361863632474932-39708-v1-zone1
2018-01-30 21:31   1447952   s3://<bucket name>/2018/01/30/22/2362070303351222-42168-v1-zone1
2018-01-30 22:02    964967   s3://<bucket name>/2018/01/30/22/2362596483938629-29084-v1-zone1
2018-01-30 21:47   1323983   s3://<bucket name>/2018/01/30/22/2362608066604117-38788-v1-zone1
2018-01-30 22:02   1011242   s3://<bucket name>/2018/01/30/22/2362796736684101-31161-v1-zone1
2018-01-30 21:47   1312808   s3://<bucket name>/2018/01/30/22/2363003556814073-38029-v1-zone1
2018-01-30 21:38      2594   s3://<bucket name>/2018/01/30/22/2363278763714103-12-v1-zone2
2018-01-30 22:03    985933   s3://<bucket name>/2018/01/30/22/2363541027813764-30649-v1-zone1
2018-01-30 22:02   1005303   s3://<bucket name>/2018/01/30/22/2363936616223993-30624-v1-zone1
2018-01-30 22:11      2480   s3://<bucket name>/2018/01/30/22/2365288966899244-3-v1-zone2

The status is a little bit weird:
zone2 -> zone1 = OK
zone1 -> zone2 = NOT OK

If we take a look at the synchronization status:
zone1:

          realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
      zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
           zone 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
  metadata sync no sync (zone is master)
      data sync source: 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards

zone2:

          realm c6055c2e-5ac0-4638-851f-f1051b61d0c2 (platform)
      zonegroup 4134640c-d16b-4166-bbd6-987637da469d (prd)
           zone 6328c6d7-31a5-4d42-8359-1e28689572da (zone2)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 8adfe5fc-65df-4227-9d85-1d0d1e66ac1f (zone1)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        oldest incremental change not applied: 2018-02-03 09:37:03.0.544123s

This is my zone conf:

{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "platform",
    "api_name": "platform",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443" 
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2" 
}

Could someone shed some light on what could be wrong here? Based on the status information it is pretty hard to maintain this environment. I have to count the files on both sites to make sure that everything is OK, because I can't trust the status information.

And another thing - what is the best way to fix it? Should I execute sync init --bucket=<bucket name>?
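
Something like this should give a per-bucket view that is easier to cross-check than the global status (a sketch - I am assuming bucket sync status with a source zone is available in this release; the bucket name is a placeholder):

# per-bucket sync status against a specific source zone, run on the secondary
radosgw-admin bucket sync status --bucket=<bucket name> --source-zone=zone1
# overall view, as shown above
radosgw-admin sync status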

Actions #1

Updated by Mariusz Derela about 6 years ago

The information about being 1 shard behind is OK on its own. My ingestion into S3 is quite heavy, so this message appears from time to time and after a few minutes the status reports everything as up to date again (but only in the status... not in reality).

In the logs I can't see anything special, only some errors related to the mdlog (can't find directory):

2018-02-03 10:00:08.289419 7fd06a0f5700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
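
To dig further into those mdlog errors, the current period and the metadata log state can be compared on both sides (a sketch - these are the subcommands as I understand them, so treat the exact invocations as an assumption):

# current realm period and epoch
radosgw-admin period get
# state of the metadata log shards; run on master and secondary and compare
radosgw-admin mdlog status
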
Actions #2

Updated by Mariusz Derela about 6 years ago

I made one mistake in the zone config above (I had tried to "mask" a few fields). This is the proper config:

{
    "id": "4134640c-d16b-4166-bbd6-987637da469d",
    "name": "prd",
    "api_name": "prd",
    "is_master": "true",
    "endpoints": [
        "https://<URL>:443" 
    ],
    "hostnames": [],
    "hostnames_s3website": [],
    "master_zone": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
    "zones": [
        {
            "id": "6328c6d7-31a5-4d42-8359-1e28689572da",
            "name": "zone2",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        },
        {
            "id": "8adfe5fc-65df-4227-9d85-1d0d1e66ac1f",
            "name": "zone1",
            "endpoints": [
                "https://<URL>:443" 
            ],
            "log_meta": "false",
            "log_data": "true",
            "bucket_index_max_shards": 0,
            "read_only": "false",
            "tier_type": "",
            "sync_from_all": "true",
            "sync_from": []
        }
    ],
    "placement_targets": [
        {
            "name": "default-placement",
            "tags": []
        }
    ],
    "default_placement": "default-placement",
    "realm_id": "c6055c2e-5ac0-4638-851f-f1051b61d0c2" 
}
Actions #3

Updated by Orit Wasserman about 6 years ago

Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.

Actions #4

Updated by Mariusz Derela about 6 years ago

Orit Wasserman wrote:

Can you provide the output of the sync error list commands?
EBUSY errors are expected to happen; look for other errors.

Hi, thanks for the reply.

radosgw-admin metadata sync error list | grep 'message":' | sort | uniq -c
     13                     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable" 
  29967                     "message": "failed to sync bucket instance: (16) Device or resource busy" 
     62                     "message": "failed to sync bucket instance: (5) Input/output error" 
   1958                     "message": "failed to sync object" 

radosgw-admin data sync error list | grep 'message":' | sort | uniq -c
     13                     "message": "failed to sync bucket instance: (11) Resource temporarily unavailable" 
  29967                     "message": "failed to sync bucket instance: (16) Device or resource busy" 
     62                     "message": "failed to sync bucket instance: (5) Input/output error" 
   1958                     "message": "failed to sync object" 

The input/output errors are probably from restarting my RGW. The "failed to sync object" errors were my mistake: in our previous flow (ingestion into S3) there was a small issue related to "dotfiles" (first creating dot files like ".test" and then renaming them to "test").
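
To see which buckets/objects those entries actually point at (and to clear the old noise once the root cause is fixed), something like this should do - a sketch, assuming the error entries carry the bucket/object name in a "name" field:

# group the failing entries by name to see which buckets/objects are affected
radosgw-admin sync error list | grep '"name":' | sort | uniq -c | sort -rn | head
# once the underlying problem is resolved, the old entries can be trimmed
radosgw-admin sync error trim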

Right now we have a different issue. I ran "data sync init" on one of the buckets and after that we got this:


 radosgw-admin metadata sync status --source-zone=zone2
{
    "sync_status": {
        "info": {
            "status": "init",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}

 radosgw-admin metadata sync status --source-zone=zone1
{
    "sync_status": {
        "info": {
            "status": "building-full-sync-maps",
            "num_shards": 0,
            "period": "",
            "realm_epoch": 0
        },
        "markers": []
    },
    "full_sync": {
        "total": 0,
        "complete": 0
    }
}

And when we start RGW on zone2:

2018-02-18 12:21:35.894747 7fe6320a3e00  0 deferred set uid:gid to 167:167 (ceph:ceph)
2018-02-18 12:21:35.895358 7fe6320a3e00  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 716929
2018-02-18 12:21:36.219298 7fe6320a3e00  0 starting handler: civetweb
2018-02-18 12:21:36.245297 7fe6320a3e00  1 mgrc service_daemon_register rgw.node19 metadata {arch=x86_64,ceph_version=ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable),cpu=Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz,distro=rhel,distro_description=Red Hat Enterprise Linux Server 7.4 (Maipo),distro_version=7.4,frontend_config#0=civetweb port=443s ssl_certificate=/etc/pki/tls/ca.pem,frontend_type#0=civetweb,hostname=node19,kernel_description=#1 SMP Fri Oct 13 10:46:25 EDT 2017,kernel_version=3.10.0-693.5.2.el7.x86_64,mem_swap_kb=12582904,mem_total_kb=12139612,num_handles=1,os=Linux,pid=716929,zone_id=6328c6d7-31a5-4d42-8359-1e28689572da,zone_name=zone2,zonegroup_id=4134640c-d16b-4166-bbd6-987637da469d,zonegroup_name=prd}
2018-02-18 12:21:36.385918 7fe61aa68700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
2018-02-18 12:21:36.385952 7fe61aa68700  1 meta sync: ERROR: failed to read mdlog info with (2) No such file or directory
(...)
2018-02-18 12:21:36.497936 7fe5fd988700  1 ====== starting new request req=0x7fe5fd982190 =====
2018-02-18 12:21:36.560816 7fe60623f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fe60623f700 thread_name:data-sync

 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
 1: (()+0x20a9c1) [0x56540cc809c1]
 2: (()+0xf5e0) [0x7fe630dd25e0]
 3: (RGWListBucketIndexesCR::operate()+0xd3b) [0x56540cf19afb]
 4: (RGWCoroutinesStack::operate(RGWCoroutinesEnv*)+0x7e) [0x56540cd0edae]
 5: (RGWCoroutinesManager::run(std::list<RGWCoroutinesStack*, std::allocator<RGWCoroutinesStack*> >&)+0x3eb) [0x56540cd1174b]
 6: (RGWCoroutinesManager::run(RGWCoroutine*)+0x70) [0x56540cd12490]
 7: (RGWRemoteDataLog::run_sync(int)+0xe4) [0x56540cf010c4]
 8: (RGWDataSyncProcessorThread::process()+0x46) [0x56540cdc9d76]
 9: (RGWRadosThread::Worker::entry()+0x123) [0x56540cd630e3]
 10: (()+0x7e25) [0x7fe630dcae25]
 11: (clone()+0x6d) [0x7fe62595f34d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000

--- begin dump of recent events ---
 -1754> 2018-02-18 12:21:35.880791 7fe6320a3e00  5 asok(0x56540e3e81c0) register_command perfcounters_dump hook 0x56540e39c060

I can only start RGW on zone2 when I block the connection to zone1 with iptables.
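
(For completeness, the temporary block is just an outbound iptables rule on the zone2 gateway - a sketch, with 192.0.2.10:443 standing in for zone1's endpoint:)

# block outgoing traffic from the zone2 gateway to the zone1 endpoint (placeholder address)
iptables -A OUTPUT -d 192.0.2.10 -p tcp --dport 443 -j REJECT
# remove the rule again afterwards
iptables -D OUTPUT -d 192.0.2.10 -p tcp --dport 443 -j REJECT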

Actions #5

Updated by Mariusz Derela about 6 years ago

One mistake - the second output above is actually the result of: radosgw-admin data sync status --source-zone=zone1, not metadata sync status.

Actions #6

Updated by Abhishek Lekshmanan about 6 years ago

A fix for this went into 12.2.3 (https://github.com/ceph/ceph/pull/19071); can you reproduce this on the latest Luminous?

Actions #7

Updated by Mariusz Derela about 6 years ago

Abhishek Lekshmanan wrote:

A fix for this went into 12.2.3 (https://github.com/ceph/ceph/pull/19071); can you reproduce this on the latest Luminous?

The problem seems to be solved (no more segmentation faults) after upgrading to 12.2.4. Synchronization is still not progressing, though - probably the cluster is trying to synchronize metadata.

Actions #8

Updated by Yehuda Sadeh about 6 years ago

Now that the old bug is solved, try to re-sync missed objects by restarting sync on the specific buckets using:

$ radosgw-admin bucket sync disable --bucket=<bucket>
$ radosgw-admin bucket sync enable --bucket=<bucket>
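
A quick way to confirm the re-sync afterwards (a sketch; <bucket> and <prefix> are placeholders, counts compared as earlier in this ticket):

$ radosgw-admin sync status
$ s3cmd -c zone1 ls s3://<bucket>/<prefix>/ | wc -l
$ s3cmd -c zone2 ls s3://<bucket>/<prefix>/ | wc -l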

Actions #9

Updated by Yehuda Sadeh about 6 years ago

  • Status changed from New to Need More Info
Actions #10

Updated by Yehuda Sadeh about 6 years ago

@mariusz did that solve it for you?
