Project

General

Profile

Bug #12182

Data asynchronous sync failed in federated gateway

Added by Wen-Dwo Hwang Hwang over 8 years ago. Updated about 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I followed the instructions of Federated Configuration (http://ceph.com/docs/master/radosgw/federated-config/) to build single region and two zones environment, which us-east is master zone and us-west is secondary zone. I have configured all settings successfully. In the final step, when the synchronization agent, 'radosgw-agent -c region-data-sync.conf' is running, I saw the agent was synchronizing shards of data.

After some testing by using 's3cmd', I found only metadata is replicated to secondary zone which is checked by using :
'rados -p .us-east.rgw.buckets.index ls' for master zone
'rados -p .us-west.rgw.buckets.index ls' for secondary zone.
The data portion was missing which is checked by using :
'rados -p .us-east.rgw.buckets ls' for master zone
'rados -p .us-west.rgw.buckets ls' for secondary zone.

I found the error message in the logging file of the 'radosgw-agent', version 1.1, that showed 'unhashable type 'list''. So I modified the /usr/share/pyshared/radosgw_agent/worker.py to add 'tuple' according to the version 1.2.2. After modified completely, then the another error message '*HttpError: Http error code 409 content {"Code":"BucketAlreadyExists"}*' occurred in the logging file.

Any help would be much appreciated.

The packages information as follows:
OS Ubuntu 14.04 AMD64
Apache2 2.4.7-1ubuntu4.4
Ceph 0.80.9-0ubuntu0.14.04.2
Ceph-deploy 1.4.0-0ubuntu1
Radosgw 0.80.9-0ubuntu0.14.04.2
Radosgw-agent 1.1-0ubuntu1

master_radosgw-sync.log View - logging file of radosgw-agent on master zone cluster (23 KB) Wen-Dwo Hwang Hwang, 07/01/2015 03:23 AM

secondary_radosgw.log View - logging file of /etc/init.d/radosgw for secondary zone cluster (55.6 KB) Wen-Dwo Hwang Hwang, 07/01/2015 03:29 AM

radosgw-sync.log View (20 KB) Wen-Dwo Hwang Hwang, 07/17/2015 02:04 AM

ceph_access.log View (5.98 KB) Wen-Dwo Hwang Hwang, 07/17/2015 02:04 AM

radosgw.node1-ms20.log View (31.2 KB) Wen-Dwo Hwang Hwang, 07/17/2015 03:04 AM

History

#1 Updated by John Spray over 8 years ago

  • Project changed from Linux kernel client to rgw

#2 Updated by Wen-Dwo Hwang Hwang over 8 years ago

By the way, I forgot to mention another error message 'ERROR:radosgw_agent.worker:failed to sync object BUCKET1/us.json: state is error' also occurred in the logging file.

#3 Updated by Yehuda Sadeh over 8 years ago

  • Assignee set to Alfredo Deza

#4 Updated by Alfredo Deza over 8 years ago

  • Status changed from New to Need More Info

Wen-Dwo can you make sure you are using the latest version of the agent? It is currently at 1.2.2 and that can be installed from the latest versions
of the Ceph repositories.

#5 Updated by Wen-Dwo Hwang Hwang over 8 years ago

I have removed the radosgw-agent of version 1.1 from the admin node and re-install the version 1.2.2 from 'radosgw-agent_1.2.2trusty_all.deb' file. The packages information as follows:
$ dpkg -l | grep radosgw
ii radosgw 0.80.9-0ubuntu0.14.04.2 amd64 REST gateway for RADOS distributed object store
ii radosgw-agent 1.2.2trusty all Rados gateway agents.

I tried to test data asynchronous sync again. I created the bucket named BASKET1 and put a file 'tw.json' in it.
$ s3cmd ls s3://BASKET1
2015-07-01 02:33 617 s3://BASKET1/tw.json

On master zone cluster admin node, I used the 'rados' command to check the pools' content:
$ sudo rados -p .rgw.buckets.index ls
.dir.tw-east.6930.1
$ sudo rados -p .rgw.buckets ls
tw-east.6930.1_tw.json

On secondary zone cluster admin node, the checked results as follows:
$ sudo rados -p .rgw.buckets.index ls
.dir.tw-east.6930.1
$ sudo rados -p .rgw.buckets ls

I found the following error message on the console. And I attached the partial logging file in this post as '*master_radosgw-sync.log*'.
2015-07-01 10:34:38,648 13396 [radosgw_agent.worker][INFO ] ************************************************************************
2015-07-01 10:34:38,648 13396 [radosgw_agent.worker][INFO ] syncing bucket "BASKET1"
2015-07-01 10:34:46,386 13396 [radosgw_agent.worker][WARNIN] encountered an error during sync: Http error code 500 content Internal Server Error
2015-07-01 10:34:46,398 13396 [radosgw_agent.worker][ERROR ] failed to sync object BASKET1/tw.json: state is error
2015-07-01 10:34:46,399 13396 [radosgw_agent.worker][WARNIN] will retry sync of failed object at next incremental sync
2015-07-01 10:34:54,085 13396 [radosgw_agent.worker][WARNIN] encountered an error during sync: Http error code 500 content Internal Server Error
2015-07-01 10:34:54,101 13396 [radosgw_agent.worker][ERROR ] failed to sync object BASKET1/tw.json: state is error
2015-07-01 10:34:54,101 13396 [radosgw_agent.worker][WARNIN] will retry sync of failed object at next incremental sync
2015-07-01 10:34:54,102 13396 [radosgw_agent.worker][INFO ] synced 0 objects
2015-07-01 10:34:54,102 13396 [radosgw_agent.worker][INFO ] completed syncing bucket "BASKET1"
2015-07-01 10:34:54,103 13396 [radosgw_agent.worker][INFO ] ************************************************************************

At the same time, I found the radosgw service is down by Caught signal. I attached the partial logging file in this post as '*secondary_radosgw.log*'.

Any help would be much appreciated.

#8 Updated by Wen-Dwo Hwang Hwang over 8 years ago

Hi:
I have update all environment's packages to date. The packages information as follows:
OS Ubuntu 14.04 AMD64
Apache2 2.4.7-1ubuntu4.4
Ceph 0.94.2-1trusty
Ceph-deploy 1.5.25trusty
Radosgw 0.94.2-1trusty
Radosgw-agent 1.2.3trusty

According my test results, only the pool of 'buckets.index' will replicate to secondary zone. The pool of 'buckets' still fail. When the master zone issue HTTP PUT command to secondary zone, I received 500 response from secondary zone. Some pieces of logging as follows:

(1) Issue HTTP PUT command from master zone (please reference radosgw-sync.log)
application/json; charset=UTF-8
Thu, 16 Jul 2015 10:15:11 GMT
x-amz-copy-source:BUCKET1/tw.json
/BUCKET1/tw.json
2015-07-16 18:15:11,506 2621 [boto][DEBUG ] Signature:
AWS tw_access_key:PM51wrFPVqvlzF1wZZDi3VCPZfQ=
2015-07-16 18:15:11,507 2621 [boto][DEBUG ] url = 'http://10.35.64.71/BUCKET1/tw.json'
params={'rgwx-op-id': 'node1:2611:1', 'rgwx-source-zone': u'tw-east', 'rgwx-client-id': 'radosgw-agent'}
headers={'Content-Length': '0', 'User-Agent': 'Boto/2.20.1 Python/2.7.6 Linux/3.13.0-57-generic', 'x-amz-copy-source': 'BUCKET1/tw.json', 'Date': 'Thu, 16 Jul 2015 10:1
5:11 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Authorization': 'AWS tw_access_key:PM51wrFPVqvlzF1wZZDi3VCPZfQ='}
data=None
2015-07-16 18:15:11,507 2621 [boto][DEBUG ] Method: PUT
2015-07-16 18:15:11,507 2621 [boto][DEBUG ] Path: /BUCKET1/tw.json?rgwx-op-id=node1%3A2611%3A1&rgwx-source-zone=tw-east&rgwx-client-id=radosgw-agent
2015-07-16 18:15:11,508 2621 [boto][DEBUG ] Data:
2015-07-16 18:15:11,508 2621 [boto][DEBUG ] Headers: {'Content-Type': 'application/json; charset=UTF-8', 'x-amz-copy-source': 'BUCKET1/tw.json'}
2015-07-16 18:15:11,509 2621 [boto][DEBUG ] Host: 10.35.64.71
2015-07-16 18:15:11,509 2621 [boto][DEBUG ] Port: 80
2015-07-16 18:15:11,510 2621 [boto][DEBUG ] Params: {'rgwx-op-id': 'node1%3A2611%3A1', 'rgwx-source-zone': 'tw-east', 'rgwx-client-id': 'radosgw-agent'}
2015-07-16 18:15:11,510 2621 [boto][DEBUG ] establishing HTTP connection: kwargs={'port': 80, 'timeout': 70}
2015-07-16 18:15:11,510 2621 [boto][DEBUG ] Token: None
2015-07-16 18:15:11,511 2621 [boto][DEBUG ] StringToSign:
PUT

(2) Receive 500 HTTP response from secondary zone (please reference ceph_access.log)
10.35.64.81 - - [16/Jul/2015:18:15:11 +0800] "PUT /BUCKET1/tw.json?rgwx-op-id=node1%3A2611%3A1&rgwx-source-zone=tw-east&rgwx-client-id=radosgw-agent HTTP/1.1" 500 276 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-57-generic"

Any help would be much appreciated.

#9 Updated by Wen-Dwo Hwang Hwang over 8 years ago

I copy the error logs from '/var/log/ceph/radosgw.node1.log' of secondary zone and paste here.

2015-07-16 18:15:12.433565 7f92d576a700 1 ====== starting new request req=0x7f932c066900 =====
2015-07-16 18:15:12.443404 7f92d576a700 1 ====== req done req=0x7f932c066900 http_status=200 ======
2015-07-16 18:15:12.765953 7f92dcf79700 0 > HTTP_DATE -> Thu Jul 16 10:15:12 2015
2015-07-16 18:15:12.765974 7f92dcf79700 0 > HTTP_X_AMZ_COPY_SOURCE -> BUCKET1/tw.json
2015-07-16 18:15:12.781711 7f92dcf79700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-07-16 18:15:12.781911 7f92dcf79700 1 ====== req done req=0x7f932c05caa0 http_status=500 ======
2015-07-16 18:15:12.913417 7f92eef9d700 1 ====== starting new request req=0x7f932c06ffc0 =====
2015-07-16 18:15:12.913787 7f92eef9d700 0 WARNING: couldn't find acl header for bucket, generating default
2015-07-16 18:15:14.288756 7f92eef9d700 0 > HTTP_DATE -> Thu Jul 16 10:15:14 2015
2015-07-16 18:15:14.288762 7f92eef9d700 0 > HTTP_X_AMZ_COPY_SOURCE -> BUCKET1/tw.json
2015-07-16 18:15:14.371828 7f92eef9d700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-07-16 18:15:14.371981 7f92eef9d700 1 ====== req done req=0x7f932c06ffc0 http_status=500 ======
2015-07-16 18:15:15.216671 7f92e7f8f700 1 ====== starting new request req=0x7f932c078ac0 =====
2015-07-16 18:15:15.217054 7f92e7f8f700 0 WARNING: couldn't find acl header for bucket, generating default
2015-07-16 18:15:15.277913 7f92e7f8f700 0 > HTTP_DATE -> Thu Jul 16 10:15:15 2015
2015-07-16 18:15:15.277921 7f92e7f8f700 0 > HTTP_X_AMZ_COPY_SOURCE -> BUCKET1/tw.json
2015-07-16 18:15:16.996784 7f92e7f8f700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-07-16 18:15:16.997290 7f92e7f8f700 1 ====== req done req=0x7f932c078ac0 http_status=500 ======
2015-07-16 18:15:20.397705 7f92c674c700 1 ====== starting new request req=0x7f932c0548f0 =====
2015-07-16 18:15:20.398114 7f92c674c700 0 WARNING: couldn't find acl header for bucket, generating default
2015-07-16 18:15:20.409736 7f92c674c700 0 > HTTP_DATE -> Thu Jul 16 10:15:20 2015
2015-07-16 18:15:20.409744 7f92c674c700 0 > HTTP_X_AMZ_COPY_SOURCE -> BUCKET1/tw.json
2015-07-16 18:15:20.420328 7f92c674c700 0 WARNING: set_req_state_err err_no=5 resorting to 500

#10 Updated by Wen-Dwo Hwang Hwang over 8 years ago

I add the option of 'debug ms = 1' and re-run again. I copy the error logs from '/var/log/ceph/radosgw.node1.log' of secondary zone and paste here.

2015-07-17 10:47:59.596399 7fc26df3b700 1 ====== starting new request req=0x7fc2ec0704f0 =====
2015-07-17 10:47:59.596812 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.71:6800/1727 -- osd_op(client.64105.0:742 BUCKET1 [call version.read,getxattrs,stat] 11
.a928b0d3 ack+read+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07bef0 con 0x7fc2e4009660
2015-07-17 10:47:59.598245 7fc2f01f2700 1 -- 10.35.64.71:0/1003845 <== osd.0 10.35.64.71:6800/1727 84 ==== osd_op_reply(742 BUCKET1 [call,getxattrs,stat] v0'0 uv1 ondi
sk = 0) v6 ==== 258+0+139 (4008103728 0 835450928) 0x7fc2d0091230 con 0x7fc2e4009660
2015-07-17 10:47:59.598458 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.71:6800/1727 -- osd_op(client.64105.0:743 BUCKET1 [call version.check_conds,call versio
n.read,read 0~524288] 11.a928b0d3 ack+read+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07ada0 con 0x7fc2e4009660
2015-07-17 10:47:59.599709 7fc2f01f2700 1 -- 10.35.64.71:0/1003845 <== osd.0 10.35.64.71:6800/1727 85 ==== osd_op_reply(743 BUCKET1 [call,call,read 0~123] v0'0 uv1 ond
isk = 0) v6 ==== 258+0+171 (2422035699 0 942473720) 0x7fc2d0091230 con 0x7fc2e4009660
2015-07-17 10:47:59.599912 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.71:6800/1727 -- osd_op(client.64105.0:744 .bucket.meta.BUCKET1:tw-east.24117.1 [call ve
rsion.read,getxattrs,stat] 11.45d3e176 ack+read+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07cbd0 con 0x7fc2e4009660
2015-07-17 10:47:59.601200 7fc2f01f2700 1 -- 10.35.64.71:0/1003845 <== osd.0 10.35.64.71:6800/1727 86 ==== osd_op_reply(744 .bucket.meta.BUCKET1:tw-east.24117.1 [call,
getxattrs,stat] v0'0 uv1 ondisk = 0) v6 ==== 287+0+322 (788713943 0 3995807945) 0x7fc2d0091950 con 0x7fc2e4009660
2015-07-17 10:47:59.601511 7fc26df3b700 0 WARNING: couldn't find acl header for bucket, generating default
2015-07-17 10:47:59.601546 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.72:6800/1575 -- osd_op(client.64105.0:745 tw-east [getxattrs,stat] 13.24dd5f83 ack+read
+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07cdf0 con 0x3456350
2015-07-17 10:47:59.603224 7fc2f0ff9700 1 -- 10.35.64.71:0/1003845 <== osd.1 10.35.64.72:6800/1575 284 ==== osd_op_reply(745 tw-east [getxattrs,stat] v0'0 uv0 ack = -2
((2) No such file or directory)) v6 ==== 216+0+0 (2563183836 0 0) 0x7fc2d0091e70 con 0x3456350
2015-07-17 10:47:59.603606 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.72:6800/1575 -- osd_op(client.64105.0:746 statelog.obj_opstate.82 [call statelog.add] 1
5.80e31fb2 ondisk+write+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07cdf0 con 0x3456350
2015-07-17 10:47:59.616924 7fc2f0ff9700 1 -- 10.35.64.71:0/1003845 <== osd.1 10.35.64.72:6800/1575 285 ==== osd_op_reply(746 statelog.obj_opstate.82 [call] v74'594 uv5
94 ondisk = 0) v6 ==== 190+0+0 (3118336139 0 0) 0x7fc2d0091e70 con 0x3456350
2015-07-17 10:47:59.617143 7fc26df3b700 0 > HTTP_DATE -> Fri Jul 17 02:47:59 2015
2015-07-17 10:47:59.617149 7fc26df3b700 0 > HTTP_X_AMZ_COPY_SOURCE -> BUCKET1/tw.json
2015-07-17 10:47:59.620134 7fc26df3b700 1 -- 10.35.64.71:0/1003845 --> 10.35.64.72:6800/1575 -- osd_op(client.64105.0:747 statelog.obj_opstate.82 [call statelog.add] 1
5.80e31fb2 ondisk+write+known_if_redirected e74) v5 -- ?+0 0x7fc2ec07cbd0 con 0x3456350
2015-07-17 10:47:59.627518 7fc2f0ff9700 1 -- 10.35.64.71:0/1003845 <== osd.1 10.35.64.72:6800/1575 286 ==== osd_op_reply(747 statelog.obj_opstate.82 [call] v74'595 uv5
95 ondisk = 0) v6 ==== 190+0+0 (3399264686 0 0) 0x7fc2d0091e70 con 0x3456350
2015-07-17 10:47:59.627686 7fc26df3b700 0 WARNING: set_req_state_err err_no=5 resorting to 500
2015-07-17 10:47:59.627764 7fc26df3b700 1 ====== req done req=0x7fc2ec0704f0 http_status=500 ======

#11 Updated by Wen-Dwo Hwang Hwang over 8 years ago

I add the option of 'debug ms = 20' and re-run again. The error messages are in the attachment file.

#12 Updated by Wen-Dwo Hwang Hwang over 8 years ago

Hi:
I change fastcgi module to Ceph patched version and test again. The result of data replication still fail. The package information as follows:
libapache2-mod-fastcgi 2.4.7~0910052141-ceph1

I create the bucket 'BASKET2', and put the file 'tw.json'. The logging information as follows:
(1) In the master zone
@application/json; charset=UTF-8
Fri, 17 Jul 2015 06:31:11 GMT
x-amz-copy-source:BASKET2/tw.json
/BASKET2/tw.json
2015-07-17 14:31:11,807 13867 [boto][DEBUG ] Signature:
AWS tw_access_key:lBmJreXR0ZTZsooV+rilICtgEYA=
2015-07-17 14:31:11,807 13867 [boto][DEBUG ] url = 'http://10.35.64.71/BASKET2/tw.json'
params={'rgwx-op-id': 'node1:13440:1', 'rgwx-source-zone': u'tw-east', 'rgwx-client-id': 'radosgw-agent'}
headers={'Content-Length': '0', 'User-Agent': 'Boto/2.20.1 Python/2.7.6 Linux/3.13.0-57-generic', 'x-amz-copy-source': 'BASKET2/tw.json', 'Date': 'Fri, 17 Jul 2015 06:31:11 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Authorization': 'AWS tw_access_key:lBmJreXR0ZTZsooV+rilICtgEYA='}
data=None
2015-07-17 14:31:11,807 13867 [boto][DEBUG ] Method: PUT
2015-07-17 14:31:11,807 13867 [boto][DEBUG ] Path: /BASKET2/tw.json?rgwx-op-id=node1%3A13440%3A1&rgwx-source-zone=tw-east&rgwx-client-id=radosgw-agent
2015-07-17 14:31:11,807 13867 [boto][DEBUG ] Data:
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] Headers: {'Content-Type': 'application/json; charset=UTF-8', 'x-amz-copy-source': 'BASKET2/tw.json'}
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] Host: 10.35.64.71
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] Port: 80
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] Params: {'rgwx-op-id': 'node1%3A13440%3A1', 'rgwx-source-zone': 'tw-east', 'rgwx-client-id': 'radosgw-agent'}
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] establishing HTTP connection: kwargs={'port': 80, 'timeout': 70}
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] Token: None
2015-07-17 14:31:11,808 13867 [boto][DEBUG ] StringToSign:
PUT

application/json; charset=UTF-8
Fri, 17 Jul 2015 06:31:11 GMT
x-amz-copy-source:BASKET2/tw.json
/BASKET2/tw.json
2015-07-17 14:31:11,809 13867 [boto][DEBUG ] Signature:
AWS tw_access_key:lBmJreXR0ZTZsooV+rilICtgEYA=
2015-07-17 14:31:11,897 13867 [radosgw_agent.worker][DEBUG ] object "BASKET2/tw.json" not found on master, deleting from secondary
2015-07-17 14:31:11,897 13867 [boto][DEBUG ] path=/BASKET2/
2015-07-17 14:31:11,897 13867 [boto][DEBUG ] auth_path=/BASKET2/
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] path=/BASKET2/?max-keys=0
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] auth_path=/BASKET2/?max-keys=0
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Method: GET
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Path: /BASKET2/?max-keys=0
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Data:
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Headers: {}
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Host: 10.35.64.71
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Port: 80
2015-07-17 14:31:11,898 13867 [boto][DEBUG ] Params: {}
2015-07-17 14:31:11,899 13867 [boto][DEBUG ] establishing HTTP connection: kwargs={'port': 80, 'timeout': 70}
2015-07-17 14:31:11,899 13867 [boto][DEBUG ] Token: None
2015-07-17 14:31:11,899 13867 [boto][DEBUG ] StringToSign:
GET@

(2) In the secondary zone, I got 404 HTTP response
10.35.64.81 - - [17/Jul/2015:14:31:11 +0800] "PUT /BASKET2/tw.json?rgwx-op-id=node1%3A13440%3A1&rgwx-source-zone=tw-east&rgwx-client-id=radosgw-agent HTTP/1.1" 404 242 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-57-generic"
10.35.64.81 - - [17/Jul/2015:14:31:11 +0800] "DELETE /BASKET2/tw.json HTTP/1.1" 204 126 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-57-generic"

#13 Updated by Alfredo Deza about 6 years ago

  • Status changed from Need More Info to Closed

The agent is no longer a supported tool.

Also available in: Atom PDF