Bug #55310
[pacific] RadosGW instance of Cloud Sync zone crashes when objects are uploaded
Status: Closed (Duplicate)
Description
Summary:
Cloud tier RadosGW crashes with std::length_error and does not sync to S3-enabled cloud storage on Pacific.
- Set up containerized Pacific cluster using Ceph-Ansible
- Create a zone with tier-type cloud, modify it with necessary tier-configs.
- Make necessary configurations (ceph.conf) for the new RadosGW that will serve the new zone.
- Start the new RadosGW and restart the existing ones.
- Put some objects in some bucket and the new RadosGW (cloud zone) will keep restarting.
Description:
We are setting up a Ceph cluster with three RadosGW instances. The deployment is containerized, done with ceph-ansible using the container images from quay.io/repository/ceph/daemon.
After the cluster is up and running, we run the following commands to create a zone for the Cloud Sync module:
# <rgw_ip> is the IP address of the first RGW node (controller-01), with a different port (8081)
radosgw-admin zone create \
--rgw-zonegroup=cdn \
--rgw-zone=sync \
--endpoints=http://<rgw_ip>:8081 \
--tier-type=cloud
radosgw-admin zone modify \
--rgw-zonegroup=cdn \
--rgw-zone=sync \
--tier-config=connection.access_key=<gcs_s3_access_key>,connection.secret=<gcs_s3_secret_key>,connection.endpoint=https://storage.googleapis.com
radosgw-admin zone modify \
--rgw-zonegroup=cdn \
--rgw-zone=sync \
--access-key=<rgw_access_key> \
--secret=<rgw_secret_key>
radosgw-admin period update --commit
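Not part of the original report, but a sanity check one might run after committing the period: inspect the new zone and confirm the tier type and tier configuration were applied. The exact output fields vary by release.

```
# Dump the cloud-tier zone; "tier_type" should be "cloud" and
# "tier_config" should list the connection settings set above
radosgw-admin zone get --rgw-zone=sync
```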
Add the following to ceph.conf:
[client.rgw.controller-01.rgw1]
host = controller-01
keyring = /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/keyring
log file = /var/log/ceph/ceph-cluster-rgw-controller-01.rgw1.log
rgw frontends = beast endpoint=<rgw_ip>:8081
rgw thread pool size = 512
rgw_realm = myrealm
rgw_zone = sync
rgw_zonegroup = cdn
debug_rgw_sync = 5
debug_rgw = 5
Create necessary files and keyring for the new RGW instance:
mkdir /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/
echo "INST_NAME=rgw1" > /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/EnvironmentFile
ceph auth get-or-create client.rgw.controller-01.rgw1 osd 'allow rwx' mon 'allow rw' -o /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/keyring
Start the new RGW instance:
systemctl start ceph-radosgw@rgw.controller-01.rgw1.service
Restart existing RGW instances (optional):
systemctl restart ceph-radosgw@rgw.controller-01.rgw0.service # on controller-01
systemctl restart ceph-radosgw@rgw.controller-02.rgw0.service # on controller-02
systemctl restart ceph-radosgw@rgw.controller-03.rgw0.service # on controller-03
The cloud RGW instance (rgw1) starts running. After a user and a bucket are created and some objects are put in the bucket, rgw1 starts the cloud sync. It successfully creates the bucket in the destination cloud storage and finds the objects that need to be synced. However, it crashes with an uncaught std::length_error exception before syncing any of the objects. The container logs and the crash dump are attached to this issue.
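For completeness, the "user, bucket, objects" step above might look like the following. This is an illustrative sketch, not the exact commands from the report: the user name, bucket name, object file, and the use of s3cmd as the client are all assumptions.

```
# Hypothetical reproduction trigger (names and client tool are assumptions):
# create an S3 user on the master zone, then upload an object through the
# master-zone RGW; the cloud zone's rgw1 should then attempt to sync it
radosgw-admin user create --uid=testuser --display-name="Test User"
s3cmd --host=<master_rgw_ip>:8080 --access_key=<key> --secret_key=<secret> mb s3://testbucket
s3cmd --host=<master_rgw_ip>:8080 --access_key=<key> --secret_key=<secret> put ./object.bin s3://testbucket/
```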
Meanwhile, sync status shows the following:
# radosgw-admin sync status --rgw-zone=sync
realm 6fd94365-5255-4d37-a913-287223b2da78 (myrealm)
zonegroup dc9daf83-3e2e-47c8-8e63-99b3aa21d246 (myzg)
zone bccb41c9-1e88-423e-bb38-c09f5d575a63 (sync)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6f00b0bc-f04e-4c4d-9e90-356c0064dbf4 (avrupa)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 1 shards
behind shards: [42]
oldest incremental change not applied: 2022-04-10T19:19:38.256787+0300 [42]
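The status output above points at a single lagging data-sync shard (42). Not part of the original report, but commands one might use to drill into that shard and check for recorded sync errors; availability and output format of these subcommands vary by release.

```
# Inspect the lagging data-sync shard (shard 42 per the status output)
radosgw-admin data sync status --source-zone=avrupa --shard-id=42 --rgw-zone=sync
# List any sync errors recorded by the cloud zone's RGW
radosgw-admin sync error list --rgw-zone=sync
```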
The issue is reproducible on multiple versions of Pacific (incl. the first and the latest stable release). The same procedure works without issue on multiple versions of Octopus (incl. the latest stable release).
Environment:
- Tested Ceph daemon container versions that are affected: v6.0.0-stable-6.0-pacific-centos8-x86_64, v6.0.6-stable-6.0-pacific-centos-8-x86_64
- Tested Ceph daemon container versions that are NOT affected: v5.0.9-stable-5.0-octopus-centos-8-x86_64, v5.0.14-stable-5.0-octopus-centos-8-x86_64
- Remote cloud storage provider: Google Cloud Storage (also tested with AWS S3)
Updated by Casey Bodley over 1 year ago
- Is duplicate of Bug #57306: rgw: cloud sync crash added
Updated by Casey Bodley over 1 year ago
- Project changed from Ceph to rgw
- Status changed from New to Duplicate