Bug #55310 (closed)

[pacific] RadosGW instance of Cloud Sync zone crashes when objects are uploaded

Added by Baturay Soysal about 2 years ago. Updated over 1 year ago.

Status: Duplicate
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Summary:
Cloud tier RadosGW crashes with std::length_error and does not sync to S3-enabled cloud storage on Pacific.

How to reproduce it:
  • Set up a containerized Pacific cluster using ceph-ansible.
  • Create a zone with tier type cloud and modify it with the necessary tier configs.
  • Add the necessary configuration (ceph.conf) for the new RadosGW that will serve the new zone.
  • Start the new RadosGW and restart the existing ones.
  • Put some objects in a bucket; the new RadosGW (cloud zone) will keep crashing and restarting.

Description:
We are setting up a Ceph cluster with three RadosGW instances. The cluster is deployed in containers with ceph-ansible, using the container images at quay.io/repository/ceph/daemon.

After the cluster is up and running, we run the following commands to create a zone for the Cloud Sync module:

radosgw-admin zone create \
    --rgw-zonegroup=cdn \
    --rgw-zone=sync \
    --endpoints=http://<rgw_ip>:8081 \      # IP address of first RGW node (controller-01), with a different port (8081)
    --tier-type=cloud

radosgw-admin zone modify \
    --rgw-zonegroup=cdn \
    --rgw-zone=sync \
    --tier-config=connection.access_key=<gcs_s3_access_key>,connection.secret=<gcs_s3_secret_key>,connection.endpoint=https://storage.googleapis.com

radosgw-admin zone modify \
    --rgw-zonegroup=cdn \
    --rgw-zone=sync \
    --access-key=<rgw_access_key> \
    --secret=<rgw_secret_key>

radosgw-admin period update --commit
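
As an optional sanity check, the committed tier configuration can be inspected with the standard radosgw-admin subcommands (zone and zonegroup names as above):

radosgw-admin zone get --rgw-zone=sync      # the zone JSON should show the connection.* tier config set above
radosgw-admin period get                    # the committed period should list zone "sync" under zonegroup "cdn"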

Add the following to ceph.conf:

[client.rgw.controller-01.rgw1]
host = controller-01
keyring = /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/keyring
log file = /var/log/ceph/ceph-cluster-rgw-controller-01.rgw1.log
rgw frontends = beast endpoint=<rgw_ip>:8081
rgw thread pool size = 512
rgw_realm = myrealm
rgw_zone = sync
rgw_zonegroup = cdn
debug_rgw_sync = 5
debug_rgw = 5

Create necessary files and keyring for the new RGW instance:

mkdir /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/
echo "INST_NAME=rgw1" > /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/EnvironmentFile
ceph auth get-or-create client.rgw.controller-01.rgw1 osd 'allow rwx' mon 'allow rw' -o /var/lib/ceph/radosgw/ceph-cluster-rgw.controller-01.rgw1/keyring
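
Optionally, confirm the keyring and caps created above:

ceph auth get client.rgw.controller-01.rgw1     # should report caps: [mon] allow rw, [osd] allow rwx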

Start the new RGW instance:

systemctl start ceph-radosgw@rgw.controller-01.rgw1.service
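
Optionally, verify that the new instance is up and listening on the endpoint configured above (<rgw_ip>:8081):

systemctl status ceph-radosgw@rgw.controller-01.rgw1.service    # unit should be active (running)
curl -s http://<rgw_ip>:8081                                    # an anonymous request should get an S3 XML response from the cloud zone RGW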

Restart existing RGW instances (optional):

systemctl restart ceph-radosgw@rgw.controller-01.rgw0.service     # on controller-01
systemctl restart ceph-radosgw@rgw.controller-02.rgw0.service     # on controller-02
systemctl restart ceph-radosgw@rgw.controller-03.rgw0.service     # on controller-03

The cloud RGW instance (rgw1) starts running. After a user and a bucket are created and some objects are put in the bucket, `rgw1` starts the cloud sync. It successfully creates the bucket in the destination cloud storage and finds the objects that need to be synced. However, it exits with a `std::length_error` exception without being able to sync any of the objects. The container logs and the crash dump are attached to this issue report.
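
For debugging, the crash details can usually be collected as follows (assuming the crash module picks up the dump in this containerized deployment; <crash_id> is a placeholder for the ID printed by `ceph crash ls`):

ceph crash ls                                                   # lists recent daemon crashes with their crash IDs
ceph crash info <crash_id>                                      # backtrace should reference the std::length_error abort reported above
journalctl -u ceph-radosgw@rgw.controller-01.rgw1.service       # shows the repeated restarts of the cloud zone instance
radosgw-admin sync error list --rgw-zone=sync                   # any sync errors recorded for the cloud zone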

Meanwhile, sync status shows the following:

# radosgw-admin sync status --rgw-zone=sync
          realm 6fd94365-5255-4d37-a913-287223b2da78 (myrealm)
      zonegroup dc9daf83-3e2e-47c8-8e63-99b3aa21d246 (myzg)
           zone bccb41c9-1e88-423e-bb38-c09f5d575a63 (sync)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 6f00b0bc-f04e-4c4d-9e90-356c0064dbf4 (avrupa)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [42]
                        oldest incremental change not applied: 2022-04-10T19:19:38.256787+0300 [42]

The issue is reproducible on multiple Pacific versions (including the first and the latest stable releases). When tested on multiple Octopus versions (including the latest stable) with the same procedure, it works without issue.

Environment:
  • Tested Ceph daemon container versions that are affected: v6.0.0-stable-6.0-pacific-centos8-x86_64, v6.0.6-stable-6.0-pacific-centos-8-x86_64
  • Tested Ceph daemon container versions that are NOT affected: v5.0.9-stable-5.0-octopus-centos-8-x86_64, v5.0.14-stable-5.0-octopus-centos-8-x86_64
  • Remote cloud storage provider: Google Cloud Storage (also tested with AWS S3)

Files

cloud-sync-pacific.log (334 KB) - Baturay Soysal, 04/13/2022 06:39 AM

Related issues 1 (1 open, 0 closed)

Is duplicate of rgw - Bug #57306: rgw: cloud sync crash (Pending Backport) - Yehuda Sadeh

#1 - Updated by Casey Bodley over 1 year ago

  • Is duplicate of Bug #57306: rgw: cloud sync crash added
#2 - Updated by Casey Bodley over 1 year ago

  • Project changed from Ceph to rgw
  • Status changed from New to Duplicate