Bug #46062
File Corruption in Multisite Replication with Encryption
Status: Closed
% Done: 100%
Description
This may be related to https://tracker.ceph.com/issues/39992 - though I didn't see any mention of encryption in that issue.
We've noticed that anything over ~8MB is consistently being modified ("corrupted") when replicated to the other site.
In the following example output, the first file in each comparison is the original uploaded (via aws cli) to the source zone; the second (prefixed with "checks/") was downloaded (via aws cli) from the secondary/replicated zone:
$ cmp data.file_8M checks/data.file_8M
$ cmp data.file_9M checks/data.file_9M
data.file_9M checks/data.file_9M differ: byte 8388622, line 32962
$ cmp data.file_10M checks/data.file_10M
data.file_10M checks/data.file_10M differ: byte 8388622, line 32301
$ cmp data.file_90M checks/data.file_90M
data.file_90M checks/data.file_90M differ: byte 8388622, line 32593
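The per-file comparisons above can be wrapped in a small helper. This is only a sketch: the checks/ layout and file names follow the transcript, and everything else is illustrative.

```shell
# Compare each original upload against its replicated copy under checks/,
# mirroring the manual cmp runs in the transcript above.
verify_replicas() {
    for f in "$@"; do
        if cmp -s "$f" "checks/$f"; then
            echo "$f: identical"
        else
            echo "$f: DIFFERS"
        fi
    done
}
```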
Example of how the test files are created with dd and /dev/urandom:
$ dd if=/dev/urandom of=data.file_9M bs=9k count=1k
1024+0 records in
1024+0 records out
9437184 bytes (9.4 MB) copied, 1.15529 s, 8.2 MB/s
We are seeing this only when encryption is enabled. This setup uses the "automatic encryption" method, i.e. the following config parameter is set in ceph.conf for all of the RGWs:
rgw_crypt_default_encryption_key = [base64-encoded 256 bit key]
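For reference, a key in the expected format (32 random bytes, base64-encoded) can be generated with openssl; the command below is an illustration and assumes openssl is available. Any key posted to a public tracker should of course never be reused.

```shell
# Generate 256 bits of random key material and base64-encode it,
# which is the format rgw_crypt_default_encryption_key expects.
key=$(openssl rand -base64 32)
echo "rgw_crypt_default_encryption_key = $key"
```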
We first noticed this after turning replication on in version 14.2.4, and subsequently updated to version 14.2.9 to see whether the issue is still present (it is).
Updated by Casey Bodley almost 4 years ago
- Priority changed from Normal to High
- Tags set to multisite encryption
Updated by Howard Brown almost 4 years ago
Verified the issue still occurs after updating to 14.2.10; will also try upgrading to 15.2.4 to see whether it is present there as well.
Updated by Howard Brown almost 4 years ago
Confirmed same issue occurs in 15.2.4:
$ cmp data.file_15.2.4_8M checks/data.file_15.2.4_8M
$ cmp data.file_15.2.4_9M checks/data.file_15.2.4_9M
data.file_15.2.4_9M checks/data.file_15.2.4_9M differ: byte 8388622, line 32962
$ cmp data.file_15.2.4_90M checks/data.file_15.2.4_90M
data.file_15.2.4_90M checks/data.file_15.2.4_90M differ: byte 8388622, line 32593
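One detail worth noting: the first differing byte is identical across file sizes and releases. A quick check shows it sits 14 bytes past the 8 MiB mark. That this coincides with aws cli's default 8 MiB multipart threshold/part size is an interpretation, not something confirmed in this report.

```shell
# cmp reports 1-based byte positions; every corrupted file first differs
# at byte 8388622, just past the 8 MiB (8388608-byte) boundary.
first_diff=8388622
eight_mib=$((8 * 1024 * 1024))
echo $((first_diff - eight_mib))   # prints 14
```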
Updated by Casey Bodley 11 months ago
- Status changed from New to In Progress
- Assignee set to Marcus Watts
Trivial to reproduce in a two-zone multisite configuration where both zones override rgw crypt default encryption key = 4YSmvJtBv0aZ7geVgAsdpRnLBEwWSWlMIGnRS8a9TSA= (required https://github.com/ceph/ceph/pull/51786 to fix default encryption).
s3cmd config file c1.s3cfg targets the primary zone, and c2.s3cfg targets the secondary. A 5m file is uploaded in a single part, and a 6m file is uploaded as multipart. s3cmd detects the md5 mismatch when reading the replica of 6m from the secondary zone:
~/ceph/build $ dd if=/dev/random of=5m bs=1M count=5
~/ceph/build $ dd if=/dev/random of=6m bs=1M count=6
~/ceph/build $ s3cmd -c c1.s3cfg mb s3://testbucket
Bucket 's3://testbucket/' created
~/ceph/build $ s3cmd -c c1.s3cfg put 5m s3://testbucket
upload: '5m' -> 's3://testbucket/5m'  [1 of 1]
 5242880 of 5242880   100% in    3s  1676.08 KB/s  done
~/ceph/build $ s3cmd -c c1.s3cfg --multipart-chunk-size-mb=5 put 6m s3://testbucket
upload: '6m' -> 's3://testbucket/6m'  [part 1 of 2, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    81.83 MB/s  done
upload: '6m' -> 's3://testbucket/6m'  [part 2 of 2, 1024KB] [1 of 1]
 1048576 of 1048576   100% in    0s    48.43 MB/s  done
~/ceph/build $ s3cmd -c c2.s3cfg get s3://testbucket/5m 5m.c2
download: 's3://testbucket/5m' -> '5m.c2'  [1 of 1]
 5242880 of 5242880   100% in    0s   176.85 MB/s  done
~/ceph/build $ s3cmd -c c2.s3cfg get s3://testbucket/6m 6m.c2
download: 's3://testbucket/6m' -> '6m.c2'  [1 of 1]
 6291456 of 6291456   100% in    0s   164.28 MB/s  done
WARNING: MD5 signatures do not match: computed=140e7c8e37f569f54f774d92b78772c4, received=0404e856c5ca9e95202a537102c92fae
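If the pattern from the earlier aws cli tests carries over (corruption starting just past the first part boundary — an assumption, since this transcript only shows the MD5 mismatch), the 6m replica would first differ shortly after the end of its 5 MiB first part:

```shell
# End of part 1 for the multipart 6m upload (--multipart-chunk-size-mb=5);
# under the assumed pattern, cmp 6m 6m.c2 would report its first
# difference shortly after this offset.
part_size=$((5 * 1024 * 1024))
echo "$part_size"   # prints 5242880
```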
Updated by Casey Bodley 11 months ago
Work in progress on a repair tool: https://github.com/ceph/ceph/pull/51842
Updated by Casey Bodley 9 months ago
- Status changed from In Progress to Pending Backport
- Assignee changed from Marcus Watts to Casey Bodley
Updated by Backport Bot 9 months ago
- Copied to Backport #62321: quincy: File Corruption in Multisite Replication with Encryption added
Updated by Backport Bot 9 months ago
- Copied to Backport #62322: pacific: File Corruption in Multisite Replication with Encryption added
Updated by Backport Bot 9 months ago
- Copied to Backport #62323: reef: File Corruption in Multisite Replication with Encryption added
Updated by Backport Bot 9 months ago
- Tags changed from multisite encryption to multisite encryption backport_processed
Updated by Konstantin Shalygin about 1 month ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100