Bug #46062

File Corruption in Multisite Replication with Encryption

Added by Howard Brown almost 4 years ago. Updated 29 days ago.

Status: Resolved
Priority: High
Assignee: Casey Bodley
Target version: -
% Done: 100%
Source:
Tags: multisite encryption backport_processed
Backport: pacific quincy reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 52248
Crash signature (v1):
Crash signature (v2):

Description

This may be related to https://tracker.ceph.com/issues/39992 - though I didn't see any mention of encryption in that issue.

We've noticed that any object larger than ~8 MB is consistently modified ("corrupted") when replicated to the other site.

In the example output below, the first file in each comparison is the original uploaded (via the aws CLI) to the source zone; the second file (under "checks/") was downloaded (via the aws CLI) from the secondary/replicated zone:

$ cmp data.file_8M checks/data.file_8M
$ cmp data.file_9M checks/data.file_9M
data.file_9M checks/data.file_9M differ: byte 8388622, line 32962
$ cmp data.file_10M checks/data.file_10M
data.file_10M checks/data.file_10M differ: byte 8388622, line 32301
$ cmp data.file_90M checks/data.file_90M
data.file_90M checks/data.file_90M differ: byte 8388622, line 32593

Example of how each test file is created with dd and /dev/urandom:

$ dd if=/dev/urandom of=data.file_9M bs=9k count=1k
1024+0 records in
1024+0 records out
9437184 bytes (9.4 MB) copied, 1.15529 s, 8.2 MB/s
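
For reference, a rough shell sketch of the full check (create a file, upload to the source zone, wait for sync, download the replica from the secondary zone, compare byte-for-byte). The bucket name, endpoint URLs and sleep interval are placeholders, not our actual configuration:

#!/bin/bash
# Placeholder endpoints/bucket - substitute the real RGW endpoints of each zone.
SRC_ENDPOINT=http://rgw-primary.example.com:8000
DST_ENDPOINT=http://rgw-secondary.example.com:8000
BUCKET=repl-test

mkdir -p checks
# Create the test bucket on the source zone (skip if it already exists).
aws --endpoint-url $SRC_ENDPOINT s3 mb s3://$BUCKET

for size_mb in 8 9 10 90; do
    f=data.file_${size_mb}M
    # Create a random test file of roughly ${size_mb} MiB.
    dd if=/dev/urandom of=$f bs=1M count=$size_mb status=none
    # Upload to the source zone.
    aws --endpoint-url $SRC_ENDPOINT s3 cp $f s3://$BUCKET/$f
    # Give multisite replication time to catch up (crude; adjust as needed).
    sleep 60
    # Download the replica from the secondary zone and compare against the original.
    aws --endpoint-url $DST_ENDPOINT s3 cp s3://$BUCKET/$f checks/$f
    cmp $f checks/$f && echo "$f OK" || echo "$f MISMATCH"
done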

We see this only when encryption is turned on. We are using the "automatic encryption" method, i.e. adding the following config parameter to ceph.conf for all of the RGWs:

rgw_crypt_default_encryption_key = [base64-encoded 256 bit key]
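
For completeness, the key is simply 32 random bytes, base64-encoded. A sketch of generating one and where we place it (the section name below is illustrative, not a recommendation):

$ openssl rand -base64 32
# prints a 44-character base64 string, used as the key below

# ceph.conf on every RGW host (illustrative section name)
[global]
rgw_crypt_default_encryption_key = <output of the openssl command above>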

We first noticed this after turning replication on in version 14.2.4, and subsequently updated to version 14.2.9 to see whether the issue is still present (it is).


Related issues (3): 0 open, 3 closed

Copied to rgw - Backport #62321: quincy: File Corruption in Multisite Replication with Encryption (Resolved, Casey Bodley)
Copied to rgw - Backport #62322: pacific: File Corruption in Multisite Replication with Encryption (Rejected, Casey Bodley)
Copied to rgw - Backport #62323: reef: File Corruption in Multisite Replication with Encryption (Resolved, Casey Bodley)
Actions #1

Updated by Casey Bodley almost 4 years ago

  • Priority changed from Normal to High
  • Tags set to multisite encryption
Actions #2

Updated by Howard Brown almost 4 years ago

Verified issue still occurs after updating to 14.2.10; will also try upgrading to 15.2.4 just to see if it's an issue there as well.

Actions #3

Updated by Howard Brown almost 4 years ago

Confirmed same issue occurs in 15.2.4:

$ cmp data.file_15.2.4_8M checks/data.file_15.2.4_8M
$ cmp data.file_15.2.4_9M checks/data.file_15.2.4_9M
data.file_15.2.4_9M checks/data.file_15.2.4_9M differ: byte 8388622, line 32962
$ cmp data.file_15.2.4_90M checks/data.file_15.2.4_90M
data.file_15.2.4_90M checks/data.file_15.2.4_90M differ: byte 8388622, line 32593

Actions #4

Updated by Casey Bodley over 2 years ago

  • Assignee set to Casey Bodley
Actions #5

Updated by Casey Bodley over 1 year ago

  • Assignee deleted (Casey Bodley)
Actions #6

Updated by Casey Bodley 11 months ago

  • Status changed from New to In Progress
  • Assignee set to Marcus Watts

Trivial to reproduce in a two-zone multisite configuration where both zones override rgw crypt default encryption key = 4YSmvJtBv0aZ7geVgAsdpRnLBEwWSWlMIGnRS8a9TSA= (this required https://github.com/ceph/ceph/pull/51786 to fix default encryption).

The s3cmd config file c1.s3cfg targets the primary zone, and c2.s3cfg targets the secondary. A 5m file is uploaded in a single part, and a 6m file is uploaded as multipart. s3cmd detects the MD5 mismatch when reading the replica of 6m from the secondary zone; a sketch of the two config files appears after the transcript below.

~/ceph/build $ dd if=/dev/random of=5m bs=1M count=5
~/ceph/build $ dd if=/dev/random of=6m bs=1M count=6
~/ceph/build $ s3cmd -c c1.s3cfg mb s3://testbucket
Bucket 's3://testbucket/' created
~/ceph/build $ s3cmd -c c1.s3cfg put 5m s3://testbucket
upload: '5m' -> 's3://testbucket/5m'  [1 of 1]
 5242880 of 5242880   100% in    3s  1676.08 KB/s  done
~/ceph/build $ s3cmd -c c1.s3cfg --multipart-chunk-size-mb=5 put 6m s3://testbucket
upload: '6m' -> 's3://testbucket/6m'  [part 1 of 2, 5MB] [1 of 1]
 5242880 of 5242880   100% in    0s    81.83 MB/s  done
upload: '6m' -> 's3://testbucket/6m'  [part 2 of 2, 1024KB] [1 of 1]
 1048576 of 1048576   100% in    0s    48.43 MB/s  done
~/ceph/build $ s3cmd -c c2.s3cfg get s3://testbucket/5m 5m.c2
download: 's3://testbucket/5m' -> '5m.c2'  [1 of 1]
 5242880 of 5242880   100% in    0s   176.85 MB/s  done
~/ceph/build $ s3cmd -c c2.s3cfg get s3://testbucket/6m 6m.c2
download: 's3://testbucket/6m' -> '6m.c2'  [1 of 1]
 6291456 of 6291456   100% in    0s   164.28 MB/s  done
WARNING: MD5 signatures do not match: computed=140e7c8e37f569f54f774d92b78772c4, received=0404e856c5ca9e95202a537102c92fae
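
For reference, a minimal sketch of what the two s3cmd config files above could look like; the ports and credential placeholders are illustrative, standing in for the two zones' actual RGW endpoints and the (multisite-replicated) user:

# c1.s3cfg - primary zone (placeholder endpoint)
[default]
access_key = <rgw user access key>
secret_key = <rgw user secret key>
host_base = localhost:8000
host_bucket = localhost:8000
use_https = False

# c2.s3cfg - secondary zone (placeholder endpoint; same user, since users sync via multisite)
[default]
access_key = <rgw user access key>
secret_key = <rgw user secret key>
host_base = localhost:8002
host_bucket = localhost:8002
use_https = False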
Actions #7

Updated by Casey Bodley 11 months ago

  • Backport set to pacific quincy reef
Actions #8

Updated by Casey Bodley 11 months ago

work in progress on a repair tool in https://github.com/ceph/ceph/pull/51842

Actions #9

Updated by Casey Bodley 9 months ago

  • Status changed from In Progress to Pending Backport
  • Assignee changed from Marcus Watts to Casey Bodley
Actions #10

Updated by Backport Bot 9 months ago

  • Copied to Backport #62321: quincy: File Corruption in Multisite Replication with Encryption added
Actions #11

Updated by Backport Bot 9 months ago

  • Copied to Backport #62322: pacific: File Corruption in Multisite Replication with Encryption added
Actions #12

Updated by Backport Bot 9 months ago

  • Copied to Backport #62323: reef: File Corruption in Multisite Replication with Encryption added
Actions #13

Updated by Backport Bot 9 months ago

  • Tags changed from multisite encryption to multisite encryption backport_processed
Actions #14

Updated by Casey Bodley 9 months ago

  • Pull request ID set to 52248
Actions #15

Updated by Konstantin Shalygin 29 days ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100