Bug #35974

Updated by Jason Dillaman over 5 years ago

From the ML: 

<pre>
 We utilize Ceph RBDs for our users' storage and need to keep data synchronized across data centres. For this we rely on 'rbd export-diff / import-diff'. Lately we have been noticing cases in which the file system on the 'destination RBD' is corrupt. We have been trying to isolate the issue, which may or may not be due to Ceph. We suspect the problem could be in 'rbd export-diff / import-diff' and are wondering if people have been seeing issues with these tools. Let me explain our use case and issue in more detail. 

 
 We have a number of data centres, each with a Ceph cluster storing tens of thousands of RBDs. We maintain extra copies of each RBD in other data centres. After we are 'done' using an RBD, we create a snapshot and use 'rbd export-diff' to generate a diff between that new snapshot and the most recent snapshot 'common' with the other data centre. We send the diff over the network and apply it with 'rbd import-diff' at the destination. When we apply a diff to a destination RBD we can guarantee its 'HEAD' is clean, and of course we guarantee that an RBD is only used in one data centre at a time. 

 
 We noticed corruption on the destination RBD based on fsck failures; further investigation showed that checksums of the RBD mismatch as well. Somehow the data is sometimes getting corrupted, either by our software or by 'rbd export-diff / import-diff'. Our investigation suggests that the problem is in 'rbd export-diff / import-diff'. The main evidence for this is that we occasionally sync an RBD to multiple data centres, each sync being a separate job with its own 'rbd export-diff'. We have seen cases where both destinations have the same corruption (and the same checksum) while the source is healthy. 

 
 In addition to this, we are seeing a similar type of corruption in another use case, where we migrate RBDs and their snapshots across pools. In this case we clone a version of an RBD (e.g. HEAD-3) to a new pool and rely on 'rbd export-diff / import-diff' to restore the last 3 snapshots on top. Here too we see cases of fsck and RBD checksum failures. 

 We maintain various metrics and logs. Looking back at our data, we have seen the issue at a small scale for a while on Jewel, but the frequency has increased recently. The timing may have coincided with our move to Luminous, but that may be a coincidence. We are currently on Ceph 12.2.5. 

 
 We are wondering whether other people are experiencing similar issues with 'rbd export-diff / import-diff'. I'm sure many people use it to keep backups in sync, and since these are backups, many may not inspect the data often. In our use case we use this mechanism to keep data in sync and actually need the data at the other location often. It's quite possible that many people have this issue but simply don't realize it; we are likely hitting it much more frequently because of the scale of our operation (tens of thousands of syncs a day). 
 </pre>
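
The cross-datacentre sync workflow described above, reduced to its rbd CLI steps, looks roughly like the sketch below. The pool, image, snapshot, and host names (mypool/myimage, snap-prev, snap-new, backup-site) are placeholders rather than values from the report; the point is only to show where 'rbd export-diff' and 'rbd import-diff' sit in the pipeline and how the two sides can be compared by checksum.

<pre>
#!/usr/bin/env bash
# Hedged sketch of the cross-datacentre sync described above.
# All names (mypool/myimage, snap-prev, snap-new, backup-site) are hypothetical.
set -euo pipefail

POOL_IMG="mypool/myimage"
PREV="snap-prev"      # most recent snapshot common to both sites
NEW="snap-new"        # snapshot taken once the image is 'done' being used

# 1. Freeze the current state at the source.
rbd snap create "${POOL_IMG}@${NEW}"

# 2. Ship only the delta between the common snapshot and the new one,
#    applying it to the destination image whose HEAD is known to be clean.
rbd export-diff --from-snap "${PREV}" "${POOL_IMG}@${NEW}" - \
  | ssh backup-site "rbd import-diff - ${POOL_IMG}"

# 3. Optional consistency check: a full export of the snapshot on both
#    sides should produce identical checksums.
src_sum=$(rbd export "${POOL_IMG}@${NEW}" - | md5sum | awk '{print $1}')
dst_sum=$(ssh backup-site "rbd export ${POOL_IMG}@${NEW} - | md5sum" | awk '{print $1}')
[ "${src_sum}" = "${dst_sum}" ] || echo "checksum mismatch for ${POOL_IMG}@${NEW}" >&2
</pre>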
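
The cross-pool migration case can be sketched the same way: clone the older snapshot into the new pool, recreate the base snapshot name on the clone so that import-diff can find its start point, then replay the remaining snapshots one diff at a time. Again, all pool, image, and snapshot names here are made up for illustration.

<pre>
#!/usr/bin/env bash
# Hedged sketch of the cross-pool migration described above.
# oldpool/newpool, myimage and snap1..snap4 are hypothetical names.
set -euo pipefail

SRC="oldpool/myimage"
DST="newpool/myimage"
BASE="snap1"                     # e.g. HEAD-3: the snapshot being cloned

# Clone the chosen snapshot into the destination pool.
rbd snap protect "${SRC}@${BASE}"
rbd clone "${SRC}@${BASE}" "${DST}"

# The clone starts with no snapshots of its own, so recreate the base
# snapshot name; import-diff refuses a diff whose start snapshot is missing.
rbd snap create "${DST}@${BASE}"

# Replay the last three snapshots on top, one diff at a time.
prev="${BASE}"
for snap in snap2 snap3 snap4; do
    rbd export-diff --from-snap "${prev}" "${SRC}@${snap}" - \
      | rbd import-diff - "${DST}"
    prev="${snap}"
done
</pre>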
