Bug #61472

Updated by Christopher Hoffman 11 months ago

This bug creates holes (zeros in this case) in the image where there should be data. This causes corruption at the block and filesystem levels. A reproducer was created to illustrate the steps that cause the problem and to show the associated symptoms. 

 The environment: 
 <pre> 
 There are two sites, site-a and site-b. To start the scenario, an image is created with mirroring enabled. A 1m mirror snapshot schedule is then set for that image. 
 </pre> 
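
 A minimal sketch of that setup, not taken from the reproducer itself. It assumes the pool is already in image mirror mode with peers configured, and that the image uses snapshot-based mirroring (implied by the mirror snapshot schedule). The pool name, image name, size, and the @DRY_RUN@ switch are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical names and size; adjust to the clusters under test.
POOL="${POOL:-rbd}"
IMAGE="${IMAGE:-test-image}"
SIZE="${SIZE:-10G}"

# run CMD...: print the command when DRY_RUN=1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the image, enable snapshot-based mirroring on it (an assumption),
# and attach the 1m mirror snapshot schedule described above.
setup_mirrored_image() {
    run rbd create --size "$SIZE" "$POOL/$IMAGE"
    run rbd mirror image enable "$POOL/$IMAGE" snapshot
    run rbd mirror snapshot schedule add --pool "$POOL" --image "$IMAGE" 1m
}
```

 Calling @setup_mirrored_image@ with the default @DRY_RUN=1@ only prints the three commands; set @DRY_RUN=0@ to execute them against a real cluster.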

 Here's how we arrive at the bug: 
 <pre> 
 1. Map the image on the primary 
 2. Format, then mount the block device 
 3a. Start manually creating 40 mirror image snapshots, one every 3 seconds. This step will end 3b when completed. 
 3b. Untar via the provided script for 300 minutes, with a timeout of 5m and a 1 minute snapshot schedule interval 
 4. Unmap, then demote the image on the primary 
 5. Wait for the demoted image to be synced to the secondary 
 6. Promote the image on the secondary 
 7. Map the image on the secondary 
 8. fsck -fn the image 
 9. At this point the primary and secondary swap roles 
 10. Repeat starting at step 1 
 </pre> 
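
 The steps above can be sketched roughly as follows. This is not the attached reproducer: the device path, mount point, ext4 filesystem type, and the @untar_workload.sh@ script name are hypothetical, and the sync wait in step 5 is reduced to a single status query. The sketch simply waits for both 3a and 3b to finish, which may differ from how the reproducer couples them. By default it only prints the commands (@DRY_RUN=1@).

```shell
#!/bin/sh
# Hypothetical names; the real reproducer may differ in every one of these.
POOL="${POOL:-rbd}"
IMAGE="${IMAGE:-test-image}"
DEV="${DEV:-/dev/rbd0}"            # assumed device assigned by "rbd map"
MNT="${MNT:-/mnt/rbdtest}"
SNAP_COUNT="${SNAP_COUNT:-40}"
SNAP_INTERVAL="${SNAP_INTERVAL:-3}"

# run CMD...: print the command when DRY_RUN=1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 3a: create SNAP_COUNT mirror snapshots, SNAP_INTERVAL seconds apart.
snapshot_loop() {
    i=0
    while [ "$i" -lt "$SNAP_COUNT" ]; do
        run rbd --cluster "$1" mirror image snapshot "$POOL/$IMAGE"
        run sleep "$SNAP_INTERVAL"
        i=$((i + 1))
    done
}

# one_iteration PRIMARY SECONDARY: steps 1-8; returns the fsck exit status.
one_iteration() {
    primary=$1
    secondary=$2
    run sudo rbd --cluster "$primary" map "$POOL/$IMAGE"                 # 1
    run sudo mkfs.ext4 "$DEV"                                            # 2 (ext4 assumed)
    run sudo mount "$DEV" "$MNT"
    snapshot_loop "$primary" &                                           # 3a
    run ./untar_workload.sh "$MNT"                                       # 3b (hypothetical script)
    wait
    run sudo umount "$MNT"                                               # 4
    run sudo rbd unmap "$DEV"
    run rbd --cluster "$primary" mirror image demote "$POOL/$IMAGE"
    # 5: a real script would poll this status until the demotion has synced.
    run rbd --cluster "$secondary" mirror image status "$POOL/$IMAGE"
    run rbd --cluster "$secondary" mirror image promote "$POOL/$IMAGE"   # 6
    run sudo rbd --cluster "$secondary" map "$POOL/$IMAGE"               # 7
    run sudo fsck -fn "$DEV"                                             # 8
}

# ping_pong MAX: run up to MAX iterations, swapping sites after each one.
ping_pong() {
    a=site-a
    b=site-b
    n=0
    while [ "$n" -lt "$1" ]; do
        one_iteration "$a" "$b" || return 1   # stop at the first failure
        tmp=$a; a=$b; b=$tmp                  # 9: primary and secondary swap
        n=$((n + 1))                          # 10: repeat from step 1
    done
}
```

 For example, @ping_pong 2@ prints one pass with site-a as primary followed by one with site-b as primary.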

 In summary, the workload in the steps above ping-pongs between site-a and site-b. 

 After several iterations of the above, data corruption was observed. 
 This can be seen by running fsck on the volume: 
 <pre> 
 + sudo fsck -fn /dev/rbd0 
 fsck from util-linux 2.38.1 
 e2fsck 1.46.5 (30-Dec-2021) 
 Pass 1: Checking inodes, blocks, and sizes 
 HTREE directory inode 83315 has an invalid root node. 
 Clear HTree index? no 

 HTREE directory inode 83315 has an unsupported hash version (20) 
 Clear HTree index? no 

 HTREE directory inode 83315 uses an incompatible htree root node flag. 
 Clear HTree index? no 
 ... 
 </pre> 

 Steps to reproduce are: 
 1. In the build directory of a clone of upstream, run: 
 <pre> 
 sh ./rbd_reproducer_corrupt1.sh 
 </pre> 
 2. The script will run until an error code is returned (most likely by fsck). 
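
 To interpret a non-zero status from fsck, a small helper along these lines may be useful. It is only a sketch and not part of the reproducer; the function name is hypothetical, and the bit meanings are the ones listed in the fsck(8) man page.

```shell
#!/bin/sh
# describe_fsck_status STATUS: print one line per bit set in an fsck exit
# status, using the bit values documented in fsck(8).
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for pair in \
        "1:filesystem errors corrected" \
        "2:system should be rebooted" \
        "4:filesystem errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:checking canceled by user request" \
        "128:shared-library error"
    do
        bit=${pair%%:*}     # text before the first colon
        msg=${pair#*:}      # text after the first colon
        if [ $((status & bit)) -ne 0 ]; then
            echo "$msg"
        fi
    done
    return 0
}
```

 Usage would look like @sudo fsck -fn /dev/rbd0; describe_fsck_status $?@. With @-n@ nothing is repaired, so a corrupted filesystem is expected to report "filesystem errors left uncorrected".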
 


 In testing so far, it took 3 untars (5h*3=15h) to produce this error.
