Backport #15171: hammer: osd: corruption when min_read_recency_for_promote > 1

Updated by Loïc Dachary about 8 years ago

https://github.com/ceph/ceph/pull/8187 

 <pre> 
 just got done with a test against a build of 0.94.6 minus the two commits 
 that were backported in PR 7207. everything worked as it should with the 
 cache-mode set to writeback and the min_read_recency_for_promote set to 2. 
 assuming it works properly on master, there must be a commit that we're 
 missing on the backport to support this properly. 

 sage, 
 i'm adding you to the recipients on this so hopefully you see it. the tl;dr 
 version is that the backport of the cache recency fix to hammer doesn't 
 work right and potentially corrupts data when 
 the min_read_recency_for_promote is set to greater than 1. 

 mike 

 On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell <mike.lovell@endurance.com> 
 wrote: 

 > robert and i have done some further investigation the past couple days on 
 > this. we have a test environment with a hard drive tier and an ssd tier as 
 > a cache. several vms were created with volumes from the ceph cluster. i did 
 > a test in each guest where i un-tarred the linux kernel source multiple 
 > times and then did a md5sum check against all of the files in the resulting 
 > source tree. i started off with the monitors and osds running 0.94.5 and 
 > never saw any problems. 
 > 
 > a single node was then upgraded to 0.94.6 which has osds in both the ssd 
 > and hard drive tier. i then proceeded to run the same test and, while the 
 > untar and md5sum operations were running, i changed the ssd tier cache-mode 
 > from forward to writeback. almost immediately the vms started reporting io 
 > errors and odd data corruption. the remainder of the cluster was updated to 
 > 0.94.6, including the monitors, and the same thing happened. 
 > 
 > things were cleaned up and reset and then a test was run 
 > where min_read_recency_for_promote for the ssd cache pool was set to 1. we 
 > previously had it set to 6. there was never an error with the recency 
 > setting set to 1. i then tested with it set to 2 and it immediately caused 
 > failures. we are currently thinking that it is related to the backport of 
 > the fix for the recency promotion and are in the process of making a .6 build 
 > without that backport to see if we can still cause corruption. is anyone using a 
 > version from after the original recency fix (PR 6702) with a cache tier in 
 > writeback mode? anyone have a similar problem? 
 > 
 > mike 
 > 
 > On Mon, Mar 14, 2016 at 8:51 PM, Mike Lovell <mike.lovell@endurance.com> 
 > wrote: 
 > 
 >> something weird happened on one of the ceph clusters that i administer 
 >> tonight which resulted in virtual machines using rbd volumes seeing 
 >> corruption in multiple forms. 
 >> 
 >> when everything was fine earlier in the day, the cluster was a number of 
 >> storage nodes spread across 3 different roots in the crush map. the first 
 >> bunch of storage nodes have both hard drives and ssds in them with the hard 
 >> drives in one root and the ssds in another. there is a pool for each and 
 >> the pool for the ssds is a cache tier for the hard drives. the last set of 
 >> storage nodes were in a separate root with their own pool that is being 
 >> used for burn in testing. 
 >> 
 >> these nodes had run for a while with test traffic and we decided to move 
 >> them to the main root and pools. the main cluster is running 0.94.5 and the 
 >> new nodes got 0.94.6 due to them getting configured after that was 
 >> released. i removed the test pool and did a ceph osd crush move to move the 
 >> first node into the main cluster, the hard drives into the root for that 
 >> tier of storage and the ssds into the root and pool for the cache tier. 
 >> each set was done about 45 minutes apart and they ran for a couple hours 
 >> while performing backfill without any issue other than high load on the 
 >> cluster. 
 >> 
 >> we normally run the ssd tier in the forward cache-mode due to the ssds we 
 >> have not being able to keep up with the io of writeback. this results in io 
 >> on the hard drives slowly going up and performance of the cluster starting 
 >> to suffer. about once a week, i change the cache-mode between writeback and 
 >> forward for short periods of time to promote actively used data to the 
 >> cache tier. this moves io load from the hard drive tier to the ssd tier and 
 >> has been done multiple times without issue. i normally don't do this while 
 >> there are backfills or recoveries happening on the cluster but decided to 
 >> go ahead while backfill was happening due to the high load. 
 >> 
 >> i tried this procedure to change the ssd cache-tier between writeback and 
 >> forward cache-mode and things seemed okay from the ceph cluster. about 10 
 >> minutes after the first attempt at changing the mode, vms using the ceph 
 >> cluster for their storage started seeing corruption in multiple forms. the 
 >> mode was flipped back and forth multiple times in that time frame and it's 
 >> unknown if the corruption was noticed with the first change or subsequent 
 >> changes. the vms were having issues of filesystems having errors and 
 >> getting remounted RO and mysql databases seeing corruption (both myisam and 
 >> innodb). some of this was recoverable but on some filesystems there was 
 >> corruption that led to things like lots of data ending up in the 
 >> lost+found and some of the databases were un-recoverable (backups are 
 >> helping there). 
 >> 
 >> i'm not sure what would have happened to cause this corruption. the 
 >> libvirt logs for the qemu processes for the vms did not provide any output 
 >> of problems from the ceph client code. it doesn't look like any of the qemu 
 >> processes had crashed. also, it has now been several hours since this 
 >> happened with no additional corruption noticed by the vms. it doesn't 
 >> appear that we had any corruption happen before i attempted the flipping of 
 >> the ssd tier cache-mode. 
 >> 
 >> the only thing i can think of that is different between this time doing 
 >> this procedure vs previous attempts was that there was the one storage node 
 >> running 0.94.6 where the remainder were running 0.94.5. is it possible that 
 >> something changed between these two releases that would have caused 
 >> problems with data consistency related to the cache tier? or otherwise? any 
 >> other thoughts or suggestions? 
 >> 
 >> thanks in advance for any help you can provide. 
 >> 
 >> mike 
 >> 
 > 
 > 

 </pre>
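
The reproduction described in the thread (untar a source tree several times, then md5sum every file in the result) can be sketched as a small script. This is a minimal sketch, not the script used in the report: the function names, the `copy-N` directory layout, and the choice to take the expected checksums directly from the archive are all assumptions made for illustration.

```python
import hashlib
import tarfile
from pathlib import Path

CHUNK = 1 << 20  # read in 1 MiB pieces so large files are never loaded whole


def md5_of_stream(stream) -> str:
    """Return the hex MD5 of everything readable from a binary stream."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(CHUNK), b""):
        digest.update(chunk)
    return digest.hexdigest()


def expected_checksums(tarball: Path) -> dict[str, str]:
    """Map each regular file in the archive to the MD5 of its archived bytes."""
    expected = {}
    with tarfile.open(tarball) as archive:
        for member in archive.getmembers():
            if member.isfile():
                name = Path(member.name).as_posix()  # drops a leading "./"
                expected[name] = md5_of_stream(archive.extractfile(member))
    return expected


def verify_tree(expected: dict[str, str], root: Path) -> list[str]:
    """Return the names of files under root that are missing or differ."""
    bad = []
    for name, digest in sorted(expected.items()):
        path = root / name
        if not path.is_file():
            bad.append(name)
            continue
        with path.open("rb") as handle:
            if md5_of_stream(handle) != digest:
                bad.append(name)
    return bad


def untar_and_verify(tarball: Path, dest: Path, copies: int = 3) -> list[str]:
    """Extract the archive `copies` times under dest and checksum every copy."""
    expected = expected_checksums(tarball)
    bad = []
    for i in range(copies):
        target = dest / f"copy-{i}"  # hypothetical layout, one directory per pass
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(tarball) as archive:
            archive.extractall(target)
        bad += [f"copy-{i}/{name}" for name in verify_tree(expected, target)]
    return bad
```

Run inside a guest whose disk is an rbd volume behind the cache tier, a call such as `untar_and_verify(Path("linux.tar.xz"), Path("/mnt/test"))` (both paths are placeholders) returns the files whose contents no longer match the archive; an empty list means no mismatch was seen. Reads served from the guest page cache can hide corruption on the cluster side, so verifying a data set larger than guest memory, or after dropping caches, is likely to give a more reliable signal.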
