Bug #12615: Repair of Erasure Coded pool with an unrepairable object causes pg state to lose clean state - RADOS - Ceph

Bug #12615

Updated by David Zafman almost 9 years ago

 
 In an erasure coded pool (2 + 1) with portions of 2 chunks of a single object corrupted, running a repair that cannot succeed causes the pg to lose its clean state. Because the pg is no longer clean, operations hang,
 and trying to repair again just causes the scrub to requeue continuously. The EIO from rados get requires the wip-12000-12200 branch changes.

 <pre> 
 $ rados -p ecpool get foo dz.out3 
 error getting ecpool/foo: (5) Input/output error 
 $ ./ceph pg dump pgs | grep ^3.6 
 dumped pgs in format plain 
 3.6       1         0         0         0         0         1048576 1         1         active+clean      2015-08-04 16:14:41.607821        16'1      16:8      [0,1,2] 0         [0,1,2] 0         0'0       2015-08-04 16:14:40.526211        0'0       2015-08-04 16:14:40.526211 
 $ ceph pg repair 3.6 
 instructing pg 3.6 on osd.0 to repair 
 $ ceph pg dump pgs | grep ^3.6 
 dumped pgs in format plain 
 3.6       1         1         4         0         1         1048576 1         1         active    2015-08-04 16:15:39.659583        16'1      16:10     [0,1,2] 0         [0,1,2] 0         16'1      2015-08-04 16:15:39.659434        16'1      2015-08-04 16:15:39.659434 
 [~/ceph/src] (wip-12000-12200-new) 
 $ ./rados -p ecpool get foo dz.out3 
 ^C 
 </pre> 

 To get to active+clean, I removed the broken object from the filestore and restarted the osd.
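
The state change visible in the two pg dump lines above (active+clean before the repair, active afterwards) could be checked from a script. The following is only a minimal sketch: the helper names `pg_state` and `is_clean` are hypothetical, and it assumes the state is the first field made up solely of lowercase words joined by `+`, which holds for the plain-format lines in this report but may not hold for other Ceph versions or output formats.

```python
import re

# Assumption: in a plain-format pg dump line, the state is the first field
# (after the pgid) consisting only of lowercase words joined by '+'.
STATE_RE = re.compile(r"^[a-z_]+(\+[a-z_]+)*$")

def pg_state(line):
    """Return (pgid, set of state flags) parsed from one pg dump line."""
    fields = line.split()
    for field in fields[1:]:
        if STATE_RE.match(field):
            return fields[0], set(field.split("+"))
    raise ValueError("no state field found in line")

def is_clean(line):
    """True if the pg state on this line includes the 'clean' flag."""
    return "clean" in pg_state(line)[1]

# The two lines for pg 3.6 from this report, before and after the repair.
BEFORE = ("3.6 1 0 0 0 0 1048576 1 1 active+clean 2015-08-04 16:14:41.607821 "
          "16'1 16:8 [0,1,2] 0 [0,1,2] 0 0'0 2015-08-04 16:14:40.526211 "
          "0'0 2015-08-04 16:14:40.526211")
AFTER = ("3.6 1 1 4 0 1 1048576 1 1 active 2015-08-04 16:15:39.659583 "
         "16'1 16:10 [0,1,2] 0 [0,1,2] 0 16'1 2015-08-04 16:15:39.659434 "
         "16'1 2015-08-04 16:15:39.659434")

if __name__ == "__main__":
    print(is_clean(BEFORE))  # True
    print(is_clean(AFTER))   # False
```

Matching the state by its shape rather than by column position avoids depending on the exact column layout, which this report does not document.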
