Bug #12615
Updated by David Zafman almost 9 years ago
On an erasure coded pool (2 + 1) with 2 chunks of a single object corrupted, doing a repair which can't succeed (a 2 + 1 pool can only tolerate one bad chunk) causes the pg to lose its clean state. The result of an unclean pg is that operations hang, and trying to repair again just causes scrub to requeue continuously. The EIO from rados get requires the wip-12000-12200 branch changes.

<pre>
$ rados -p ecpool get foo dz.out3
error getting ecpool/foo: (5) Input/output error
$ ./ceph pg dump pgs | grep ^3.6
dumped pgs in format plain
3.6 1 0 0 0 0 1048576 1 1 active+clean 2015-08-04 16:14:41.607821 16'1 16:8 [0,1,2] 0 [0,1,2] 0 0'0 2015-08-04 16:14:40.526211 0'0 2015-08-04 16:14:40.526211
$ ceph pg repair 3.6
instructing pg 3.6 on osd.0 to repair
$ ceph pg dump pgs | grep ^3.6
dumped pgs in format plain
3.6 1 1 4 0 1 1048576 1 1 active 2015-08-04 16:15:39.659583 16'1 16:10 [0,1,2] 0 [0,1,2] 0 16'1 2015-08-04 16:15:39.659434 16'1 2015-08-04 16:15:39.659434
[~/ceph/src] (wip-12000-12200-new)
$ ./rados -p ecpool get foo dz.out3
^C
</pre>

To get back to active+clean, I removed the broken object from the filestore and restarted the osd.
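
For anyone scripting a check for this regression, a rough sketch of pulling the pg state out of the plain-format dump might look like the following. The helper names `pg_state` and `is_clean` are hypothetical, and the state column index (9) is an assumption taken from the output pasted in this report; the column layout may differ in other Ceph versions.

```python
def pg_state(dump_text, pgid, state_column=9):
    """Return the state field for pgid from plain-format `ceph pg dump pgs` text.

    state_column=9 is an assumption based on the dump shown in this report.
    Returns None if the pg is not present in the text.
    """
    for line in dump_text.splitlines():
        fields = line.split()
        if fields and fields[0] == pgid and len(fields) > state_column:
            return fields[state_column]
    return None


def is_clean(state):
    """True if 'clean' is one of the '+'-separated flags in the state string."""
    return state is not None and "clean" in state.split("+")


# The two dump lines from this report: before and after `ceph pg repair 3.6`.
before = ("3.6 1 0 0 0 0 1048576 1 1 active+clean 2015-08-04 16:14:41.607821 "
          "16'1 16:8 [0,1,2] 0 [0,1,2] 0 0'0 2015-08-04 16:14:40.526211 "
          "0'0 2015-08-04 16:14:40.526211")
after = ("3.6 1 1 4 0 1 1048576 1 1 active 2015-08-04 16:15:39.659583 "
         "16'1 16:10 [0,1,2] 0 [0,1,2] 0 16'1 2015-08-04 16:15:39.659434 "
         "16'1 2015-08-04 16:15:39.659434")

print(pg_state(before, "3.6"), is_clean(pg_state(before, "3.6")))
print(pg_state(after, "3.6"), is_clean(pg_state(after, "3.6")))
```

The helpers work on captured text rather than invoking the CLI, so the same check can be run against saved output from a test run.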