Bug #9503

Updated by Florian Haas over 9 years ago

Back in March, Craig Lewis reported on the users list that several OSDs went to 100% CPU for no apparent reason, spinning and not responding to heartbeats until they were marked down by their peers, only to reappear a few minutes later, leaving the cluster in a state of perpetual degradation and recovery: 

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg08355.html 

 The user at the time reported that removing *all* snapshots in the cluster and restarting the affected OSDs made the issue go away, so no one ever got to the bottom of it. That report was for Emperor; I am now seeing the same problem on a Dumpling (0.67.10) cluster with 4 OSD nodes and a total of 92 OSDs (not exactly an optimal OSD-to-node ratio, but that alone shouldn't cause the cluster to go haywire the way it does). 

 On the affected hosts, the CPU-hogging OSDs spin while spewing these messages into their logs: 

 <pre> 
 2014-09-16 23:38:52.383297 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f55(unlocked)] enter Initial
 2014-09-16 23:38:52.693380 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] exit Initial 0.310083 0 0.000000
 2014-09-16 23:38:52.693393 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f55( v 86054'140712 (72936'132027,86054'140712] local-les=85226 n=931 ec=2992 les/c 85226/85226 85225/85225/85225) [63,85] r=1 lpr=0 pi=85155-85224/1 lcod 0'0 inactive NOTIFY] enter Reset
 2014-09-16 23:38:52.693952 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f77(unlocked)] enter Initial
 2014-09-16 23:38:52.963101 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] exit Initial 0.269148 0 0.000000
 2014-09-16 23:38:52.963115 7f59b5ff0780    5 osd.85 pg_epoch: 88246 pg[5.f77( v 86054'130308 (73304'121701,86054'130308] local-les=85219 n=981 ec=2992 les/c 85219/85219 85218/85218/85218) [48,85] r=1 lpr=0 pi=85145-85217/1 lcod 0'0 inactive NOTIFY] enter Reset
 </pre> 
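
 For reference, these are level-5 OSD log messages; a minimal sketch of how one can raise the debug level on a live daemon to capture them (osd.85 and the default log path are just the example from this cluster, adjust as needed): 

 <pre> 
 # Raise osd debug logging on the running daemon (no restart needed);
 # osd.85 is just the example daemon from above.
 ceph tell osd.85 injectargs '--debug-osd 5/5'

 # Follow the state-machine enter/exit messages (default log path assumed).
 tail -f /var/log/ceph/ceph-osd.85.log | grep -E 'enter Initial|exit Initial|enter Reset'

 # Drop logging back down afterwards.
 ceph tell osd.85 injectargs '--debug-osd 0/5'
 </pre> 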

 The OSDs then settle into a state where they log this (still at 100% CPU, and still getting randomly marked out and in): 

 <pre> 
 2014-09-16 23:44:18.329467 7f59340e5700    5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
 2014-09-16 23:44:21.231029 7f59340e5700    5 osd.85 88499 heartbeat: osd_stat(993 GB used, 2730 GB avail, 3723 GB total, peers [0,2,3,4,5,6,7,9,11,12,14,16,17,20,23,43,48,50,51,53,54,55,56,60,61,64,65,68,70,84,86]/[] op hist [])
 </pre> 
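
 The flapping itself is easiest to see from the monitor side; a rough sketch of one way to watch it (the grep pattern is only a guess at the relevant cluster-log messages): 

 <pre> 
 # Follow the cluster log from the monitors and pick out flapping events
 # ("wrongly marked me down" shows up when an OSD comes back and complains).
 ceph -w | grep -Ei 'marked.*(down|out)|boot|wrongly'

 # Point-in-time view of which OSDs are currently down or out.
 ceph osd tree
 </pre> 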

 The first set of log messages superficially looks similar to issue #6565, which, per http://ceph.com/releases/v0-67-9-dumpling-released/, should have been fixed in 0.67.9. This system is running 0.67.10, which leads me to tentatively rule that bug out (unless the fix broke something). 
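
 For completeness, a quick way to double-check that the running daemons really report 0.67.10 rather than some leftover older binary (default admin socket path assumed): 

 <pre> 
 # Version of the installed ceph binaries on the OSD node.
 ceph --version

 # Version that the running daemon itself reports, via its admin socket.
 ceph --admin-daemon /var/run/ceph/ceph-osd.85.asok version
 </pre> 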

 Just like the issue reported by Craig, this system has lots of RBD snapshots (all using image format 1, because the intended primary use case is kernel-mapped RBD volumes). Unlike Craig's system, it runs no radosgw.  
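
 For reference, a minimal sketch of how one can confirm the image format and count the snapshots per image; the pool name here is an assumption, substitute the real one: 

 <pre> 
 # For every image in the pool, print its format and snapshot count.
 pool=rbd   # assumption: substitute the actual RBD pool name
 for img in $(rbd ls "$pool"); do
     fmt=$(rbd info "$pool/$img" | awk '/format:/ {print $2}')
     snaps=$(rbd snap ls "$pool/$img" | tail -n +2 | wc -l)
     echo "$img format=$fmt snapshots=$snaps"
 done
 </pre> 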

 Shutting down the OSDs and marking them out serves no purpose other than making the issue promptly reappear somewhere else (just as Craig reported in March). 
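
 To be explicit, "shutting down and marking out" here means roughly the following (sysvinit-style service management and osd.85 are assumptions): 

 <pre> 
 # Stop the daemon on its node and mark it out so data rebalances away
 # (sysvinit assumed; osd.85 is just the example).
 service ceph stop osd.85
 ceph osd out 85

 # Bring it back once the spinning has moved elsewhere.
 ceph osd in 85
 service ceph start osd.85
 </pre> 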

 If I can provide any additional information to get to the bottom of this (as it's apparently unresolved at least in current Dumpling), please let me know.
