Bug #8338

Updated by Greg Farnum almost 10 years ago

I probably did this, but I'm not sure how. I found a hung kclient run today, blocked on OSD ops. This was in the OSD log:
 <pre>2014-05-12 03:13:14.500637 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 84 ==== osd_op(mds.0.1:84 10000000043.00000000 [delete] 0.6c5a8256 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (822468431 0 0) 0x37eb480 con 0x362adc0 
 2014-05-12 03:13:14.500707 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.500829 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 85 ==== osd_op(mds.0.1:85 10000000044.00000000 [delete] 0.e8cacbda snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1242924679 0 0) 0x332b480 con 0x362adc0 
 2014-05-12 03:13:14.500879 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.500993 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 86 ==== osd_op(mds.0.1:86 10000000053.00000000 [delete] 0.7e2ec456 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2313742898 0 0) 0x332b6c0 con 0x362adc0 
 2014-05-12 03:13:14.501057 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.501172 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 87 ==== osd_op(mds.0.1:87 10000000055.00000000 [delete] 0.aa4eaeb2 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (796757021 0 0) 0x35beb40 con 0x362adc0 
 2014-05-12 03:13:14.501220 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.501357 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 88 ==== osd_op(mds.0.1:88 10000000057.00000000 [delete] 0.fbd03b2e snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2209893053 0 0) 0x35be900 con 0x362adc0 
 2014-05-12 03:13:14.501406 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.501517 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 89 ==== osd_op(mds.0.1:89 10000000059.00000000 [delete] 0.4b440ff1 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1029022957 0 0) 0x35be6c0 con 0x362adc0 
 2014-05-12 03:13:14.501581 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3 
 2014-05-12 03:13:14.501692 7fbc39415700    1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 90 ==== osd_op(mds.0.1:90 1000000005b.00000000 [delete] 0.e46123f9 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2287437818 0 0) 0x35be480 con 0x362adc0 
 2014-05-12 03:13:14.501749 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3</pre>

 As best I can tell, we only get that output (even at debug_osd = 20, with neither a response nor any further log message) if the PG doesn't have a primary, or if we queued the op for a PG we expect to be created. 
 However, these logs are from 13 hours ago, and the OSD is on epoch 5, while the messages were sent at epoch 3. There is no check that ops which were targeted correctly at send time should *still* go to us. 
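
 For anyone wanting to confirm the epoch gap across a whole log, here is a rough Python sketch. It assumes the @eN@ token at the end of @osd_op(...)@ is the epoch the op was sent at, and that the number after @osd.N@ on the @should_share_map@ line is the OSD's own epoch; that reading matches the epochs quoted above but is an assumption about the log format. The function name and regexes are hypothetical, built only from the excerpt in this report.

```python
import re

# Assumed patterns, derived only from the log excerpt in this report.
OP_RE = re.compile(r"osd_op\((\S+) (\S+) \[(\w+)\] .*? e(\d+)\)")
OSD_EPOCH_RE = re.compile(r"\bosd\.(\d+) (\d+) should_share_map\b")


def stale_ops(lines):
    """Return (op_id, object_name, sent_epoch, osd_epoch) for every op
    whose sent epoch is older than the epoch the OSD logged right after it."""
    pending = None  # last osd_op seen, waiting for its should_share_map line
    found = []
    for line in lines:
        m = OP_RE.search(line)
        if m:
            pending = (m.group(1), m.group(2), int(m.group(4)))
            continue
        m = OSD_EPOCH_RE.search(line)
        if m and pending is not None:
            osd_epoch = int(m.group(2))
            if pending[2] < osd_epoch:
                found.append(pending + (osd_epoch,))
            pending = None
    return found
```

 Feeding it the lines from the excerpt should flag each delete op as sent at an older epoch than the one the OSD was on.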

 The MDS thinks it has 21 ops in flight, and the kclient has 3 or so, against the 86 the OSD has waiting. We just need to check against the current map whether the op should actually be targeted at us.
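
 To make the proposed check concrete, here is a toy Python model of it. This is not Ceph code: @WaitingOp@, @recheck_waiting_ops@, and the @primary_for_pg@ mapping are all hypothetical stand-ins, and the sketch assumes the current map can be reduced to a "PG id to primary OSD id" lookup. What to do with the ops that no longer belong to us is deliberately left open.

```python
from dataclasses import dataclass


@dataclass
class WaitingOp:
    """Hypothetical record for an op parked on the OSD."""
    op_id: str
    pgid: str
    sent_epoch: int


def recheck_waiting_ops(waiting, primary_for_pg, whoami):
    """Split waiting ops by whether the *current* map still targets us.

    primary_for_pg: hypothetical dict of pgid -> primary OSD id under the
    current map; a missing entry or None means the PG has no primary.
    Returns (still_ours, mistargeted).
    """
    still_ours, mistargeted = [], []
    for op in waiting:
        if primary_for_pg.get(op.pgid) == whoami:
            still_ours.append(op)
        else:
            mistargeted.append(op)
    return still_ours, mistargeted
```

 The point of the sketch is only that the decision is made against the current map each time it changes, rather than once against the map the op was sent with.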
