Bug #8338

OSD: no longer checking that ops on older maps are correctly targeted

Added by Greg Farnum over 5 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 05/12/2014
Due date:
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Description

I probably did this, but I'm not sure how. I found a hung kclient run today, blocked on OSD ops. This was in the OSD log:

2014-05-12 03:13:14.500637 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 84 ==== osd_op(mds.0.1:84 10000000043.00000000 [delete] 0.6c5a8256 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (822468431 0 0) 0x37eb480 con 0x362adc0
2014-05-12 03:13:14.500707 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.500829 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 85 ==== osd_op(mds.0.1:85 10000000044.00000000 [delete] 0.e8cacbda snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1242924679 0 0) 0x332b480 con 0x362adc0
2014-05-12 03:13:14.500879 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.500993 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 86 ==== osd_op(mds.0.1:86 10000000053.00000000 [delete] 0.7e2ec456 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2313742898 0 0) 0x332b6c0 con 0x362adc0
2014-05-12 03:13:14.501057 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501172 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 87 ==== osd_op(mds.0.1:87 10000000055.00000000 [delete] 0.aa4eaeb2 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (796757021 0 0) 0x35beb40 con 0x362adc0
2014-05-12 03:13:14.501220 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501357 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 88 ==== osd_op(mds.0.1:88 10000000057.00000000 [delete] 0.fbd03b2e snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2209893053 0 0) 0x35be900 con 0x362adc0
2014-05-12 03:13:14.501406 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501517 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 89 ==== osd_op(mds.0.1:89 10000000059.00000000 [delete] 0.4b440ff1 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1029022957 0 0) 0x35be6c0 con 0x362adc0
2014-05-12 03:13:14.501581 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501692 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 90 ==== osd_op(mds.0.1:90 1000000005b.00000000 [delete] 0.e46123f9 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2287437818 0 0) 0x35be480 con 0x362adc0
2014-05-12 03:13:14.501749 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3

Best I can tell, we only get that output (at debug_osd = 20! with neither a response nor a log message) if the PG doesn't have a primary, or if we queued the op for a PG we expect to be created.
However, these logs are from 13 hours ago, and the OSD is on epoch 5 while the messages were sent at epoch 3. There is no check that ops should still go to us if they were correctly targeted at send time.

The MDS thinks it has 21 ops in flight, and the kclient something like 3, against the 86 the OSD has waiting. We just need to check against the current map whether the op should actually be targeted at us.
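The failure mode above can be modeled in a few lines. This is a hedged toy sketch, not Ceph code: all names (dispatch_op_buggy, osdmaps, send_epoch) are hypothetical, and the map is reduced to a dict from epoch to the PG's primary OSD. The point is that the buggy path validates targeting only against the op's send-time epoch, then parks the op waiting for an osdmap update that may never arrive:

```python
def dispatch_op_buggy(op, my_id, osdmaps, waiting_for_map):
    """Toy model of the bug: only the send-time epoch is consulted."""
    # Targeting was correct at send time (epoch 3 in the logs above),
    # so the op is accepted rather than rejected as misdirected.
    if osdmaps[op["send_epoch"]].get(op["pg"]) != my_id:
        return "misdirected"
    # The PG has no primary here now, so the op is queued until a new
    # osdmap arrives -- and if none ever does, the client hangs.
    waiting_for_map.append(op)
    return "queued"

waiting = []
osdmaps = {3: {"0.6c": "osd.2"}}          # at epoch 3 this OSD was the target
op = {"send_epoch": 3, "pg": "0.6c"}
print(dispatch_op_buggy(op, "osd.2", osdmaps, waiting))  # → queued
```

With 86 ops stuck in that queue and no forthcoming map change, the clients above stay blocked indefinitely.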


Related issues

Duplicated by Ceph - Bug #8404: OSD: stopped doing useful work until scrub timeouts nudged PGs Duplicate 05/20/2014

Associated revisions

Revision 9f0825ca (diff)
Added by Greg Farnum over 5 years ago

OSD: verify that client ops are targeted correctly in the current epoch

We were previously only looking at the epoch the op was sent in, which meant
that if we had dropped responsibility somewhere between send_epoch and our
current epoch, we would queue up the op until a new osdmap came along. If it
never did, we could block all client IO against us...

Fixes: #8338

Signed-off-by: Greg Farnum <>
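The commit message above amounts to an extra check before queueing a client op: verify the mapping in the OSD's current epoch, not just the epoch the op was sent in. A minimal sketch, with hypothetical names and the osdmap reduced to an epoch-to-primary dict (not the actual Ceph implementation):

```python
def should_handle(op, my_id, osdmaps, current_epoch):
    """Keep the op only if this OSD is the target in the *current* map."""
    sent_ok = osdmaps[op["send_epoch"]].get(op["pg"]) == my_id
    still_ok = osdmaps[current_epoch].get(op["pg"]) == my_id
    # If the PG moved away between send_epoch and current_epoch, drop the
    # op instead of queueing it; the client retargets on its next map.
    return sent_ok and still_ok

# PG 0.6c moved from osd.2 to osd.1 at epoch 5:
osdmaps = {3: {"0.6c": "osd.2"}, 5: {"0.6c": "osd.1"}}
op = {"send_epoch": 3, "pg": "0.6c"}
print(should_handle(op, "osd.2", osdmaps, 5))  # → False: drop, don't queue
```

Dropping (rather than queueing) is safe here because the client will resend the op once it learns of the newer map, at which point it targets the correct OSD.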

History

#1 Updated by Greg Farnum over 5 years ago

  • Description updated (diff)

#2 Updated by Greg Farnum over 5 years ago

  • Status changed from New to Testing
  • Assignee changed from Greg Farnum to Samuel Just

wip-8338. It passes trivial tests (local cluster, rados bench); Sam said he'd run it through testing with some changes of his.

#3 Updated by Greg Farnum over 5 years ago

  • Status changed from Testing to Resolved

Merged to master by commit 1383b649d7ae97c99e9840c42bef0c0db5a0f65e as commit 9f0825ca13320187ee9d763160ea2f49738f83f2
