Bug #8338

closed

OSD: no longer checking that ops on older maps are correctly targeted

Added by Greg Farnum almost 10 years ago. Updated almost 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I probably introduced this, but I'm not sure how. I found a hung kclient run today, blocked on OSD ops. This was in the OSD log:

2014-05-12 03:13:14.500637 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 84 ==== osd_op(mds.0.1:84 10000000043.00000000 [delete] 0.6c5a8256 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (822468431 0 0) 0x37eb480 con 0x362adc0
2014-05-12 03:13:14.500707 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.500829 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 85 ==== osd_op(mds.0.1:85 10000000044.00000000 [delete] 0.e8cacbda snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1242924679 0 0) 0x332b480 con 0x362adc0
2014-05-12 03:13:14.500879 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.500993 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 86 ==== osd_op(mds.0.1:86 10000000053.00000000 [delete] 0.7e2ec456 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2313742898 0 0) 0x332b6c0 con 0x362adc0
2014-05-12 03:13:14.501057 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501172 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 87 ==== osd_op(mds.0.1:87 10000000055.00000000 [delete] 0.aa4eaeb2 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (796757021 0 0) 0x35beb40 con 0x362adc0
2014-05-12 03:13:14.501220 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501357 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 88 ==== osd_op(mds.0.1:88 10000000057.00000000 [delete] 0.fbd03b2e snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2209893053 0 0) 0x35be900 con 0x362adc0
2014-05-12 03:13:14.501406 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501517 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 89 ==== osd_op(mds.0.1:89 10000000059.00000000 [delete] 0.4b440ff1 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (1029022957 0 0) 0x35be6c0 con 0x362adc0
2014-05-12 03:13:14.501581 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3
2014-05-12 03:13:14.501692 7fbc39415700  1 -- 10.214.131.2:6810/20680 <== mds.0 10.214.131.19:6815/25798 90 ==== osd_op(mds.0.1:90 1000000005b.00000000 [delete] 0.e46123f9 snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (2287437818 0 0) 0x35be480 con 0x362adc0
2014-05-12 03:13:14.501749 7fbc39415700 20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3

As best I can tell, we only get that output (and only at debug_osd = 20!), with neither a response nor a log message, if the PG doesn't have a primary or if we queued the op for a PG we expect to be created.
However, these logs are from 13 hours ago, and the OSD is on epoch 5 while the messages were sent at epoch 3. If an op was targeted correctly at send time, there is no check that it should still go to us.
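
To make the epoch gap concrete, here is a minimal sketch that pulls both epochs out of log lines like the ones above. It assumes the trailing `eN` inside `osd_op(...)` is the epoch the sender used and that the number after `osd.2` in the `should_share_map` line is the OSD's current epoch, which matches the values quoted in this report. The helper names are hypothetical and not part of Ceph.

```python
import re

# Assumption: the trailing "eN" inside osd_op(...) is the sender's map epoch.
OP_EPOCH_RE = re.compile(r"osd_op\(.*\be(\d+)\) v\d+")
# Assumption: "osd.<id> <epoch> should_share_map" carries the OSD's current epoch.
OSD_EPOCH_RE = re.compile(r"osd\.(\d+) (\d+) should_share_map")


def op_epoch(line):
    """Return the epoch an op was sent at, or None if the line is not an osd_op."""
    m = OP_EPOCH_RE.search(line)
    return int(m.group(1)) if m else None


def osd_epoch(line):
    """Return the OSD's current epoch from a should_share_map line, or None."""
    m = OSD_EPOCH_RE.search(line)
    return int(m.group(2)) if m else None


# Fragments copied from the log excerpt in this report.
op_line = ("osd_op(mds.0.1:84 10000000043.00000000 [delete] 0.6c5a8256 "
           "snapc 1=[] ondisk+write e3) v4 ==== 171+0+0 (822468431 0 0)")
map_line = "20 osd.2 5 should_share_map mds.0 10.214.131.19:6815/25798 3"

# Number of epochs the op is behind the OSD's current map.
lag = osd_epoch(map_line) - op_epoch(op_line)
```

A nonzero `lag` only shows that the op was sent against an older map; whether the op is still correctly targeted has to be decided against the current map.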

The MDS thinks it has 21 ops in flight, and the kclient has 3 or so, against the 86 the OSD has waiting. We just need to check against the current map whether the op should actually be targeted at us.
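
The proposed check can be sketched with a toy model. This is not the Ceph implementation and not the code that was merged: `ToyOSDMap`, `ToyOp`, `still_targeted_at_us`, `triage_waiting_ops`, and the PG ids used in the example are all hypothetical, and the real OSD map computes placement rather than storing a lookup table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyOSDMap:
    """Hypothetical stand-in for an OSD map: an epoch plus a PG -> primary table."""
    epoch: int
    primary_by_pg: Dict[str, int] = field(default_factory=dict)

    def primary_for(self, pgid: str) -> Optional[int]:
        # None models a PG that currently has no primary in this map.
        return self.primary_by_pg.get(pgid)


@dataclass
class ToyOp:
    """Hypothetical queued op: the PG it targets and the epoch the sender used."""
    pgid: str
    sent_epoch: int


def still_targeted_at_us(op: ToyOp, current: ToyOSDMap, whoami: int) -> bool:
    """True only if the current map names this OSD as primary for the op's PG."""
    return current.primary_for(op.pgid) == whoami


def triage_waiting_ops(ops: List[ToyOp], current: ToyOSDMap,
                       whoami: int) -> Tuple[List[ToyOp], List[ToyOp]]:
    """Split waiting ops into those still ours and those that are mistargeted."""
    keep, mistargeted = [], []
    for op in ops:
        if still_targeted_at_us(op, current, whoami):
            keep.append(op)
        else:
            mistargeted.append(op)
    return keep, mistargeted


# Example: osd.2 holds ops sent at epoch 3 and re-checks them against epoch 5.
current_map = ToyOSDMap(epoch=5, primary_by_pg={"0.56": 2, "0.da": 1})
waiting = [ToyOp("0.56", 3), ToyOp("0.da", 3), ToyOp("0.f9", 3)]
keep, mistargeted = triage_waiting_ops(waiting, current_map, whoami=2)
```

The point of the sketch is that the decision uses the current map rather than the epoch the op was sent at, so an op that was correct at epoch 3 is no longer left waiting indefinitely once the mapping has changed. What to do with the mistargeted ops is outside the sketch.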


Related issues 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #8404: OSD: stopped doing useful work until scrub timeouts nudged PGs (Duplicate, 05/20/2014)

Actions #1

Updated by Greg Farnum almost 10 years ago

  • Description updated (diff)
Actions #2

Updated by Greg Farnum almost 10 years ago

  • Status changed from New to 7
  • Assignee changed from Greg Farnum to Samuel Just

wip-8338. It passes trivial tests (local cluster, rados bench); Sam said he'd run it through testing with some changes of his.

Actions #3

Updated by Greg Farnum almost 10 years ago

  • Status changed from 7 to Resolved

Merged to master by commit 1383b649d7ae97c99e9840c42bef0c0db5a0f65e as commit 9f0825ca13320187ee9d763160ea2f49738f83f2
