Bug #22544

closed

objecter cannot resend split-dropped op when racing with con reset

Added by mingxin liu over 6 years ago. Updated over 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


if (split && con && con->has_features(CEPH_FEATUREMASK_RESEND_ON_SPLIT)) {
  return RECALC_OP_TARGET_NEED_RESEND;
}

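Whether the op is resent depends on the connection's feature bits. If the connection was just reset, its feature bits are empty, so the check fails and the op slips through without being resent.
Furthermore, if this op is eventually resent after some newer writes (which can happen because the acting set changed, the connection was reset again, and so on), the writes end up out of order.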
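The race can be illustrated with a small stand-alone model of the check quoted above. This is only a sketch: `FakeConnection`, `FEATURE_RESEND_ON_SPLIT`, `Recalc` and `recalc_on_split` are hypothetical stand-ins, not the real Ceph types or the real value of CEPH_FEATUREMASK_RESEND_ON_SPLIT, and it assumes that a freshly reset connection reports no features until negotiation completes.

```cpp
#include <cstdint>

// Placeholder bit for illustration only; NOT the real Ceph feature mask.
constexpr uint64_t FEATURE_RESEND_ON_SPLIT = 1ull << 0;

// Hypothetical stand-in for a messenger connection.
struct FakeConnection {
  uint64_t features = 0;  // assumed empty right after a reset
  bool has_features(uint64_t mask) const { return (features & mask) == mask; }
};

enum class Recalc { NoAction, NeedResend };

// Mirrors the shape of the check in the description: the resend is only
// requested when the connection already advertises the feature.
Recalc recalc_on_split(bool split, const FakeConnection* con) {
  if (split && con && con->has_features(FEATURE_RESEND_ON_SPLIT)) {
    return Recalc::NeedResend;
  }
  return Recalc::NoAction;
}
```

With a default-constructed `FakeConnection` (features still empty), `recalc_on_split(true, &con)` returns `Recalc::NoAction` even though the PG split, which is the case where the op is not resent.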

Should we move the objecter resend logic from ms_handle_reset to ms_handle_connect?


Related issues 3 (0 open, 3 closed)

Has duplicate RADOS - Bug #23402: objecter: does not resend op on split interval (Duplicate, 03/19/2018)
Copied to RADOS - Backport #35843: mimic: objecter cannot resend split-dropped op when racing with con reset (Resolved, Nathan Cutler)
Copied to RADOS - Backport #35844: luminous: objecter cannot resend split-dropped op when racing with con reset (Resolved, Prashant D)
Actions #1

Updated by John Spray over 6 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Josh Durgin over 6 years ago

  • Priority changed from Normal to Urgent
Actions #3

Updated by Sage Weil over 6 years ago

  • Status changed from New to 12

Hmm, I'm not sure what the best fix is. Do you see a good path to fixing this with ms_handle_connect()?

Actions #4

Updated by Sage Weil over 5 years ago

Here, it happened:

2018-08-31 20:50:46.286 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 155 ==== osd_op(client.4338.0:9206 2.5s0 2.60bf6c05 (undecoded) ondisk+write+known_if_redirected e84) v8 ==== 526+0+114688 (2347694735 0 2580345875) 0x558626969a00 con 0x558626eeb100
2018-08-31 20:50:46.286 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 156 ==== osd_op(client.4338.0:9207 2.5s0 2.60bf6c05 (undecoded) ondisk+write+known_if_redirected e84) v8 ==== 526+0+363 (2433957676 0 3325077087) 0x558626968000 con 0x558626eeb100
2018-08-31 20:50:46.286 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 157 ==== osd_op(client.4338.0:9208 2.5s0 2.60bf6c05 (undecoded) ondisk+read+rwordered+known_if_redirected e84) v8 ==== 526+0+0 (1514364427 0 0) 0x558625722080 con 0x558626eeb100
2018-08-31 20:50:46.342 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 1 ==== osd_op(client.4338.0:9226 2.5s0 2.9d089415 (undecoded) ondisk+retry+write+known_if_redirected e84) v8 ==== 263+0+614400 (405944600 0 301684070) 0x558627126080 con 0x5586261f8700
2018-08-31 20:50:46.342 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 2 ==== osd_op(client.4338.0:9227 2.15s0 2.9d089415 (undecoded) ondisk+write+known_if_redirected e85) v8 ==== 225+0+442368 (4173242535 0 1877614199) 0x558626cdda00 con 0x5586261f8700
2018-08-31 20:50:46.342 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 3 ==== osd_op(client.4338.0:9228 2.15s0 2.9d089415 (undecoded) ondisk+write+known_if_redirected e85) v8 ==== 225+0+360448 (2218567718 0 4227264512) 0x5586277d1040 con 0x5586261f8700
2018-08-31 20:50:46.342 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 4 ==== osd_op(client.4338.0:9229 2.15s0 2.9d089415 (undecoded) ondisk+write+known_if_redirected e85) v8 ==== 225+0+62 (3421199549 0 11496458) 0x5586271ec340 con 0x5586261f8700
2018-08-31 20:50:46.342 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 5 ==== osd_op(client.4338.0:9230 2.15s0 2.9d089415 (undecoded) ondisk+read+rwordered+known_if_redirected e85) v8 ==== 225+0+0 (118536143 0 0) 0x5586271ecd00 con 0x5586261f8700
2018-08-31 20:50:48.514 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 6 ==== osd_op(client.4338.0:9247 2.bs0 2.9faadfbb (undecoded) ondisk+write+known_if_redirected e86) v8 ==== 527+0+770048 (4084052600 0 914887116) 0x558627ff2680 con 0x5586261f8700
2018-08-31 20:50:48.518 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 7 ==== osd_op(client.4338.0:9248 2.bs0 2.9faadfbb (undecoded) ondisk+write+known_if_redirected e86) v8 ==== 527+0+655360 (3769274892 0 2961804097) 0x558627ff29c0 con 0x5586261f8700
2018-08-31 20:50:48.530 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 1 ==== osd_op(client.4338.0:9226 2.15s0 2.9d089415 (undecoded) ondisk+retry+write+known_if_redirected e86) v8 ==== 263+0+614400 (3320940746 0 301684070) 0x558626f24680 con 0x558626f99500
2018-08-31 20:50:48.530 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 2 ==== osd_op(client.4338.0:9247 2.bs0 2.9faadfbb (undecoded) ondisk+retry+write+known_if_redirected e86) v8 ==== 527+0+770048 (343352739 0 914887116) 0x558627ff2d00 con 0x558626f99500
2018-08-31 20:50:48.534 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 3 ==== osd_op(client.4338.0:9248 2.bs0 2.9faadfbb (undecoded) ondisk+retry+write+known_if_redirected e86) v8 ==== 527+0+655360 (128976343 0 2961804097) 0x558625eb4a40 con 0x558626f99500
2018-08-31 20:50:48.534 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 4 ==== osd_op(client.4338.0:9249 2.bs0 2.9faadfbb (undecoded) ondisk+write+known_if_redirected e86) v8 ==== 527+0+147456 (2079555912 0 2676445555) 0x558627ff36c0 con 0x558626f99500
2018-08-31 20:50:48.534 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 5 ==== osd_op(client.4338.0:9250 2.bs0 2.9faadfbb (undecoded) ondisk+write+known_if_redirected e86) v8 ==== 527+0+364 (566634381 0 2994847827) 0x558627ff4080 con 0x558626f99500
2018-08-31 20:50:48.534 7fa1d0e6e700  1 -- 172.21.15.135:6800/10478 <== client.4338 172.21.15.62:58820/2203252198 6 ==== osd_op(client.4338.0:9251 2.bs0 2.9faadfbb (undecoded) ondisk+read+rwordered+known_if_redirected e86) v8 ==== 527+0+0 (4294433175 0 0) 0x558627ff4a40 con 0x558626f99500

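Note the pgids and OSD epochs of 9226 and 9227: 9226 arrives for 2.5s0 at e84, 9227 arrives for 2.15s0 at e85, and 9226 only reaches 2.15s0 later, as a retry at e86.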
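Spotting this pattern by eye in a long OSD log is tedious, so here is a rough sketch of a checker. It is a heuristic, not a Ceph tool: `SeenOp`, `parse_osd_op` and `find_late_ops` are hypothetical names, the regular expression assumes the `osd_op(client.X.Y:TID PGID ... eEPOCH)` layout seen in the lines above, and an op is flagged only when it shows up on a pgid for the first time with a tid lower than one already seen on that pgid (so in-order retries on the same pgid are not flagged).

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <regex>
#include <set>
#include <string>
#include <vector>

// One osd_op as it appears in an OSD messenger log line (hypothetical struct).
struct SeenOp {
  uint64_t tid;
  std::string pgid;
  uint32_t epoch;
};

// Extract tid, pgid and epoch from a line; returns nullopt if no osd_op found.
std::optional<SeenOp> parse_osd_op(const std::string& line) {
  static const std::regex re(R"re(osd_op\(\S+:(\d+) (\S+) .* e(\d+)\))re");
  std::smatch m;
  if (!std::regex_search(line, m, re)) {
    return std::nullopt;
  }
  return SeenOp{std::stoull(m[1].str()), m[2].str(),
                static_cast<uint32_t>(std::stoul(m[3].str()))};
}

// Return ops that reach a pgid for the first time after a higher tid
// has already been seen on that same pgid.
std::vector<SeenOp> find_late_ops(const std::vector<std::string>& lines) {
  std::map<std::string, std::set<uint64_t>> seen;  // pgid -> tids seen so far
  std::vector<SeenOp> late;
  for (const auto& line : lines) {
    auto op = parse_osd_op(line);
    if (!op) {
      continue;
    }
    auto& tids = seen[op->pgid];
    const bool first_time = tids.count(op->tid) == 0;
    if (first_time && !tids.empty() && op->tid < *tids.rbegin()) {
      late.push_back(*op);
    }
    tids.insert(op->tid);
  }
  return late;
}
```

Fed the log lines above in order, this heuristic would be expected to single out 9226 on 2.15s0, since 9227 through 9230 reach that pgid before it does.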

/a/sage-2018-08-31_18:31:48-rados-wip-sage-testing-2018-08-31-1010-distro-basic-smithi/2964779

Actions #5

Updated by Sage Weil over 5 years ago

  • Backport set to mimic,luminous
Actions #6

Updated by Sage Weil over 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #7

Updated by Kefu Chai over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #8

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #35843: mimic: objecter cannot resend split-dropped op when racing with con reset added
Actions #9

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #35844: luminous: objecter cannot resend split-dropped op when racing with con reset added
Actions #10

Updated by Nathan Cutler over 5 years ago

  • Status changed from Pending Backport to Resolved
Actions #11

Updated by Greg Farnum over 4 years ago

  • Has duplicate Bug #23402: objecter: does not resend op on split interval added