Bug #18366

cache tiering: base pool last_force_resend not respected (racing read got wrong version)

Added by Sage Weil over 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Backport: kraken, jewel
Regression: No
Severity: 3 - minor

Description

/a/sage-2016-12-28_23:29:38-rados-wip-sage-testing---basic-smithi/673129

remote/smithi053/log/ceph-osd.1.log:2016-12-29 01:11:09.646003 7fa73be07700  1 -- 172.21.15.53:6804/25434 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34303 smithi02126561-7602 [write 550007~604296,stat] v61'5658 uv6288 ondisk = 0) v7 -- 0x7fa7700023c0 con 0
remote/smithi021/log/ceph-osd.5.log:2016-12-29 01:11:09.650247 7f24b0c81700  1 -- 172.21.15.21:6808/25814 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34303 smithi02126561-7602 [write 550007~604296,stat] v61'2643 uv6236 ondisk = 0) v7 -- 0x7f24e4bb6b00 con 0
remote/smithi021/log/ceph-osd.5.log:2016-12-29 01:11:09.663276 7f24b0c81700  1 -- 172.21.15.21:6808/25814 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34304 smithi02126561-7602 [write 1774455~751016] v61'2644 uv6237 ondisk = 0) v7 -- 0x7f24e94d7700 con 0
remote/smithi021/log/ceph-osd.5.log:2016-12-29 01:11:09.679605 7f24b0c81700  1 -- 172.21.15.21:6808/25814 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34305 smithi02126561-7602 [write 3215698~534309] v61'2645 uv6238 ondisk = 0) v7 -- 0x7f24e57bd9c0 con 0
remote/smithi021/log/ceph-osd.5.log:2016-12-29 01:11:09.680446 7f24b3486700  1 -- 172.21.15.21:6808/25814 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34306 smithi02126561-7602 [setxattr _header (58),truncate 3750007] v61'2646 uv6239 ondisk = 0) v7 -- 0x7f24e47b9e40 con 0
remote/smithi021/log/ceph-osd.5.log:2016-12-29 01:11:09.681370 7f24b0c81700  1 -- 172.21.15.21:6808/25814 --> 172.21.15.21:0/1021185627 -- osd_op_reply(34307 smithi02126561-7602 [read 0~1] v0'0 uv6239 ondisk = 0) v7 -- 0x7f24eb646b00 con 0

Note the two replies for tid 34303 with differing uv (6288 from osd.1 vs 6236 from osd.5), and the test got:
2016-12-29T01:11:58.646 INFO:tasks.rados.rados.0.smithi021.stdout:7602:  finishing write tid 1 to smithi02126561-7602
2016-12-29T01:11:58.648 INFO:tasks.rados.rados.0.smithi021.stdout:7602:  finishing write tid 2 to smithi02126561-7602
2016-12-29T01:11:58.650 INFO:tasks.rados.rados.0.smithi021.stdout:7602:  finishing write tid 3 to smithi02126561-7602
2016-12-29T01:11:58.652 INFO:tasks.rados.rados.0.smithi021.stdout:7602:  finishing write tid 5 to smithi02126561-7602
2016-12-29T01:11:58.654 INFO:tasks.rados.rados.0.smithi021.stdout:7602:  finishing write tid 6 to smithi02126561-7602
2016-12-29T01:11:58.656 INFO:tasks.rados.rados.0.smithi021.stdout:update_object_version oid 7602 v 6288 (ObjNum 7601 snap 0 seq_num 7601) dirty exists
2016-12-29T01:11:58.657 INFO:tasks.rados.rados.0.smithi021.stderr:Error: racing read on 7602 returned version 6239 rather than version 6288
2016-12-29T01:11:58.659 INFO:tasks.rados.rados.0.smithi021.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6328-gc0967ae/rpm/el7/BUILD/ceph-11.1.0-6328-gc0967ae/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7f94a6ffd700 time 2016-12-29 01:11:10.057739
2016-12-29T01:11:58.661 INFO:tasks.rados.rados.0.smithi021.stderr:/mnt/jenkins/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.1.0-6328-gc0967ae/rpm/el7/BUILD/ceph-11.1.0-6328-gc0967ae/src/test/osd/RadosModel.h: 896: FAILED assert(0 == "racing read got wrong version")

On osd.1:

2016-12-29 01:11:09.637017 7fa73e60c700 20 osd.1 pg_epoch: 61 pg[1.3( v 42'5657 (26'2600,42'5657] local-les=11 n=1873 ec=10 les/c/f 11/11/0 10/10/10) [1,2] r=0 lpr=10 crt=42'5657 lcod 42'5656 mlcod 42'5656 active+clean] do_op: op osd_op(client.4123.0:34303 1.d12f0d43 smithi02126561-7602 [write 550007~604296,stat] snapc 0=[] RETRY=1 ondisk+retry+write+known_if_redirected e61) v7

And on osd.5:
2016-12-29 01:11:09.635315 7f24b0c81700 20 osd.5 pg_epoch: 61 pg[2.3( v 42'2641 (0'0,42'2641] local-les=12 n=3 ec=11 les/c/f 12/12/0 11/11/11) [5,2] r=0 lpr=11 crt=42'2641 lcod 42'2640 mlcod 42'2640 active+clean] do_op: op osd_op(client.4123.0:34303 2.d12f0d43 smithi02126561-7602 [write 550007~604296,stat] snapc 0=[] ondisk+write+known_if_redirected e40) v7

Note that the osd.5 request was sent in epoch 40, whereas osd.1's (marked RETRY=1) was sent in epoch 61.

For the base pool we have:
"last_force_op_resend": "60",

Why didn't the client resend?

And/or, should the OSD discard the stale op?
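
For reference, a minimal sketch of the two complementary checks these questions point at, loosely modeled on the Objecter and OSD behavior around last_force_op_resend; the types and function names below are simplified stand-ins, not the actual Ceph code:

#include <cstdint>

using epoch_t = uint32_t;

struct PoolInfo {
  // Epoch at which all outstanding client ops targeting this pool must
  // be resent (bumped e.g. when a cache tier is added or removed).
  epoch_t last_force_op_resend = 0;
};

// Client side: on each new osdmap, an in-flight op sent before the
// target pool forced a resend must be resent. For cache tiering this
// has to consult the *base* pool's value, not only the cache pool the
// op was first directed at.
bool client_must_resend(epoch_t op_sent_epoch, const PoolInfo& pool) {
  return op_sent_epoch < pool.last_force_op_resend;
}

// OSD side: an op stamped with a map epoch older than the pool's
// last_force_op_resend is stale and can be discarded, because the
// client is obliged to resend it anyway.
bool osd_should_discard(epoch_t op_map_epoch, const PoolInfo& pool) {
  return op_map_epoch < pool.last_force_op_resend;
}

int main() {
  PoolInfo base_pool;
  base_pool.last_force_op_resend = 60;  // "last_force_op_resend": "60" above
  // The copy of op 34303 that osd.5 applied was sent in epoch 40 (< 60):
  // either check would have prevented the stale uv 6236 apply/reply.
  return (client_must_resend(40, base_pool) &&
          osd_should_discard(40, base_pool)) ? 0 : 1;
}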


Related issues (2): 0 open, 2 closed

Copied to Ceph - Backport #18403: kraken: cache tiering: base pool last_force_resend not respected (racing read got wrong version) (Resolved, Nathan Cutler)
Copied to Ceph - Backport #18404: jewel: cache tiering: base pool last_force_resend not respected (racing read got wrong version) (Resolved, Nathan Cutler)

#1 Updated by Sage Weil over 7 years ago

  • Subject changed from cache tiering: last_force_resend not always set (racing read got wrong version) to cache tiering: base pool last_force_resend not respected (racing read got wrong version)
  • Description updated (diff)

#2 Updated by Sage Weil over 7 years ago

  • Status changed from 12 to Fix Under Review

#3 Updated by Kefu Chai over 7 years ago

  • Backport set to kraken, jewel

#4 Updated by Kefu Chai over 7 years ago

  • Status changed from Fix Under Review to Pending Backport

#5 Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #18403: kraken: cache tiering: base pool last_force_resend not respected (racing read got wrong version) added

#6 Updated by Nathan Cutler over 7 years ago

  • Copied to Backport #18404: jewel: cache tiering: base pool last_force_resend not respected (racing read got wrong version) added

#7 Updated by Nathan Cutler about 7 years ago

  • Status changed from Pending Backport to Resolved