Bug #17831
osd: ENOENT on clone
Description
-1542> 2016-11-08 21:42:03.424076 7f6da6f54700 10 filestore(/var/lib/ceph/osd/ceph-1) clone_range 1.7a_head/#1:5fe289c3:::smithi06215392-46:127# -> 1.7a_head/#1:5fe289c3:::smithi06215392-46:123# 0~3663992 to 0 = -2
-1541> 2016-11-08 21:42:03.424081 7f6da6f54700 -1 filestore(/var/lib/ceph/osd/ceph-1) error (2) No such file or directory not handled on operation 0x7f6dbf9b9a30 (13361.0.6, or op 6, counting from 0)
-1540> 2016-11-08 21:42:03.424084 7f6da6f54700  0 filestore(/var/lib/ceph/osd/ceph-1) ENOENT on clone suggests osd bug
...
-1509> 2016-11-08 21:42:03.424086 7f6da6f54700  0 filestore(/var/lib/ceph/osd/ceph-1) transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#"
        },
        {
            "op_num": 1,
            "op_name": "touch",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#"
        },
        {
            "op_num": 2,
            "op_name": "truncate",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "offset": 3663992
        },
        {
            "op_num": 3,
            "op_name": "omap_setheader",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "header_length": "0"
        },
        {
            "op_num": 4,
            "op_name": "op_setallochint",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "expected_object_size": "0",
            "expected_write_size": "0"
        },
        {
            "op_num": 5,
            "op_name": "setattrs",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "attr_lens": {
                "_": 289,
                "__header": 58
            }
        },
        {
            "op_num": 6,
            "op_name": "clonerange2",
            "collection": "1.7a_head",
            "src_oid": "#1:5fe289c3:::smithi06215392-46:127#",
            "dst_oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "src_offset": 0,
            "len": 3663992,
            "dst_offset": 0
        },
        {
            "op_num": 7,
            "op_name": "omap_setkeys",
            "collection": "meta",
            "oid": "#-1:c0371625:::snapmapper:0#",
            "attr_lens": {
                "OBJ_0000000000000001.AF74193C.123.smithi06215392-46..": 106
            }
        },
        {
            "op_num": 8,
            "op_name": "omap_setkeys",
            "collection": "meta",
            "oid": "#-1:c0371625:::snapmapper:0#",
            "attr_lens": {
                "MAP_0000000000000112_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000113_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000116_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_000000000000011E_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000123_0000000000000001.AF74193C.123.smithi06215392-46..": 70
            }
        },
        {
            "op_num": 9,
            "op_name": "omap_rmkeys",
            "collection": "1.7a_head",
            "oid": "#1:5e000000::::head#"
        },
        {
            "op_num": 10,
            "op_name": "omap_setkeys",
            "collection": "1.7a_head",
            "oid": "#1:5e000000::::head#",
            "attr_lens": {
                "_info": 855
            }
        }
    ]
}

/a/sage-2016-11-08_20:40:20-rados:thrash-wip-sage-testing---basic-smithi/532700
History
#1 Updated by Xinze Chi about 6 years ago
The recovery of an object may use a snap object generated by a clone op as its source to complete the recovery process, but that snap object may not yet have been applied to the store by the remote peer, because the recovery op's priority is higher than the clone op's.
So the remote peer may process the recovery of the object first and only apply the clone op afterwards.
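To make the ordering concrete, here is a minimal, self-contained C++ sketch of the race (this is not Ceph code; the op names and priority values are made up for illustration): a higher-priority recovery push of clone :123, which clone_ranges from :127, is dequeued before the lower-priority repop that would have created :127, so the push finds no source and fails with ENOENT, exactly as in the log above.

// Illustration only: priorities and op names are hypothetical, not Ceph's real values.
#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <vector>

struct Op {
  int priority;            // higher value = dequeued first
  std::string name;
  std::string creates;     // object this op creates on the replica
  std::string needs;       // object this op reads as a clone source ("" if none)
};

struct ByPriority {
  bool operator()(const Op& a, const Op& b) const { return a.priority < b.priority; }
};

int main() {
  std::priority_queue<Op, std::vector<Op>, ByPriority> q;
  // Repop that creates the snap object :127 on the replica (lower priority).
  q.push({10, "repop: clone head -> :127", ":127", ""});
  // Recovery push of :123 that clone_ranges from :127 (higher priority).
  q.push({63, "push: recover :123 via clone_range from :127", ":123", ":127"});

  std::set<std::string> on_disk;  // objects the replica has actually applied
  while (!q.empty()) {
    Op op = q.top(); q.pop();
    if (!op.needs.empty() && !on_disk.count(op.needs)) {
      std::cout << op.name << " -> ENOENT (source " << op.needs << " not applied yet)\n";
      continue;  // this is the failure seen in the transaction dump above
    }
    on_disk.insert(op.creates);
    std::cout << op.name << " -> ok\n";
  }
}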
#2 Updated by Samuel Just about 6 years ago
- Assignee set to David Zafman
David: can you take a look?
#3 Updated by Samuel Just about 6 years ago
- Related to Bug #15774: osd_op_queue_cut_off osd_op_queue debug_random generate assert failure. added
#4 Updated by Samuel Just about 6 years ago
- Priority changed from Urgent to Immediate
#5 Updated by Samuel Just about 6 years ago
- Assignee changed from David Zafman to Samuel Just
#6 Updated by Samuel Just about 6 years ago
- Priority changed from Immediate to High
Xinze Chi's diagnosis is entirely correct. This is actually pretty simple: we just need to modify ReplicatedBackend to take RWState locks on clone sources. No need for fancy blocking either; we always have the option of simply not using that clone.
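For reference, a rough sketch of what "take RWState locks on clone sources, or simply don't use that clone" could look like. This is not the actual wip-17831 change; ObjectContext, try_get_read_lock and drop_read_lock here are simplified stand-ins for whatever the real ObjectContext/RWState interface provides.

// Hypothetical sketch: use a clone as a push source only if a non-blocking
// read lock on it succeeds; otherwise skip that clone (no fancy blocking).
#include <memory>
#include <utility>
#include <vector>

struct ObjectContext {
  int readers = 0;
  bool writing = false;                 // set while a repop is mutating the object
  bool try_get_read_lock() {            // stand-in for a non-blocking RWState read lock
    if (writing) return false;
    ++readers;
    return true;
  }
  void drop_read_lock() { --readers; }
};
using ObjectContextRef = std::shared_ptr<ObjectContext>;

struct hobject_t {};                    // stand-in for the real Ceph object id type

// While building a push, return the clone sources we may safely clone_range
// from, holding a read lock on each so an in-flight repop cannot leave the
// source unapplied (or deleted) underneath the push.
std::vector<ObjectContextRef> lock_clone_sources(
    const std::vector<std::pair<hobject_t, ObjectContextRef>>& candidates) {
  std::vector<ObjectContextRef> locked;
  for (const auto& [oid, obc] : candidates) {
    if (obc && obc->try_get_read_lock()) {
      locked.push_back(obc);            // safe to use this clone as a source
    }
    // else: skip this clone and recover the data some other way
  }
  return locked;
}

// When the push completes, release the locks.
void unlock_clone_sources(std::vector<ObjectContextRef>& locked) {
  for (auto& obc : locked) obc->drop_read_lock();
  locked.clear();
}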
#7 Updated by Samuel Just about 6 years ago
- Status changed from New to 7
#8 Updated by Samuel Just about 6 years ago
- Status changed from 7 to In Progress
https://github.com/athanatos/ceph/tree/wip-17831 -- it'll be a little bit before I have time to test and debug it.
#9 Updated by Samuel Just about 6 years ago
- Duplicated by Bug #18373: osd: repop vs push race added
#10 Updated by Samuel Just about 6 years ago
- Priority changed from High to Immediate
Sage saw this again, see http://tracker.ceph.com/issues/18373
#11 Updated by Samuel Just about 6 years ago
- Status changed from In Progress to 7
#12 Updated by Samuel Just about 6 years ago
- Status changed from 7 to Fix Under Review
#13 Updated by Samuel Just about 6 years ago
#14 Updated by Alexey Sheplyakov about 6 years ago
I guess the fix should be backported to jewel, since jewel does not lock the clone source.
#15 Updated by Sage Weil about 6 years ago
- Status changed from Fix Under Review to Pending Backport
#16 Updated by Sage Weil about 6 years ago
- Backport set to kraken,jewel
#17 Updated by Alexey Sheplyakov about 6 years ago
- Copied to Backport #18581: jewel: osd: ENOENT on clone added
#18 Updated by Samuel Just about 6 years ago
- Related to Bug #18583: osd: calc_clone_subsets misuses try_read_lock vs missing added
#19 Updated by Nathan Cutler about 6 years ago
- Copied to Backport #18610: kraken: osd: ENOENT on clone added
#20 Updated by Samuel Just almost 6 years ago
- Related to Bug #18809: FAILED assert(object_contexts.empty()) (live on master only from Jan-Feb 2017, all other instances are different) added
#21 Updated by Samuel Just almost 6 years ago
- Status changed from Pending Backport to Resolved
- Backport deleted (kraken,jewel)
http://tracker.ceph.com/issues/18927 and http://tracker.ceph.com/issues/18809 were caused by this series; I don't think we should backport it.