Bug #17831

closed

osd: ENOENT on clone

Added by Sage Weil over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

 -1542> 2016-11-08 21:42:03.424076 7f6da6f54700 10 filestore(/var/lib/ceph/osd/ceph-1) clone_range 1.7a_head/#1:5fe289c3:::smithi06215392-46:127# -> 1.7a_head/#1:5fe289c3:::smithi06215392-46:123# 0~3663992 to 0 = -2
 -1541> 2016-11-08 21:42:03.424081 7f6da6f54700 -1 filestore(/var/lib/ceph/osd/ceph-1)  error (2) No such file or directory not handled on operation 0x7f6dbf9b9a30 (13361.0.6, or op 6, counting from 0)
 -1540> 2016-11-08 21:42:03.424084 7f6da6f54700  0 filestore(/var/lib/ceph/osd/ceph-1) ENOENT on clone suggests osd bug
...
 -1509> 2016-11-08 21:42:03.424086 7f6da6f54700  0 filestore(/var/lib/ceph/osd/ceph-1)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#" 
        },
        {
            "op_num": 1,
            "op_name": "touch",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#" 
        },
        {
            "op_num": 2,
            "op_name": "truncate",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "offset": 3663992
        },
        {
            "op_num": 3,
            "op_name": "omap_setheader",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "header_length": "0" 
        },
        {
            "op_num": 4,
            "op_name": "op_setallochint",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "expected_object_size": "0",
            "expected_write_size": "0" 
        },
        {
            "op_num": 5,
            "op_name": "setattrs",
            "collection": "1.7a_head",
            "oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "attr_lens": {
                "_": 289,
                "__header": 58
            }
        },
        {
            "op_num": 6,
            "op_name": "clonerange2",
            "collection": "1.7a_head",
            "src_oid": "#1:5fe289c3:::smithi06215392-46:127#",
            "dst_oid": "#1:5fe289c3:::smithi06215392-46:123#",
            "src_offset": 0,
            "len": 3663992,
            "dst_offset": 0
        },
        {
            "op_num": 7,
            "op_name": "omap_setkeys",
            "collection": "meta",
            "oid": "#-1:c0371625:::snapmapper:0#",
            "attr_lens": {
                "OBJ_0000000000000001.AF74193C.123.smithi06215392-46..": 106
            }
        },
        {
            "op_num": 8,
            "op_name": "omap_setkeys",
            "collection": "meta",
            "oid": "#-1:c0371625:::snapmapper:0#",
            "attr_lens": {
                "MAP_0000000000000112_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000113_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000116_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_000000000000011E_0000000000000001.AF74193C.123.smithi06215392-46..": 70,
                "MAP_0000000000000123_0000000000000001.AF74193C.123.smithi06215392-46..": 70
            }
        },
        {
            "op_num": 9,
            "op_name": "omap_rmkeys",
            "collection": "1.7a_head",
            "oid": "#1:5e000000::::head#" 
        },
        {
            "op_num": 10,
            "op_name": "omap_setkeys",
            "collection": "1.7a_head",
            "oid": "#1:5e000000::::head#",
            "attr_lens": {
                "_info": 855
            }
        }
    ]
}
/a/sage-2016-11-08_20:40:20-rados:thrash-wip-sage-testing---basic-smithi/532700
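
For anyone reading the dump, a small Python sketch of how to line up the "op 6, counting from 0" message with the transaction. The JSON below is a trimmed copy of the dump above (ops 5 to 7, only the fields needed); `find_op` is a made-up helper name, not a Ceph tool. It suggests that the transaction rebuilds the :123 clone but never creates the :127 source, so :127 has to already exist in the store when the transaction is applied.

```python
import json

# Trimmed copy of the transaction dump above: ops 5-7, minimal fields only.
dump = json.loads("""
{"ops": [
  {"op_num": 5, "op_name": "setattrs",
   "oid": "#1:5fe289c3:::smithi06215392-46:123#"},
  {"op_num": 6, "op_name": "clonerange2",
   "src_oid": "#1:5fe289c3:::smithi06215392-46:127#",
   "dst_oid": "#1:5fe289c3:::smithi06215392-46:123#"},
  {"op_num": 7, "op_name": "omap_setkeys",
   "oid": "#-1:c0371625:::snapmapper:0#"}
]}
""")


def find_op(dump, op_num):
    """Hypothetical helper: return the op with the given op_num."""
    return next(op for op in dump["ops"] if op["op_num"] == op_num)


failed = find_op(dump, 6)
# Objects touched by earlier ops in this (trimmed) transaction.
touched_before = {op.get("oid") for op in dump["ops"] if op["op_num"] < 6}

print(failed["op_name"])
print("dst prepared earlier in txn:", failed["dst_oid"] in touched_before)
print("src prepared earlier in txn:", failed["src_oid"] in touched_before)
```

In the full dump, ops 0 through 5 all target the :123 object as well, so the same conclusion holds there.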

Related issues 6 (0 open, 6 closed)

Related to Ceph - Bug #15774: osd_op_queue_cut_off osd_op_queue debug_random generate assert failure. (Resolved, 05/09/2016)
Related to Ceph - Bug #18583: osd: calc_clone_subsets misuses try_read_lock vs missing (Resolved, Samuel Just, 01/18/2017)
Related to Ceph - Bug #18809: FAILED assert(object_contexts.empty()) (live on master only from Jan-Feb 2017, all other instances are different) (Resolved, Samuel Just, 02/03/2017)
Has duplicate Ceph - Bug #18373: osd: repop vs push race (Duplicate, Samuel Just, 12/30/2016)
Copied to Ceph - Backport #18581: jewel: osd: ENOENT on clone (Rejected, Alexey Sheplyakov)
Copied to Ceph - Backport #18610: kraken: osd: ENOENT on clone (Resolved, Nathan Cutler)

Actions #1

Updated by Xinze Chi over 7 years ago

A recovering object may rely on a snap object that is generated by a clone op in order to complete the recovery process. But that snap object may not yet have been applied to the store on the remote peer, because recovery ops have higher priority than clone ops.
So the remote peer may recover the object first and only then perform the clone op.
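
To make the ordering problem concrete, here is a toy Python model. It is not Ceph code: `ToyStore`, its methods, and the `obj:*` names are all made up for illustration. It only shows that a recovery step which copies from a clone fails with ENOENT (-2, as in the log) when it is applied before the clone op that creates its source.

```python
import errno


class ToyStore:
    """Hypothetical in-memory object store; not the FileStore API."""

    def __init__(self):
        self.objects = {}

    def write(self, oid, data):
        self.objects[oid] = bytes(data)
        return 0

    def clone(self, src, dst):
        # Stand-in for the clone op that creates the snap object.
        if src not in self.objects:
            return -errno.ENOENT
        self.objects[dst] = self.objects[src]
        return 0

    def clone_range(self, src, dst, src_off, length, dst_off):
        # Stand-in for clonerange2: copy a byte range from src into dst.
        if src not in self.objects:
            return -errno.ENOENT
        chunk = self.objects[src][src_off:src_off + length]
        old = self.objects.get(dst, b"").ljust(dst_off, b"\0")
        self.objects[dst] = old[:dst_off] + chunk + old[dst_off + len(chunk):]
        return 0


def run(order):
    """Apply the two ops on a fresh store in the given order."""
    store = ToyStore()
    store.write("obj:head", b"abcdefgh")
    ops = {
        "clone": lambda: store.clone("obj:head", "obj:127"),
        "recover": lambda: store.clone_range("obj:127", "obj:123", 0, 8, 0),
    }
    return [ops[name]() for name in order]


in_order = run(["clone", "recover"])
reordered = run(["recover", "clone"])
print("clone then recover:", in_order)
print("recover then clone:", reordered)
```

The second run is the case described above: the recovery step is applied first, finds no source object, and returns -ENOENT.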

Actions #2

Updated by Samuel Just over 7 years ago

  • Assignee set to David Zafman

David: can you take a look?

Actions #3

Updated by Samuel Just over 7 years ago

  • Related to Bug #15774: osd_op_queue_cut_off osd_op_queue debug_random generate assert failure. added
Actions #4

Updated by Samuel Just over 7 years ago

  • Priority changed from Urgent to Immediate
Actions #5

Updated by Samuel Just over 7 years ago

  • Assignee changed from David Zafman to Samuel Just
Actions #6

Updated by Samuel Just over 7 years ago

  • Priority changed from Immediate to High

Xinze Chi's diagnosis is entirely correct. The fix is actually pretty simple: ReplicatedBackend just needs to take RWState locks on clone sources. No fancy blocking is needed either, since we always have the option of simply not using that clone.
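
A rough sketch of that idea in Python, under stated assumptions: `ToyRWState` and `usable_clone_sources` are hypothetical names and do not mirror the real RWState or ReplicatedBackend interfaces. The point is only the shape of the approach: try to take a read lock on each candidate clone source without blocking, and leave out any source whose lock cannot be taken.

```python
class ToyRWState:
    """Hypothetical reader/writer state; not the real RWState API."""

    def __init__(self):
        self.readers = 0
        self.writer = False

    def try_read_lock(self):
        # Readers are refused while a writer (e.g. an in-flight clone) holds it.
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_write_lock(self):
        if self.writer or self.readers:
            return False
        self.writer = True
        return True

    def unlock_write(self):
        assert self.writer
        self.writer = False


def usable_clone_sources(candidates, locks):
    """Return the candidates we could read-lock; never block on the rest."""
    held = []
    for oid in candidates:
        state = locks.setdefault(oid, ToyRWState())
        if state.try_read_lock():
            held.append(oid)
    return held


locks = {"obj:127": ToyRWState(), "obj:11e": ToyRWState()}
# Pretend a clone op that creates obj:127 is still in flight.
assert locks["obj:127"].try_write_lock()

sources = usable_clone_sources(["obj:127", "obj:11e"], locks)
print(sources)

# Once the clone op has finished, obj:127 becomes usable as a source.
locks["obj:127"].unlock_write()
later = usable_clone_sources(["obj:127"], locks)
print(later)
```

Skipping a source costs some efficiency, since that data has to be sent in full rather than cloned locally, but it avoids having to wait on the in-flight op.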

Actions #7

Updated by Samuel Just over 7 years ago

  • Status changed from New to 7
Actions #8

Updated by Samuel Just over 7 years ago

  • Status changed from 7 to In Progress

https://github.com/athanatos/ceph/tree/wip-17831 -- it'll be a little bit before I have time to test and debug it.

Actions #9

Updated by Samuel Just over 7 years ago

  • Has duplicate Bug #18373: osd: repop vs push race added
Actions #10

Updated by Samuel Just over 7 years ago

  • Priority changed from High to Immediate

Sage saw this again, see http://tracker.ceph.com/issues/18373

Actions #11

Updated by Samuel Just over 7 years ago

  • Status changed from In Progress to 7
Actions #12

Updated by Samuel Just over 7 years ago

  • Status changed from 7 to Fix Under Review
Actions #14

Updated by Alexey Sheplyakov over 7 years ago

I guess the fix should be backported to jewel, since jewel does not lock the clone source.

Actions #15

Updated by Sage Weil over 7 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #16

Updated by Sage Weil over 7 years ago

  • Backport set to kraken,jewel
Actions #17

Updated by Alexey Sheplyakov over 7 years ago

Actions #18

Updated by Samuel Just over 7 years ago

  • Related to Bug #18583: osd: calc_clone_subsets misuses try_read_lock vs missing added
Actions #19

Updated by Nathan Cutler over 7 years ago

Actions #20

Updated by Samuel Just about 7 years ago

  • Related to Bug #18809: FAILED assert(object_contexts.empty()) (live on master only from Jan-Feb 2017, all other instances are different) added
Actions #21

Updated by Samuel Just about 7 years ago

  • Status changed from Pending Backport to Resolved
  • Backport deleted (kraken,jewel)

http://tracker.ceph.com/issues/18927 and http://tracker.ceph.com/issues/18809 were caused by this series, so I don't think we should backport it.
