Bug #23145 (closed)

OSD crashes during recovery of EC pg

Added by Peter Woodman about 6 years ago. Updated almost 5 years ago.

Status: Duplicate
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: -
Tags: -
Backport: -
Regression: No
Severity: 1 - critical
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(RADOS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

I've got a cluster (running the released debs of ceph 12.2.3) whose OSDs started crashing on startup a little while ago. I didn't catch how this began, but initially three PGs would cause crashes when they started recovering. It's now down to one (10.1, an EC pool using jerasure/cauchy_good with k=5, m=2): whenever the cluster brings this PG online (which happens even with norecover and nobackfill set), the OSD crashes. I've tried different combinations of OSDs and they all crash in nearly the same way; only the shard varies. Below is a snippet of the crash:

    -3> 2018-02-27 08:26:03.850288 7f9f5f3b30 -1 bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (2) No such file or directory not handled on operation 30 (op 2, counting from 0)
    -2> 2018-02-27 08:26:03.850361 7f9f5f3b30 -1 bluestore(/var/lib/ceph/osd/ceph-5) ENOENT on clone suggests osd bug
    -1> 2018-02-27 08:26:03.850368 7f9f5f3b30  0 bluestore(/var/lib/ceph/osd/ceph-5)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "10.1s0_head",
            "oid": "0#10:88761799:::100004ecb94.00000000:head#" 
        },
        {
            "op_num": 1,
            "op_name": "truncate",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "offset": 8454144
        },
        {
            "op_num": 2,
            "op_name": "clonerange2",
            "collection": "10.1s0_head",
            "src_oid": "0#10:91d9aa55:::100004ecb90.00000000:head#489c",
            "dst_oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "src_offset": 8355840,
            "len": 98304,
            "dst_offset": 8355840
        },
        {
            "op_num": 3,
            "op_name": "remove",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#489c" 
        },
        {
            "op_num": 4,
            "op_name": "setattrs",
            "collection": "10.1s0_head",
            "oid": "0#10:91d9aa55:::100004ecb90.00000000:head#",
            "attr_lens": {
                "_": 275,
                "hinfo_key": 18,
                "snapset": 35
            }
        },
        {
            "op_num": 5,
            "op_name": "nop" 
        },
        {
            "op_num": 6,
            "op_name": "op_omap_rmkeyrange",
            "collection": "10.1s0_head",
            "oid": "0#10:80000000::::head#",
            "first": "0000017348.00000000000000018588",
            "last": "4294967295.18446744073709551615" 
        },
        {
            "op_num": 7,
            "op_name": "omap_setkeys",
            "collection": "10.1s0_head",
            "oid": "0#10:80000000::::head#",
            "attr_lens": {
                "_biginfo": 994,
                "_info": 943,
                "can_rollback_to": 12,
                "rollback_info_trimmed_to": 12
            }
        }
    ]
}

     0> 2018-02-27 08:26:03.860031 7f9f5f3b30 -1 /build/ceph-12.2.3/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f9f5f3b30 time 2018-02-27 08:26:03.850692
/build/ceph-12.2.3/src/os/bluestore/BlueStore.cc: 9404: FAILED assert(0 == "unexpected error")

 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xfc) [0x55866086ac]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x11f0) [0x55864dc050]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x408) [0x55864dcf70]
 4: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x158) [0x55860fd870]
 5: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x6c) [0x558608f4f4]
 6: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x3a4) [0x55860b3484]
 7: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x4c) [0x558611ad04]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa44) [0x558660e7d4]
 9: (ThreadPool::WorkThread::entry()+0x14) [0x558660f73c]
 10: (()+0x6fc4) [0x7fb26fcfc4]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
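
For reference, an erasure-code profile and pool matching the setup described above (jerasure, technique cauchy_good, k=5, m=2) can be created with commands along these lines; the profile name, pool name, and PG count below are placeholders, not values taken from this cluster:

    # placeholder names and PG count; only plugin/technique/k/m come from the report
    ceph osd erasure-code-profile set ec-k5m2 \
        plugin=jerasure technique=cauchy_good k=5 m=2
    ceph osd pool create ecpool 128 128 erasure ec-k5m2

    # the norecover/nobackfill flags mentioned above are toggled cluster-wide with
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd unset norecover
    ceph osd unset nobackfill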

I've uploaded a full log using ceph-post-file (uuid c1b689b0-b813-4e62-bb95-bacc14c376c7) with debug osd set to 20/20. Any clues about what's going on, or how to bypass it (even with data loss), would be great. This pool is a CephFS data pool, and I've had no success removing the affected files from the filesystem while the backing PG is down; trying to do so just causes the MDS to get stuck.
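
For anyone trying to map the affected objects back to CephFS paths: data-pool object names like 100004ecb90.00000000 encode the file's inode number in hex (here 0x100004ecb90, block 0), so with the filesystem mounted the owning file can be located by inode. A rough sketch, assuming a hypothetical mount point /mnt/cephfs and osd.5 as an example target for the debug setting mentioned above:

    # convert the hex inode from the object name to decimal, then search by inode
    # (a full find over a large tree can be slow)
    find /mnt/cephfs -inum "$(printf '%d' 0x100004ecb90)"

    # debug osd 20/20 can be injected at runtime on a running OSD, e.g.
    ceph tell osd.5 injectargs '--debug_osd 20/20'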


Files

log.tar.gz (185 KB), tao ning, 03/22/2018 11:41 AM

Related issues 2 (0 open, 2 closed)

Related to RADOS - Bug #24597: FAILED assert(0 == "ERROR: source must exist") in FileStore::_collection_move_rename() (Resolved, Sage Weil, 06/20/2018)

Has duplicate RADOS - Bug #24422: Ceph OSDs crashing in BlueStore::queue_transactions() using EC (Duplicate, 06/05/2018)
