Bug #8694
OSD crashed (assertion failure) at FileStore::_collection_move_rename
Status: Closed
Description
Most recently, while the cluster was doing backfilling/recovery, we captured one OSD crash at FileStore::_collection_move_rename. The following is the full backtrace:
No symbol table info available.
#10 0x00000000009ed129 in ceph::__ceph_assert_fail (assertion=0x1d5c230 "\001", file=0xd23c5b0 "\320\300#\r", line=4454, func=0xbc4280 "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)") at common/assert.cc:77
        tss = <incomplete type>
        buf = "os/FileStore.cc: In function 'int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)' thread 7fb1ea55a700 time 2014-06-27 13:19:35.06167"...
        bt = 0x6174a80
        oss = <incomplete type>
#11 0x00000000008eec8f in FileStore::_collection_move_rename (this=0x1d78000, oldcid=..., oldoid=..., c=..., o=..., spos=...) at os/FileStore.cc:4454
        fd = std::tr1::shared_ptr (empty) 0x0
        __func__ = "_collection_move_rename"
        srccmp = -2
        __PRETTY_FUNCTION__ = "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)"
        r = -2
        dstcmp = 1
#12 0x00000000008f3579 in FileStore::_do_transaction (this=0x1d78000, t=..., op_seq=<value optimized out>, trans_num=<value optimized out>, handle=0x7fb1ea559cb0) at os/FileStore.cc:2349
        oldcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        oldoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 18446744073709551615, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        newcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        newoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 13947, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        op = 38
        r = 0
        i = {p = {bl = 0x34f550a0, ls = 0x34f550a0, off = 2516, p = {_raw = , _off = 78260816, _len = 0}, p_off = 0}, sobject_encoding = false, pool_override = -1, use_pool_override = false, replica = false, _tolerate_collection_add_enoent = false}
        spos = {seq = 5351696, trans = 0, op = 6}
        __PRETTY_FUNCTION__ = "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)"
#13 0x00000000008fab34 in FileStore::_do_transactions (this=0x1d78000, tls=std::list = {...}, op_seq=5351696, handle=0x7fb1ea559cb0) at os/FileStore.cc:1868
        p = <value optimized out>
        r = <value optimized out>
        bytes = <value optimized out>
        ops = <value optimized out>
        trans_num = <value optimized out>
#14 0x00000000008fade1 in FileStore::_do_op (this=0x1d78000, osr=0x2f5b37a0, handle=...) at os/FileStore.cc:1698
        o = 0x32d65130
        r = <value optimized out>
#15 0x0000000000a8c301 in ThreadPool::worker (this=0x1d78d90, wt=0x1da6de0) at common/WorkQueue.cc:125
        tp_handle = {cct = 0x1d5c230, hb = 0x1db6630, grace = 60, suicide_grace = 180}
        item = 0x2f5b37a0
        wq = 0x1d78f18
        did = false
        ss = <incomplete type>
        hb = 0x1db6630
#16 0x0000000000a8f340 in ThreadPool::WorkThread::entry (this=<value optimized out>) at common/WorkQueue.h:317
No locals.
The log showed that it tried to open a non-existent file, which led to the crash we observed; not much verbose logging was captured at the time.
More information:
1. The pool is using EC (erasure coding)
2. Ceph version: ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
3. Restarting the OSD worked, with no further crashes
Updated by Greg Farnum almost 10 years ago
- Project changed from rgw to Ceph
- Category set to OSD
- Priority changed from Normal to High
Can you print the value of "r" in the _collection_move_rename frame?
Do you have a full OSD log from when this happened? Have you seen it more than once?
The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?
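Greg's description of the control flow can be sketched as follows. This is a hypothetical simplification, not the actual FileStore source: the function name mirrors the backtrace, but the signature and the replay flag are illustrative assumptions.

```cpp
#include <cerrno>

// Hedged sketch of the logic described above (NOT the real Ceph code):
// an ENOENT when opening the source object is tolerated only during
// journal replay, because the object may already have been moved before
// the crash; outside replay the real code asserts out (FileStore.cc:4454).
static int collection_move_rename_sketch(bool replaying, int open_result) {
  int r = open_result;            // result of opening the object to move
  if (r < 0) {
    if (replaying && r == -ENOENT)
      return 0;                   // already moved before the crash: fine
    return r;                     // real code: ceph_assert() crashes the OSD here
  }
  // ... perform the actual link/rename work ...
  return 0;
}
```

With `replaying=false` and an open result of `-ENOENT` the sketch reaches the fatal branch, matching the `r = -2` visible in frame #11 of the backtrace.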
Updated by Guang Yang almost 10 years ago
Greg Farnum wrote:
Can you print the value of "r" in the _collection_move_rename frame?
From the backtrace above, 'r' in that frame has the value -2 (ENOENT, 'No such file or directory').
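For readers decoding such values: Ceph internals return kernel-style negative errno codes, so -2 is -ENOENT. A small hypothetical helper (not part of Ceph) makes the translation explicit:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: translate a kernel-style negative errno return
// value (e.g. the r = -2 in the backtrace) into its human-readable text.
inline std::string errno_to_text(int r) {
  return std::strerror(-r);  // strerror(2) -> "No such file or directory"
}
```

Here `errno_to_text(-2)` yields "No such file or directory" on glibc, which is exactly the error the assert fired on.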
Do you have a full OSD log from when this happened? Have you seen it more than once?
Sadly we didn't have verbose logs when the crash happened. So far we have only seen it once; restarting helped and we never saw it afterwards.
The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?
Sorry, I didn't check dmesg. I did check the system log (/var/log/all) for that time window and didn't find anything unusual.
Updated by Guang Yang almost 10 years ago
I am not sure whether this bug is related to http://tracker.ceph.com/issues/8733, but the failure pattern is quite similar; linking it here FYI.
We will also run tests once we bring in the fix for http://tracker.ceph.com/issues/8733, and if we come across this failure again I will drop a comment.
Updated by Sage Weil over 9 years ago
- Status changed from New to Duplicate