Bug #8694


OSD crashed (assertion failure) at FileStore::_collection_move_rename

Added by Guang Yang almost 10 years ago. Updated over 9 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
Community (user)
Severity:
3 - minor

Description

Most recently, while the cluster was doing backfilling/recovery, we captured one OSD crash at FileStore::_collection_move_rename; the following is the full backtrace:

No symbol table info available.
#10 0x00000000009ed129 in ceph::__ceph_assert_fail (assertion=0x1d5c230 "\001", file=0xd23c5b0 "\320\300#\r", line=4454, 
    func=0xbc4280 "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)") at common/assert.cc:77
        tss = <incomplete type>
        buf = "os/FileStore.cc: In function 'int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)' thread 7fb1ea55a700 time 2014-06-27 13:19:35.06167"...
        bt = 0x6174a80
        oss = <incomplete type>
#11 0x00000000008eec8f in FileStore::_collection_move_rename (this=0x1d78000, oldcid=..., oldoid=..., c=..., o=..., spos=...) at os/FileStore.cc:4454
        fd = std::tr1::shared_ptr (empty) 0x0
        __func__ = "_collection_move_rename" 
        srccmp = -2
        __PRETTY_FUNCTION__ = "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)" 
        r = -2
        dstcmp = 1
#12 0x00000000008f3579 in FileStore::_do_transaction (this=0x1d78000, t=..., op_seq=<value optimized out>, trans_num=<value optimized out>, handle=0x7fb1ea559cb0) at os/FileStore.cc:2349
        oldcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        oldoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, 
          generation = 18446744073709551615, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        newcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        newoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, 
          generation = 13947, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        op = 38
        r = 0
        i = {p = {bl = 0x34f550a0, ls = 0x34f550a0, off = 2516, p = {_raw = , _off = 78260816, _len = 0}, p_off = 0}, sobject_encoding = false, pool_override = -1, use_pool_override = false, replica = false, 
          _tolerate_collection_add_enoent = false}
        spos = {seq = 5351696, trans = 0, op = 6}
        __PRETTY_FUNCTION__ = "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)" 
#13 0x00000000008fab34 in FileStore::_do_transactions (this=0x1d78000, tls=std::list = {...}, op_seq=5351696, handle=0x7fb1ea559cb0) at os/FileStore.cc:1868
        p = <value optimized out>
        r = <value optimized out>
        bytes = <value optimized out>
        ops = <value optimized out>
        trans_num = <value optimized out>
#14 0x00000000008fade1 in FileStore::_do_op (this=0x1d78000, osr=0x2f5b37a0, handle=...) at os/FileStore.cc:1698
        o = 0x32d65130
        r = <value optimized out>
#15 0x0000000000a8c301 in ThreadPool::worker (this=0x1d78d90, wt=0x1da6de0) at common/WorkQueue.cc:125
        tp_handle = {cct = 0x1d5c230, hb = 0x1db6630, grace = 60, suicide_grace = 180}
        item = 0x2f5b37a0
        wq = 0x1d78f18
        did = false
        ss = <incomplete type>
        hb = 0x1db6630
#16 0x0000000000a8f340 in ThreadPool::WorkThread::entry (this=<value optimized out>) at common/WorkQueue.h:317
No locals.

The log showed that the OSD tried to open a non-existent file, which led to the crash we observed; not much verbose logging was captured at the time.

More information:
1. The pool is erasure-coded (EC).
2. Ceph version: 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
3. Restarting the OSD worked; it has not crashed since.
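
One detail worth pulling out of the locals in frame #12: oldoid has generation = 18446744073709551615, which equals the NO_GEN sentinel printed in the same struct (i.e. UINT64_MAX), while newoid has generation = 13947. In other words, the transaction looks like it is renaming the current object to a generation-stamped name, which would be consistent with the rollback bookkeeping EC pools keep. A small standalone check of that sentinel value (illustrative only; the constants are copied from the backtrace above):

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    int main() {
      // Values copied from the frame #12 locals in the backtrace.
      const uint64_t oldoid_gen = 18446744073709551615ULL;  // ghobject_t::NO_GEN
      const uint64_t newoid_gen = 13947;
      std::printf("oldoid generation is NO_GEN (UINT64_MAX): %s\n",
                  oldoid_gen == UINT64_MAX ? "yes" : "no");
      std::printf("newoid generation: %" PRIu64 "\n", newoid_gen);
      return 0;
    }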


Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #8733: OSD crashed at void ECBackend::handle_sub_read (Resolved, 07/02/2014)

#1

Updated by Greg Farnum almost 10 years ago

  • Project changed from rgw to Ceph
  • Category set to OSD
  • Priority changed from Normal to High

Can you print the value of "r" in the _collection_move_rename frame?

Do you have a full OSD log from when this happened? Have you seen it more than once?
The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?
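
To make that concrete, here is a minimal sketch of the control flow described above (illustrative only; open_source_object is a hypothetical stand-in for FileStore's open path, and this is not the verbatim os/FileStore.cc source):

    #include <cassert>
    #include <cerrno>

    // Hypothetical stand-in for opening the source object; here it simply
    // reproduces the failure from the backtrace (r = -2 == -ENOENT).
    static int open_source_object() { return -ENOENT; }

    // Sketch of the error handling described above, not the real code.
    static int collection_move_rename_sketch(bool replaying) {
      int r = open_source_object();
      if (r < 0) {
        // ENOENT is tolerated only during journal replay, because the rename
        // may already have been applied before the previous shutdown.
        if (replaying && r == -ENOENT)
          return 0;
        assert(0 == "ERROR: source must exist");  // the kind of assert that fired here
      }
      // ... otherwise link the object into the destination collection and
      // unlink it from the source ...
      return r;
    }

    int main() {
      return collection_move_rename_sketch(true);  // replay path: tolerated
    }

Outside of replay there is no legitimate reason for the source object to be missing, which is why this fails an assert rather than returning an error.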

#2

Updated by Guang Yang almost 10 years ago

Greg Farnum wrote:

> Can you print the value of "r" in the _collection_move_rename frame?

From the backtrace above, 'r' in that frame has the value -2, which means 'No such file or directory' (ENOENT).
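
For reference, a standalone check of that errno mapping (illustrative only):

    #include <cstdio>
    #include <cstring>

    int main() {
      int r = -2;  // the value from frame #11 of the backtrace
      std::printf("r = %d: %s\n", r, std::strerror(-r));  // prints "No such file or directory"
      return 0;
    }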

> Do you have a full OSD log from when this happened? Have you seen it more than once?

Sadly we didn't have verbose logging enabled when the crash happened, and so far we have only seen it once; restarting helped and we never saw it again.

> The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?

Sorry, I didn't check dmesg. I did check the system log (/var/log/all) from that time and didn't find anything unusual.

#3

Updated by Guang Yang almost 10 years ago

I am not sure whether this bug is related to http://tracker.ceph.com/issues/8733, but the failure pattern is quite similar, so I am linking it here for reference.

We will also run tests once we bring in the fix for http://tracker.ceph.com/issues/8733; if we come across this failure again, I will drop a comment.

#4

Updated by Samuel Just almost 10 years ago

This is probably a duplicate of #8733.

#5

Updated by Sage Weil over 9 years ago

  • Status changed from New to Duplicate
