Bug #8694
OSD crashed (assertion failure) at FileStore::_collection_move_rename
Status: Duplicate
Priority: High
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Recently, while the cluster was backfilling/recovering, we captured one OSD crash at FileStore::_collection_move_rename. The full backtrace follows:
No symbol table info available.
#10 0x00000000009ed129 in ceph::__ceph_assert_fail (assertion=0x1d5c230 "\001", file=0xd23c5b0 "\320\300#\r", line=4454, func=0xbc4280 "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)") at common/assert.cc:77
        tss = <incomplete type>
        buf = "os/FileStore.cc: In function 'int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)' thread 7fb1ea55a700 time 2014-06-27 13:19:35.06167"...
        bt = 0x6174a80
        oss = <incomplete type>
#11 0x00000000008eec8f in FileStore::_collection_move_rename (this=0x1d78000, oldcid=..., oldoid=..., c=..., o=..., spos=...) at os/FileStore.cc:4454
        fd = std::tr1::shared_ptr (empty) 0x0
        __func__ = "_collection_move_rename"
        srccmp = -2
        __PRETTY_FUNCTION__ = "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)"
        r = -2
        dstcmp = 1
#12 0x00000000008f3579 in FileStore::_do_transaction (this=0x1d78000, t=..., op_seq=<value optimized out>, trans_num=<value optimized out>, handle=0x7fb1ea559cb0) at os/FileStore.cc:2349
        oldcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        oldoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 18446744073709551615, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        newcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        newoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 13947, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        op = 38
        r = 0
        i = {p = {bl = 0x34f550a0, ls = 0x34f550a0, off = 2516, p = {_raw = , _off = 78260816, _len = 0}, p_off = 0}, sobject_encoding = false, pool_override = -1, use_pool_override = false, replica = false, _tolerate_collection_add_enoent = false}
        spos = {seq = 5351696, trans = 0, op = 6}
        __PRETTY_FUNCTION__ = "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)"
#13 0x00000000008fab34 in FileStore::_do_transactions (this=0x1d78000, tls=std::list = {...}, op_seq=5351696, handle=0x7fb1ea559cb0) at os/FileStore.cc:1868
        p = <value optimized out>
        r = <value optimized out>
        bytes = <value optimized out>
        ops = <value optimized out>
        trans_num = <value optimized out>
#14 0x00000000008fade1 in FileStore::_do_op (this=0x1d78000, osr=0x2f5b37a0, handle=...) at os/FileStore.cc:1698
        o = 0x32d65130
        r = <value optimized out>
#15 0x0000000000a8c301 in ThreadPool::worker (this=0x1d78d90, wt=0x1da6de0) at common/WorkQueue.cc:125
        tp_handle = {cct = 0x1d5c230, hb = 0x1db6630, grace = 60, suicide_grace = 180}
        item = 0x2f5b37a0
        wq = 0x1d78f18
        did = false
        ss = <incomplete type>
        hb = 0x1db6630
#16 0x0000000000a8f340 in ThreadPool::WorkThread::entry (this=<value optimized out>) at common/WorkQueue.h:317
No locals.
The log showed that the OSD tried to open a non-existent file, which led to the crash we observed; not much verbose logging was captured at the time.
More information:
1. The pool is using erasure coding (EC).
2. Ceph version: 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
3. Restarting the OSD worked; it has not crashed again.