Bug #8694
OSD crashed (assertion failure) at FileStore::_collection_move_rename
Status: Closed
Description
Most recently, while the cluster was doing backfilling/recovery, we captured one OSD crash at FileStore::_collection_move_rename. The following is the full backtrace:
No symbol table info available.
#10 0x00000000009ed129 in ceph::__ceph_assert_fail (assertion=0x1d5c230 "\001", file=0xd23c5b0 "\320\300#\r", line=4454, func=0xbc4280 "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)") at common/assert.cc:77
        tss = <incomplete type>
        buf = "os/FileStore.cc: In function 'int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)' thread 7fb1ea55a700 time 2014-06-27 13:19:35.06167"...
        bt = 0x6174a80
        oss = <incomplete type>
#11 0x00000000008eec8f in FileStore::_collection_move_rename (this=0x1d78000, oldcid=..., oldoid=..., c=..., o=..., spos=...) at os/FileStore.cc:4454
        fd = std::tr1::shared_ptr (empty) 0x0
        __func__ = "_collection_move_rename"
        srccmp = -2
        __PRETTY_FUNCTION__ = "int FileStore::_collection_move_rename(coll_t, const ghobject_t&, coll_t, const ghobject_t&, const SequencerPosition&)"
        r = -2
        dstcmp = 1
#12 0x00000000008f3579 in FileStore::_do_transaction (this=0x1d78000, t=..., op_seq=<value optimized out>, trans_num=<value optimized out>, handle=0x7fb1ea559cb0) at os/FileStore.cc:2349
        oldcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        oldoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 18446744073709551615, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        newcid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = "meta"}, str = "3.5bfs6_head"}
        newoid = {hobj = {oid = {name = "default.5470.715__shadow_.KMVmfZ4wW3C8q_0UB_DIxAF-4HnzJ61_1"}, snap = {val = 18446744073709551614}, hash = 960337343, max = false, static POOL_IS_TEMP = -1, pool = 3, nspace = "", key = ""}, generation = 13947, shard_id = 6 '\006', static NO_SHARD = 255 '\377', static NO_GEN = 18446744073709551615}
        op = 38
        r = 0
        i = {p = {bl = 0x34f550a0, ls = 0x34f550a0, off = 2516, p = {_raw = , _off = 78260816, _len = 0}, p_off = 0}, sobject_encoding = false, pool_override = -1, use_pool_override = false, replica = false, _tolerate_collection_add_enoent = false}
        spos = {seq = 5351696, trans = 0, op = 6}
        __PRETTY_FUNCTION__ = "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)"
#13 0x00000000008fab34 in FileStore::_do_transactions (this=0x1d78000, tls=std::list = {...}, op_seq=5351696, handle=0x7fb1ea559cb0) at os/FileStore.cc:1868
        p = <value optimized out>
        r = <value optimized out>
        bytes = <value optimized out>
        ops = <value optimized out>
        trans_num = <value optimized out>
#14 0x00000000008fade1 in FileStore::_do_op (this=0x1d78000, osr=0x2f5b37a0, handle=...) at os/FileStore.cc:1698
        o = 0x32d65130
        r = <value optimized out>
#15 0x0000000000a8c301 in ThreadPool::worker (this=0x1d78d90, wt=0x1da6de0) at common/WorkQueue.cc:125
        tp_handle = {cct = 0x1d5c230, hb = 0x1db6630, grace = 60, suicide_grace = 180}
        item = 0x2f5b37a0
        wq = 0x1d78f18
        did = false
        ss = <incomplete type>
        hb = 0x1db6630
#16 0x0000000000a8f340 in ThreadPool::WorkThread::entry (this=<value optimized out>) at common/WorkQueue.h:317
No locals.
The log showed that it tried to open a non-existent file, which led to the crash we observed; not much verbose logging was captured at the time.
More information:
1. The pool is using EC (erasure coding)
2. Ceph version: ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
3. Restarting the OSD worked, with no further crashes
Updated by Greg Farnum almost 10 years ago
- Project changed from rgw to Ceph
- Category set to OSD
- Priority changed from Normal to High
Can you print the value of "r" in the _collection_move_rename frame?
Do you have a full OSD log from when this happened? Have you seen it more than once?
The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?
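Greg's description of the control flow can be sketched as follows. This is a hypothetical simplification, not the actual FileStore source: the function name mirrors the backtrace, but the signature and the replay flag are illustrative assumptions.

```cpp
#include <cerrno>

// Hedged sketch of the logic described above (NOT the real Ceph code):
// an ENOENT when opening the source object is tolerated only during
// journal replay, because the object may already have been moved before
// the crash; outside replay the real code asserts out (FileStore.cc:4454).
static int collection_move_rename_sketch(bool replaying, int open_result) {
  int r = open_result;            // result of opening the object to move
  if (r < 0) {
    if (replaying && r == -ENOENT)
      return 0;                   // already moved before the crash: fine
    return r;                     // real code: ceph_assert() crashes the OSD here
  }
  // ... perform the actual link/rename work ...
  return 0;
}
```

With `replaying=false` and an open result of `-ENOENT` the sketch reaches the fatal branch, matching the `r = -2` visible in frame #11 of the backtrace.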
Updated by Guang Yang almost 10 years ago
Greg Farnum wrote:
Can you print the value of "r" in the _collection_move_rename frame?
From the backtrace above, 'r' in that frame has the value -2 (ENOENT, 'No such file or directory').
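For readers decoding such values: Ceph internals return kernel-style negative errno codes, so -2 is -ENOENT. A small hypothetical helper (not part of Ceph) makes the translation explicit:

```cpp
#include <cstring>
#include <string>

// Hypothetical helper: translate a kernel-style negative errno return
// value (e.g. the r = -2 in the backtrace) into its human-readable text.
inline std::string errno_to_text(int r) {
  return std::strerror(-r);  // strerror(2) -> "No such file or directory"
}
```

Here `errno_to_text(-2)` yields "No such file or directory" on glibc, which is exactly the error the assert fired on.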
Do you have a full OSD log from when this happened? Have you seen it more than once?
Sadly we didn't have verbose logs when the crash happened. So far we have only seen it once; restarting helped and we never saw it afterwards.
The OSD tried to do a rename and got an error when trying to open the object to move, and then asserted out because it only allows that during replay (for ENOENT, because it already got moved). Did you look at dmesg to see if there were any warnings?
Sorry, I didn't check dmesg. I did check the system log (/var/log/all) for that time window and didn't find anything unusual.
Updated by Guang Yang almost 10 years ago
I am not sure whether this bug is related to http://tracker.ceph.com/issues/8733, but the failure pattern is quite similar; linking it here FYI.
We will also run tests once we bring in the fix for http://tracker.ceph.com/issues/8733, and if we come across this failure again I will drop a comment.
Updated by Sage Weil over 9 years ago
- Status changed from New to Duplicate