Bug #15774

osd_op_queue_cut_off and osd_op_queue configured as debug_random generate an assert failure.

Added by shawn chen almost 8 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr:os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fcfc8537700 time 2016-05-05 15:26:58.536466
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr:os/filestore/FileStore.cc: 2912: FAILED assert(0 == "unexpected error")
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: ceph version 10.1.1-58-gaba9023 (aba9023d17e9a8a7eaa32740bc2e0257cbdb27db)
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fcfd75cd9bb]
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xeb5) [0x7fcfd72bd8d5]
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x7fcfd72c367b]
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x2b5) [0x7fcfd72c3965]
2016-05-05T15:26:58.539 INFO:tasks.ceph.osd.4.ott018.stderr: 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x7fcfd75bef1e]
2016-05-05T15:26:58.540 INFO:tasks.ceph.osd.4.ott018.stderr: 6: (ThreadPool::WorkThread::entry()+0x10) [0x7fcfd75bfe00]
2016-05-05T15:26:58.540 INFO:tasks.ceph.osd.4.ott018.stderr: 7: (()+0x8182) [0x7fcfd5a91182]
2016-05-05T15:26:58.540 INFO:tasks.ceph.osd.4.ott018.stderr: 8: (clone()+0x6d) [0x7fcfd3bbf47d]
2016-05-05T15:26:58.540 INFO:tasks.ceph.osd.4.ott018.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Related issues

Related to Ceph - Bug #17831: osd: ENOENT on clone (Resolved, 11/08/2016)

History

#1 Updated by shawn chen almost 8 years ago

The main cause is that the op ordering goes wrong when osd_op_queue and osd_op_queue_cut_off are configured as debug_random.
After reading the PrioritizedQueue.h code, I think the sequence that triggers this is:
1. enqueue_op of a repop (osd_op) that causes a copy-on-write clone; priority is 127.
2. enqueue_op of an MOSDPGPush that uses the clone generated in step 1; priority is 63.
If osd_op_queue_cut_off = high, both ops are put into the same queue, but the queue has an anti-starvation strategy, so the op order can be violated, which leads to the assert failure.

The relevant code is the following:

// if there are multiple buckets/subqueues with sufficient tokens,
// we behave like a strict priority queue among all subqueues that
// are eligible to run.
for (typename SubQueues::iterator i = queue.begin();
     i != queue.end();
     ++i) {
  assert(!(i->second.empty()));
  if (i->second.front().first < i->second.num_tokens()) {
    T ret = i->second.front().second;
    unsigned cost = i->second.front().first;
    i->second.take_tokens(cost);
    i->second.pop_front();
    if (i->second.empty()) {
      remove_queue(i->first);
    }
    distribute_tokens(cost);
    return ret;
  }
}
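
To make the reordering concrete, here is a minimal, self-contained sketch of the mechanism (my own simplified code, not the actual PrioritizedQueue; the class and names are invented for illustration). The point it shows is that the token pass can hand out a lower-priority op ahead of a higher-priority op that was enqueued earlier, which is exactly the repop/push inversion described above.

#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <utility>

struct TokenQueue {
  // One subqueue per priority: a token count plus a FIFO of (cost, op name).
  struct SubQueue {
    unsigned tokens = 0;
    std::deque<std::pair<unsigned, std::string>> ops;
  };
  std::map<unsigned, SubQueue> subqueues;   // iterated lowest priority first

  void enqueue(unsigned prio, unsigned cost, const std::string& op) {
    subqueues[prio].ops.push_back({cost, op});
  }
  void add_tokens(unsigned prio, unsigned n) {
    subqueues[prio].tokens += n;
  }

  // Token pass first (anti-starvation): any subqueue whose front op fits in
  // its tokens is served, regardless of higher-priority subqueues.  Only if
  // no subqueue has enough tokens do we fall back to strict priority order.
  std::string dequeue() {
    for (auto& [prio, sq] : subqueues) {
      if (!sq.ops.empty() && sq.ops.front().first <= sq.tokens) {
        sq.tokens -= sq.ops.front().first;
        std::string ret = sq.ops.front().second;
        sq.ops.pop_front();
        return ret;
      }
    }
    for (auto it = subqueues.rbegin(); it != subqueues.rend(); ++it) {
      if (!it->second.ops.empty()) {
        std::string ret = it->second.ops.front().second;
        it->second.ops.pop_front();
        return ret;
      }
    }
    return "";
  }
};

int main() {
  TokenQueue q;
  q.enqueue(127, 100, "repop: create clone (cow)");        // step 1, priority 127
  q.enqueue(63, 10, "MOSDPGPush: push using that clone");  // step 2, priority 63
  q.add_tokens(63, 50);   // the low-priority subqueue happens to hold tokens
  // The push comes out first because its subqueue has sufficient tokens,
  // even though the repop that creates its source clone was enqueued earlier.
  std::cout << q.dequeue() << std::endl;   // MOSDPGPush: push using that clone
  std::cout << q.dequeue() << std::endl;   // repop: create clone (cow)
  return 0;
}

With osd_op_queue_cut_off = high, both the priority-127 repop and the priority-63 push go through this token-based path, so the push can be dequeued before the repop that creates its source clone.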

#2 Updated by Samuel Just over 7 years ago

If I understand this correctly, the bug is that with the strict cutoff at 127, it's possible for a push on a newly created object during backfill to end up ordered ahead of the repop which creates it. I expect the sequence is:
1) A repop is issued which creates a clone due to cow.
2) Backfill catches up to that object and sends a push on another clone, using the clone created in 1) as a source for clone_range for some common extents.

The bug ultimately is that we are sending a push with clone sources which we are not holding ObjectContext locks on. The interface between ReplicatedBackend and ReplicatedPG will need to be extended a bit so that ReplicatedBackend won't use any source objects it can't lock. Those obcs will then need to be added to the set to unlock when we finish the push.
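
A rough sketch of that idea follows (with made-up, simplified types; this is not the real ReplicatedBackend/ReplicatedPG interface): before building a push, try to take the ObjectContext lock of every clone source, drop any source that can't be locked, and remember the taken locks so they are released when the push completes.

#include <map>
#include <memory>
#include <string>
#include <vector>

// Toy stand-ins for the real classes; names are illustrative only.
struct ObjectContext {
  bool locked = false;
  bool try_get_read_lock() { if (locked) return false; locked = true; return true; }
  void put_read_lock() { locked = false; }
};
using ObjectContextRef = std::shared_ptr<ObjectContext>;

struct PushOp {
  std::vector<std::string> clone_sources;          // objects usable as clone_range sources
  std::vector<ObjectContextRef> locks_to_release;  // obcs to unlock when the push finishes
};

// Keep only the clone sources whose ObjectContext can be locked right now; a
// source that cannot be locked (e.g. a clone created by a repop that has not
// been applied yet) is simply not used as a source.
void filter_clone_sources(PushOp& op,
                          std::map<std::string, ObjectContextRef>& obcs) {
  std::vector<std::string> usable;
  for (const auto& src : op.clone_sources) {
    auto it = obcs.find(src);
    if (it != obcs.end() && it->second->try_get_read_lock()) {
      usable.push_back(src);
      op.locks_to_release.push_back(it->second);   // unlock on push completion
    }
  }
  op.clone_sources.swap(usable);
}

// Release the source locks once the push has finished.
void on_push_complete(PushOp& op) {
  for (auto& obc : op.locks_to_release)
    obc->put_read_lock();
  op.locks_to_release.clear();
}

int main() {
  std::map<std::string, ObjectContextRef> obcs;
  obcs["clone_1"] = std::make_shared<ObjectContext>();
  obcs["clone_1"]->try_get_read_lock();   // still held by an in-flight repop
  PushOp op{{"clone_1"}, {}};
  filter_clone_sources(op, obcs);
  // clone_1 is no longer a clone source because its obc could not be locked.
  return static_cast<int>(op.clone_sources.size());   // 0
}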

#3 Updated by Samuel Just over 7 years ago

  • Related to Bug #17831: osd: ENOENT on clone added

#4 Updated by Josh Durgin almost 7 years ago

  • Status changed from New to Resolved
