Bug #8889: osd/ReplicatedPG.cc: 5162: FAILED assert(got)
Status: Closed
Description
ubuntu@teuthology:/a/teuthology-2014-07-20_02:30:01-rados-next-testing-basic-plana/371321
This is in the base tier. The sequence seems to be something like:
- flush snap 3: we modify the old clone 3 and write a new head (this seems wrong?)
- delete head: fails to take the obc write lock on head
     0> 2014-07-20 13:37:07.920267 7f35d708e700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f35d708e700 time 2014-07-20 13:37:07.916941
osd/ReplicatedPG.cc: 5162: FAILED assert(got)

 ceph version 0.82-391-g4a63396 (4a63396ba1611ed36cccc8c6d0f5e6e3e13d83ee)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x2892) [0x809b12]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x1aa) [0x8243aa]
 3: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xb6f) [0x8250ff]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x2d15) [0x82fee5]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x692) [0x7cb352]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ca) [0x65277a]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x6532c1]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa749bc]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa762a0]
 10: (()+0x7e9a) [0x7f35f093ee9a]
 11: (clone()+0x6d) [0x7f35eeeff3fd]
Updated by Greg Farnum over 9 years ago
Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?
Updated by Sage Weil over 9 years ago
Greg Farnum wrote:
Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?
I hadn't gotten that far :).
Updated by Sage Weil over 9 years ago
- Category set to OSD
- Status changed from New to 7
- there is an in-progress copy_from
- backfill advances up to the snapdir object, sets backfill_pos, sets backfill_read_marker (and blocks!)
- write_copy_chunk deletes snapdir, writes head; make_writeable takes the write lock (1: but not on the snapdir object)
- a delete on the head, via make_writeable, fails to get_write() on the obc (because of backfill_read_marker)
Two bugs:
1. We should have failed earlier, during the snapdir deletion step.
2. Actually, neither step should fail: we should skip the anti-starvation check for backfill reads (or op waiters).
Updated by Sage Weil over 9 years ago
- Status changed from Pending Backport to Resolved