Bug #8889

osd/ReplicatedPG.cc: 5162: FAILED assert(got)

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-07-20_02:30:01-rados-next-testing-basic-plana/371321

This is in the base tier. The sequence seems to be something like:

- flush snap 3: we modify the old 3 and write a new head (this seems wrong?)
- delete head: fails to take the obc write lock on head (see the sketch after the backtrace)

     0> 2014-07-20 13:37:07.920267 7f35d708e700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f35d708e700 time 2014-07-20 13:37:07.916941
osd/ReplicatedPG.cc: 5162: FAILED assert(got)

 ceph version 0.82-391-g4a63396 (4a63396ba1611ed36cccc8c6d0f5e6e3e13d83ee)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x2892) [0x809b12]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x1aa) [0x8243aa]
 3: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xb6f) [0x8250ff]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x2d15) [0x82fee5]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x692) [0x7cb352]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ca) [0x65277a]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x6532c1]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa749bc]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa762a0]
 10: (()+0x7e9a) [0x7f35f093ee9a]
 11: (clone()+0x6d) [0x7f35eeeff3fd]
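
For context, a minimal sketch (hypothetical, not Ceph code) of the locking pattern behind the assert: finish_ctx() assumes the write lock on the object context (obc) was taken earlier for every object the transaction touches, so an earlier get_write() that quietly returns false only surfaces here as assert(got). The ObjectLock type and its fields below are simplified assumptions for illustration.

    // Simplified, hypothetical model of an obc write lock with an
    // anti-starvation rule: a writer is refused while a reader is waiting.
    #include <cassert>
    #include <iostream>

    struct ObjectLock {
      int readers = 0;          // active readers
      int writers = 0;          // active writer (0 or 1)
      int waiting_readers = 0;  // readers queued on this object

      // Grant the write lock only if the object is idle and no reader is
      // already waiting (the anti-starvation check).
      bool get_write() {
        if (readers == 0 && writers == 0 && waiting_readers == 0) {
          ++writers;
          return true;
        }
        return false;
      }
    };

    int main() {
      ObjectLock head_lock;
      head_lock.waiting_readers = 1;  // some operation is queued as a reader

      // The delete on head tries to take the obc write lock...
      bool got = head_lock.get_write();
      std::cout << "got=" << got << std::endl;

      // ...and later bookkeeping assumes it succeeded, the counterpart of
      // "FAILED assert(got)" in ReplicatedPG::finish_ctx().
      assert(got);  // fires, because the write lock was refused
      return 0;
    }

Comment #3 below identifies the waiting reader in this crash as backfill's read marker.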

#1

Updated by Greg Farnum over 9 years ago

Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?
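
To make the flush-then-protect idea concrete, here is a small self-contained model (not librados or OSD code) of the two steps: write the snapshot's content into the base-tier head under the old snap context, then send a null write carrying the updated snap context so the copy-on-write step preserves that content as the clone for the snapshot. The SnapContext/Object types and the write() behaviour below are simplified assumptions.

    // Toy model of "write with the old snap context, then protect with a
    // null write under the new snap context". Not real librados/OSD code.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct SnapContext {
      uint64_t seq;                 // most recent snapshot id
      std::vector<uint64_t> snaps;  // existing snapshots, newest first
    };

    struct Object {
      std::string head;                        // current (head) content
      uint64_t snap_seq = 0;                   // newest snapc.seq seen by a write
      std::map<uint64_t, std::string> clones;  // snap id -> preserved content

      // Roughly the make_writeable idea: if the write carries a newer snap
      // context than this object has seen, preserve the current head as a
      // clone before applying the (possibly empty) mutation.
      void write(const std::string* data, const SnapContext& snapc) {
        if (snapc.seq > snap_seq && !snapc.snaps.empty()) {
          clones[snapc.snaps.front()] = head;  // copy-on-write
          snap_seq = snapc.seq;
        }
        if (data)
          head = *data;
      }
    };

    int main() {
      Object base_tier_obj;
      base_tier_obj.snap_seq = 2;  // the object already reflects snaps 1 and 2

      // Step 1: flush snap 3's content using the *old* snap context (seq 2),
      // so no clone is made and the data lands in head.
      std::string snap3_content = "content as of snap 3";
      base_tier_obj.write(&snap3_content, SnapContext{2, {2, 1}});

      // Step 2: protect it with a null write carrying the *updated* snap
      // context; copy-on-write now preserves the flushed data as clone 3.
      base_tier_obj.write(nullptr, SnapContext{3, {3, 2, 1}});

      std::cout << "clone 3: " << base_tier_obj.clones.at(3) << std::endl;
      return 0;
    }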

#2

Updated by Sage Weil over 9 years ago

Greg Farnum wrote:

> Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?

I hadn't gotten that far :).

#3

Updated by Sage Weil over 9 years ago

  • Category set to OSD
  • Status changed from New to 7

- there is an in-progress copy_from
- backfill advances up to the snapdir object, sets backfill_pos, sets backfill_read_marker (and blocks!)
- write_copy_chunk deletes snapdir, writes head, make_writeable, takes the write lock (bug 1: but not on the snapdir object)
- a delete on the head, make_writeable, fails to get_write() on the obc (because of the backfill_read marker)

two bugs:

1- we should have failed earlier, during the snapdir deletion step
2- actually, neither step should fail: we should skip the anti-starvation check for backfill reads (or op waiters); see the sketch below
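
A sketch of what fix 2 could look like in the simplified lock model from the description above (again hypothetical, not the actual Ceph change): the anti-starvation check defers only to waiting client readers, so a queued backfill read (or op waiter) no longer prevents a writer from taking the lock.

    // Hypothetical variant of the earlier ObjectLock sketch: backfill reads
    // (and op waiters) are counted separately and do not block a new writer.
    #include <cassert>
    #include <iostream>

    struct ObjectLock {
      int readers = 0;
      int writers = 0;
      int waiting_client_readers = 0;
      int waiting_backfill_readers = 0;  // e.g. a backfill_read_marker

      bool get_write() {
        // Anti-starvation applies to waiting *client* readers only.
        if (readers == 0 && writers == 0 && waiting_client_readers == 0) {
          ++writers;
          return true;
        }
        return false;
      }
    };

    int main() {
      ObjectLock head_lock;
      head_lock.waiting_backfill_readers = 1;  // backfill is queued on this object

      // With the relaxed check, the delete on head can still take the write
      // lock, so finish_ctx()'s assert(got) would not trip in this scenario.
      bool got = head_lock.get_write();
      std::cout << "got=" << got << std::endl;
      assert(got);  // holds
      return 0;
    }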

#4

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
#5

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved