Bug #8889

osd/ReplicatedPG.cc: 5162: FAILED assert(got)

Added by Sage Weil over 9 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/teuthology-2014-07-20_02:30:01-rados-next-testing-basic-plana/371321

This is in the base tier. The sequence seems to be something like:

- flush snap 3: we modify the old 3 and write a new head (this seems wrong?)
- delete head: fails to take the obc write lock on head (see the sketch after the backtrace)

     0> 2014-07-20 13:37:07.920267 7f35d708e700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f35d708e700 time 2014-07-20 13:37:07.916941
osd/ReplicatedPG.cc: 5162: FAILED assert(got)

 ceph version 0.82-391-g4a63396 (4a63396ba1611ed36cccc8c6d0f5e6e3e13d83ee)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x2892) [0x809b12]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x1aa) [0x8243aa]
 3: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xb6f) [0x8250ff]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x2d15) [0x82fee5]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x692) [0x7cb352]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1ca) [0x65277a]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x6c1) [0x6532c1]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x6fc) [0xa749bc]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa762a0]
 10: (()+0x7e9a) [0x7f35f093ee9a]
 11: (clone()+0x6d) [0x7f35eeeff3fd]
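
For context, a minimal sketch (hypothetical, not Ceph code) of the locking pattern behind the assert: finish_ctx() assumes the write lock on the object context (obc) was taken earlier for every object the transaction touches, so an earlier get_write() that quietly returns false only surfaces here as assert(got). The ObjectLock type and its fields below are simplified assumptions for illustration.

    // Simplified, hypothetical model of an obc write lock with an
    // anti-starvation rule: a writer is refused while a reader is waiting.
    #include <cassert>
    #include <iostream>

    struct ObjectLock {
      int readers = 0;          // active readers
      int writers = 0;          // active writer (0 or 1)
      int waiting_readers = 0;  // readers queued on this object

      // Grant the write lock only if the object is idle and no reader is
      // already waiting (the anti-starvation check).
      bool get_write() {
        if (readers == 0 && writers == 0 && waiting_readers == 0) {
          ++writers;
          return true;
        }
        return false;
      }
    };

    int main() {
      ObjectLock head_lock;
      head_lock.waiting_readers = 1;  // some operation is queued as a reader

      // The delete on head tries to take the obc write lock...
      bool got = head_lock.get_write();
      std::cout << "got=" << got << std::endl;

      // ...and later bookkeeping assumes it succeeded, the counterpart of
      // "FAILED assert(got)" in ReplicatedPG::finish_ctx().
      assert(got);  // fires, because the write lock was refused
      return 0;
    }

Comment #3 below identifies the waiting reader in this crash as backfill's read marker.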

#1

Updated by Greg Farnum over 9 years ago

Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?
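
To make the flush-then-protect idea concrete, here is a small self-contained model (not librados or OSD code) of the two steps: write the snapshot's content into the base-tier head under the old snap context, then send a null write carrying the updated snap context so the copy-on-write step preserves that content as the clone for the snapshot. The SnapContext/Object types and the write() behaviour below are simplified assumptions.

    // Toy model of "write with the old snap context, then protect with a
    // null write under the new snap context". Not real librados/OSD code.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct SnapContext {
      uint64_t seq;                 // most recent snapshot id
      std::vector<uint64_t> snaps;  // existing snapshots, newest first
    };

    struct Object {
      std::string head;                        // current (head) content
      uint64_t snap_seq = 0;                   // newest snapc.seq seen by a write
      std::map<uint64_t, std::string> clones;  // snap id -> preserved content

      // Roughly the make_writeable idea: if the write carries a newer snap
      // context than this object has seen, preserve the current head as a
      // clone before applying the (possibly empty) mutation.
      void write(const std::string* data, const SnapContext& snapc) {
        if (snapc.seq > snap_seq && !snapc.snaps.empty()) {
          clones[snapc.snaps.front()] = head;  // copy-on-write
          snap_seq = snapc.seq;
        }
        if (data)
          head = *data;
      }
    };

    int main() {
      Object base_tier_obj;
      base_tier_obj.snap_seq = 2;  // the object already reflects snaps 1 and 2

      // Step 1: flush snap 3's content using the *old* snap context (seq 2),
      // so no clone is made and the data lands in head.
      std::string snap3_content = "content as of snap 3";
      base_tier_obj.write(&snap3_content, SnapContext{2, {2, 1}});

      // Step 2: protect it with a null write carrying the *updated* snap
      // context; copy-on-write now preserves the flushed data as clone 3.
      base_tier_obj.write(nullptr, SnapContext{3, {3, 2, 1}});

      std::cout << "clone 3: " << base_tier_obj.clones.at(3) << std::endl;
      return 0;
    }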

#2

Updated by Sage Weil over 9 years ago

Greg Farnum wrote:

> Maybe I misunderstand, but if we're flushing snapshot 3, we need to write it (using old snapcontext, obviously) and then protect it (i.e., send along a null write or something with updated snap context). Is that what you mean happened, or something else?

I hadn't gotten that far :).

#3

Updated by Sage Weil over 9 years ago

  • Category set to OSD
  • Status changed from New to 7

- there is an in-progress copy_from
- backfill advances up to the snapdir object, sets backfill_pos, sets backfill_read_marker (and blocks!)
- write_copy_chunk deletes snapdir, writes head, make_writeable, takes the write lock (bug 1: but not on the snapdir object)
- a delete on the head, make_writeable, fails to get_write() on the obc (because of the backfill_read marker)

two bugs:

1- we should have failed earlier, during the snapdir deletion step
2- actually, neither step should fail: we should skip the anti-starvation check for backfill reads (or op waiters); see the sketch below
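
A sketch of what fix 2 could look like in the simplified lock model from the description above (again hypothetical, not the actual Ceph change): the anti-starvation check defers only to waiting client readers, so a queued backfill read (or op waiter) no longer prevents a writer from taking the lock.

    // Hypothetical variant of the earlier ObjectLock sketch: backfill reads
    // (and op waiters) are counted separately and do not block a new writer.
    #include <cassert>
    #include <iostream>

    struct ObjectLock {
      int readers = 0;
      int writers = 0;
      int waiting_client_readers = 0;
      int waiting_backfill_readers = 0;  // e.g. a backfill_read_marker

      bool get_write() {
        // Anti-starvation applies to waiting *client* readers only.
        if (readers == 0 && writers == 0 && waiting_client_readers == 0) {
          ++writers;
          return true;
        }
        return false;
      }
    };

    int main() {
      ObjectLock head_lock;
      head_lock.waiting_backfill_readers = 1;  // backfill is queued on this object

      // With the relaxed check, the delete on head can still take the write
      // lock, so finish_ctx()'s assert(got) would not trip in this scenario.
      bool got = head_lock.get_write();
      std::cout << "got=" << got << std::endl;
      assert(got);  // holds
      return 0;
    }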

#4

Updated by Sage Weil over 9 years ago

  • Status changed from 7 to Pending Backport
#5

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved