Bug #8011

osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

Added by Samuel Just over 4 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
Start date: 04/07/2014
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: giant,firefly
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ceph version 0.78-600-g19f50b9 (19f50b9d7bbbb2cce3b599f3ed8a9fa32c3d4e53)
1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1c86) [0x7f8976]
2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x72f) [0x7fa5ff]
3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x2da) [0x7fb0ea]
4: (C_Flush::finish(int)+0xa7) [0x856e77]
5: (Context::complete(int)+0x9) [0x66ed59]
6: (Finisher::finisher_thread_entry()+0x1c0) [0x9a7050]
7: (()+0x7e9a) [0x7f14f9c60e9a]
8: (clone()+0x6d) [0x7f14f82213fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues

Related to Ceph - Bug #10689: osd/ReplicatedPG.cc: FAILED assert(soid < scrubber.start || soid >= scrubber.end) on deep scrub Duplicate 01/29/2015
Related to Ceph - Bug #10693: FAILED assert(soid < scrubber.start || soid >= scrubber.end): oi digest variant Resolved 01/29/2015
Duplicated by Ceph - Bug #8747: OSD crash on scrub:osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end) Duplicate 07/04/2014
Duplicated by Ceph - Bug #10433: OSD osd/ReplicatedPG.cc: 5540: FAILED assert(soid < scrubber.start || soid >= scrubber.end) Duplicate 12/26/2014

Associated revisions

Revision e66f2e36 (diff)
Added by Samuel Just over 4 years ago

ReplicatedPG: block scrub on blocked object contexts

Fixes: #8011
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>

Revision 0f3235d4 (diff)
Added by Samuel Just over 4 years ago

ReplicatedPG: block scrub on blocked object contexts

Fixes: #8011
Signed-off-by: Samuel Just <>
Reviewed-by: Sage Weil <>
(cherry picked from commit e66f2e36c06ca00c1147f922d3513f56b122a5c0)

Revision db4ccb04 (diff)
Added by Samuel Just over 4 years ago

ReplicatedPG: block scrub on blocked object contexts

Fixes: #8011
Backport: firefly
Signed-off-by: Samuel Just <>

Revision 74114771 (diff)
Added by Samuel Just over 4 years ago

ReplicatedPG: block scrub on blocked object contexts

Fixes: #8011
Backport: firefly
Signed-off-by: Samuel Just <>

Revision 29ee6fae (diff)
Added by Samuel Just over 4 years ago

ReplicatedPG: block scrub on blocked object contexts

Fixes: #8011
Backport: firefly
Signed-off-by: Samuel Just <>
(cherry picked from commit 7411477153219d66625a74c5886530029c516036)

Revision 9b26de3f (diff)
Added by Samuel Just almost 4 years ago

ReplicatedPG: fail a non-blocking flush if the object is being scrubbed

Fixes: #8011
Backport: firefly, giant
Signed-off-by: Samuel Just <>

Revision f8567398 (diff)
Added by Samuel Just over 3 years ago

ReplicatedPG: fail a non-blocking flush if the object is being scrubbed

Fixes: #8011
Backport: firefly, giant
Signed-off-by: Samuel Just <>
(cherry picked from commit 9b26de3f3653d38dcdfc5b97874089f19d2a59d7)

Revision 681c99fe (diff)
Added by Samuel Just over 3 years ago

ReplicatedPG: fail a non-blocking flush if the object is being scrubbed

Fixes: #8011
Backport: firefly, giant
Signed-off-by: Samuel Just <>
(cherry picked from commit 9b26de3f3653d38dcdfc5b97874089f19d2a59d7)

History

#1 Updated by Sage Weil over 4 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-08_14:01:14-rados:thrash-wip-7891-testing-basic-plana/178972

#2 Updated by Sage Weil over 4 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-13_09:43:35-rados:thrash-testing-testing-basic-plana/189166

#3 Updated by Sage Weil over 4 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-13_09:43:35-rados:thrash-testing-testing-basic-plana/189348

#4 Updated by Samuel Just over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to Samuel Just

#5 Updated by Samuel Just over 4 years ago

ReplicatedPG::do_op already does the right thing as far as blocking ops which may flush. What remains is to avoid flushing objects with blocked obcs.
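One way to read the first fix (revision title: "ReplicatedPG: block scrub on blocked object contexts") is as a check made before scrub claims a chunk of objects. The sketch below is a simplified model under stated assumptions, not the Ceph code: `ObcSketch`, `range_has_blocked_obc`, and the use of `std::string` in place of `hobject_t` are all hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an object context; only the "blocked" flag is modelled.
struct ObcSketch {
  bool blocked = false;  // e.g. a flush or promote is in flight on this object
};

// Returns true if any object in the half-open range [start, end) has a
// blocked context. In this model, scrub would wait rather than claim the range.
inline bool range_has_blocked_obc(const std::map<std::string, ObcSketch>& obcs,
                                  const std::string& start,
                                  const std::string& end) {
  for (auto it = obcs.lower_bound(start);
       it != obcs.end() && it->first < end; ++it) {
    if (it->second.blocked) {
      return true;
    }
  }
  return false;
}
```

With objects "a" (not blocked) and "b" (blocked), `range_has_blocked_obc(obcs, "a", "c")` is true, while the range `["a", "b")` is clear because the end bound is exclusive.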

#6 Updated by Samuel Just over 4 years ago

and to check that agent_work also does the right thing

#7 Updated by Samuel Just over 4 years ago

  • Status changed from In Progress to Testing

#8 Updated by Sage Weil over 4 years ago

  • Source changed from other to Q/A

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-18_21:29:10-rados:thrash-testing-testing-basic-plana/202157

#9 Updated by Samuel Just over 4 years ago

  • Status changed from Testing to Resolved

#10 Updated by Sage Weil over 4 years ago

  • Status changed from Resolved to Verified

this triggered again on c6ada53a146f3196e11f545cfc968fc21657aec6

0> 2014-05-02 11:26:12.053553 7fd1ee1b2700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fd1ee1b2700 time 2014-05-02 11:26:12.039772
osd/ReplicatedPG.cc: 5282: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-05-02_02:30:10-rados-master-testing-basic-plana/229437

#11 Updated by Samuel Just over 4 years ago

  • Status changed from Verified to Need Review

#12 Updated by Samuel Just over 4 years ago

  • Status changed from Need Review to Pending Backport

#13 Updated by Samuel Just over 4 years ago

  • Status changed from Pending Backport to Resolved

#14 Updated by Sage Weil over 4 years ago

  • Status changed from Resolved to Verified

see #8747 for a log of this happening on 0.80.3

#15 Updated by Sage Weil over 4 years ago

  • Assignee deleted (Samuel Just)

#16 Updated by Sage Weil about 4 years ago

  • Status changed from Verified to Can't reproduce

Pinged Dmitry to see if he is still seeing this or has a log.

#17 Updated by Dmitry Smirnov about 4 years ago

  • Status changed from Can't reproduce to Resolved

I'm unable to reproduce it any more, assuming fixed.

#18 Updated by Sage Weil almost 4 years ago

  • Status changed from Resolved to Verified

this popped up again: ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-11-17_02:32:01-rados-giant-distro-basic-multi/604740

#19 Updated by Samuel Just almost 4 years ago

  • Assignee set to Samuel Just

#20 Updated by Samuel Just almost 4 years ago

Urgh, non-blocking flushes do not cause scrub to pause. I think the simplest solution is to fail a non-blocking flush in try_flush_mark_clean if the object is being scrubbed.
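The decision described in this comment can be sketched as a small standalone model. Everything here is an assumption for illustration: `ScrubRangeSketch`, `try_flush_mark_clean_sketch`, the `std::string` object ids, and the choice of `-EAGAIN` as the failure code are hypothetical and are not taken from the actual patch.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical, simplified scrub state: the chunk being scrubbed is [start, end).
struct ScrubRangeSketch {
  bool active = false;
  std::string start;
  std::string end;

  bool contains(const std::string& oid) const {
    return active && oid >= start && oid < end;
  }
};

// Models only the decision, not the flush itself.
// Returns 0 if the flush may go on to mark the object clean, or -EAGAIN
// (an assumed error code) if a non-blocking flush hits an object that is
// currently inside the scrub range. Blocking flushes are assumed to have
// already held scrub off, so they are let through.
inline int try_flush_mark_clean_sketch(const ScrubRangeSketch& scrub,
                                       const std::string& oid,
                                       bool blocking) {
  if (!blocking && scrub.contains(oid)) {
    return -EAGAIN;
  }
  return 0;
}
```

For a scrub range `["b", "d")`, a non-blocking flush of "c" returns `-EAGAIN` in this model, while "a" and "d" fall outside the range and return 0.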

#21 Updated by Samuel Just almost 4 years ago

  • Status changed from Verified to Testing

#22 Updated by Samuel Just almost 4 years ago

  • Status changed from Testing to Pending Backport
  • Backport set to giant,firefly

#23 Updated by Sage Weil almost 4 years ago

  • Status changed from Pending Backport to Verified

Happened again, I believe with the latest fix applied.

ubuntu@teuthology:/a/teuthology-2015-01-25_23:10:02-knfs-next-testing-basic-multi/722944

#24 Updated by Yuri Weinstein almost 4 years ago

Also seen in run: http://pulpito.ceph.com/teuthology-2015-01-27_17:13:01-upgrade:firefly-x-next-distro-basic-multi/
Job: ['726068']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-01-27_17:13:01-upgrade:firefly-x-next-distro-basic-multi/726068/teuthology.log

2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool, bool)' thread 7fd534fd4700 time 2015-01-28 07:11:26.502343
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr:osd/ReplicatedPG.cc: 5943: FAILED assert(soid < scrubber.start || soid >= scrubber.end)
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xadb3df]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 2: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool, bool)+0x1bd3) [0x826f93]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x16d) [0x845d6d]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xa13) [0x846963]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2987) [0x8519a7]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7ea05f]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x6647ef]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x66524f]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x65c) [0xacab4c]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xacd720]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 11: (()+0x7e9a) [0x7fd54fe2de9a]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 12: (clone()+0x6d) [0x7fd54e5d8ccd]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#25 Updated by Irek Fasikhov almost 4 years ago

Yes, I have the same error.

[root@ceph04 ceph]# ceph -v
ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)

Fix will be in version 0.80.9? Thanks

2015-01-28 09:50:11.082466 7fbc1437b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fbc1437b700 time 2015-01-28 09:50:10.852829
osd/ReplicatedPG.cc: 5318: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x32f5) [0x889275]
 2: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x110f) [0x8903ef]
 3: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x78) [0x8e29b8]
 4: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x8b34f5]
 5: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x747) [0x885407]
 6: (C_Copyfrom::finish(int)+0xb7) [0x8e2777]
 7: (Context::complete(int)+0x9) [0x667209]
 8: (Finisher::finisher_thread_entry()+0x1d8) [0x9ed148]
 9: (()+0x79d1) [0x7fbc36e5f9d1]
 10: (clone()+0x6d) [0x7fbc35dd88fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---

#26 Updated by Samuel Just almost 4 years ago

Irek, you are seeing the previous incarnation of this bug. The relevant fix has not yet been backported to firefly.

#27 Updated by Samuel Just almost 4 years ago

This most recent incarnation is due to oi digests: we block_writes until the end of COMPARE_MAPS. The assert should not fire if !block_writes. Previously, we would always change scrubber.start to be the same as scrubber.end as we changed block_writes.
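The condition described here can be written out as a small model: the range check from the failing assert only matters while writes are blocked. This is a sketch under stated assumptions, not the actual patch; `ScrubStateSketch`, `write_allowed_by_scrub`, and `check_scrub_invariant` are hypothetical names, and `std::string` stands in for `hobject_t`.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified scrub state. The scrubbed chunk is [start, end).
struct ScrubStateSketch {
  bool block_writes = false;
  std::string start;
  std::string end;
};

// True when a write to soid is consistent with the scrub state:
// either scrub is not blocking writes, or soid lies outside [start, end).
inline bool write_allowed_by_scrub(const ScrubStateSketch& s,
                                   const std::string& soid) {
  return !s.block_writes || soid < s.start || soid >= s.end;
}

// The original assert with the extra !block_writes escape applied.
inline void check_scrub_invariant(const ScrubStateSketch& s,
                                  const std::string& soid) {
  assert(write_allowed_by_scrub(s, soid));
}
```

In this model, a write to "c" with range `["b", "d")` trips the check only while `block_writes` is true. Once `block_writes` is cleared, the check passes even though `start` and `end` still describe a non-empty range, which is the situation the comment attributes to oi digests.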

#28 Updated by Samuel Just almost 4 years ago

Making patch.

#29 Updated by Samuel Just almost 4 years ago

Also, this is an entirely distinct bug, so I'm making this Pending Backport again and opening a new one.

#30 Updated by Samuel Just almost 4 years ago

  • Status changed from Verified to Pending Backport

#31 Updated by Irek Fasikhov almost 4 years ago

Samuel.
yes, this is another mistake. http://tracker.ceph.com/issues/10433#note-3

#32 Updated by Loic Dachary over 3 years ago

  • Severity changed from 3 - minor to 2 - major

#35 Updated by Loic Dachary over 3 years ago

  • Status changed from Pending Backport to Resolved
