Bug #8011

closed

osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

Added by Samuel Just about 10 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: giant,firefly
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ceph version 0.78-600-g19f50b9 (19f50b9d7bbbb2cce3b599f3ed8a9fa32c3d4e53)
1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1c86) [0x7f8976]
2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x72f) [0x7fa5ff]
3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x2da) [0x7fb0ea]
4: (C_Flush::finish(int)+0xa7) [0x856e77]
5: (Context::complete(int)+0x9) [0x66ed59]
6: (Finisher::finisher_thread_entry()+0x1c0) [0x9a7050]
7: (()+0x7e9a) [0x7f14f9c60e9a]
8: (clone()+0x6d) [0x7f14f82213fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Related issues 4 (0 open, 4 closed)

Related to Ceph - Bug #10689: osd/ReplicatedPG.cc: FAILED assert(soid < scrubber.start || soid >= scrubber.end) on deep scrub (Duplicate, 01/29/2015)
Related to Ceph - Bug #10693: FAILED assert(soid < scrubber.start || soid >= scrubber.end): oi digest variant (Resolved, Samuel Just, 01/29/2015)
Has duplicate Ceph - Bug #8747: OSD crash on scrub: osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end) (Duplicate, 07/04/2014)
Has duplicate Ceph - Bug #10433: OSD osd/ReplicatedPG.cc: 5540: FAILED assert(soid < scrubber.start || soid >= scrubber.end) (Duplicate, 12/26/2014)

Actions #1

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-08_14:01:14-rados:thrash-wip-7891-testing-basic-plana/178972

Actions #2

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-13_09:43:35-rados:thrash-testing-testing-basic-plana/189166

Actions #3

Updated by Sage Weil about 10 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-13_09:43:35-rados:thrash-testing-testing-basic-plana/189348

Actions #4

Updated by Samuel Just about 10 years ago

  • Status changed from New to In Progress
  • Assignee set to Samuel Just
Actions #5

Updated by Samuel Just about 10 years ago

ReplicatedPG::do_op already does the right thing as far as blocking ops which may flush. What remains is to avoid flushing objects with blocked obcs.

Actions #6

Updated by Samuel Just about 10 years ago

and to check that agent_work also does the right thing
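
A rough sketch of the remaining work described in the last two comments: whichever path picks objects to flush (a client op or agent_work) skips any object whose obc is blocked. The type `ObcSketch` and the function `pick_flush_candidates` are hypothetical names made up for illustration; this is not Ceph's ObjectContext or the actual patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an object context (obc); not Ceph's ObjectContext.
struct ObcSketch {
  std::string oid;
  bool dirty;    // object has data that still needs to be flushed
  bool blocked;  // ops on this object are currently held back (e.g. by scrub)
};

// Return the ids of objects that are safe to flush: dirty and not blocked.
std::vector<std::string> pick_flush_candidates(const std::vector<ObcSketch>& obcs) {
  std::vector<std::string> out;
  for (const auto& obc : obcs) {
    if (obc.dirty && !obc.blocked) {
      out.push_back(obc.oid);
    }
  }
  return out;
}
```

With three objects where only one is both dirty and unblocked, only that one is returned; a blocked object stays dirty and would be picked up on a later pass.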

Actions #7

Updated by Samuel Just about 10 years ago

  • Status changed from In Progress to 7
Actions #8

Updated by Sage Weil about 10 years ago

  • Source changed from other to Q/A

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2014-04-18_21:29:10-rados:thrash-testing-testing-basic-plana/202157

Actions #9

Updated by Samuel Just almost 10 years ago

  • Status changed from 7 to Resolved
Actions #10

Updated by Sage Weil almost 10 years ago

  • Status changed from Resolved to 12

this triggered again on c6ada53a146f3196e11f545cfc968fc21657aec6

0> 2014-05-02 11:26:12.053553 7fd1ee1b2700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fd1ee1b2700 time 2014-05-02 11:26:12.039772
osd/ReplicatedPG.cc: 5282: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-05-02_02:30:10-rados-master-testing-basic-plana/229437

Actions #11

Updated by Samuel Just almost 10 years ago

  • Status changed from 12 to Fix Under Review
Actions #12

Updated by Samuel Just almost 10 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Samuel Just almost 10 years ago

  • Status changed from Pending Backport to Resolved
Actions #14

Updated by Sage Weil almost 10 years ago

  • Status changed from Resolved to 12

see #8747 for a log of this happening on 0.80.3

Actions #15

Updated by Sage Weil almost 10 years ago

  • Assignee deleted (Samuel Just)
Actions #16

Updated by Sage Weil over 9 years ago

  • Status changed from 12 to Can't reproduce

Pinged Dmitry to see if he is still seeing this or has a log.

Actions #17

Updated by Dmitry Smirnov over 9 years ago

  • Status changed from Can't reproduce to Resolved

I'm unable to reproduce it any more, assuming fixed.

Actions #18

Updated by Sage Weil over 9 years ago

  • Status changed from Resolved to 12

this popped up again: ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-11-17_02:32:01-rados-giant-distro-basic-multi/604740

Actions #19

Updated by Samuel Just over 9 years ago

  • Assignee set to Samuel Just
Actions #20

Updated by Samuel Just over 9 years ago

Urgh, non-blocking flushes do not cause scrub to pause. I think the simplest solution is to fail a non-blocking flush in try_flush_mark_clean if the object is being scrubbed.
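
A minimal sketch of the idea in this comment, failing a non-blocking flush when its object falls inside the range being scrubbed. `ScrubWindowSketch`, `finish_nonblocking_flush`, and the choice of `-EAGAIN` as the error code are assumptions made for illustration; object ids are modelled as plain strings rather than hobject_t, and this is not the code that was merged.

```cpp
#include <cassert>
#include <cerrno>
#include <string>

// Hypothetical model of the scrubber's current chunk: the half-open range [start, end).
struct ScrubWindowSketch {
  bool active;        // true while a scrub chunk is in progress
  std::string start;  // first object id in the chunk
  std::string end;    // one past the last object id in the chunk

  bool covers(const std::string& soid) const {
    return active && soid >= start && soid < end;
  }
};

// Decide whether a non-blocking flush may mark its object clean.
// Returns 0 to proceed, or -EAGAIN (assumed error code) to fail the flush
// because the object is currently being scrubbed.
int finish_nonblocking_flush(const ScrubWindowSketch& scrub, const std::string& soid) {
  if (scrub.covers(soid)) {
    return -EAGAIN;
  }
  return 0;
}
```

The check mirrors the failing assertion: the flush only proceeds when `soid < start || soid >= end`, so finish_ctx is never reached for an object inside the scrubbed range.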

Actions #21

Updated by Samuel Just over 9 years ago

  • Status changed from 12 to 7
Actions #22

Updated by Samuel Just over 9 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to giant,firefly
Actions #23

Updated by Sage Weil about 9 years ago

  • Status changed from Pending Backport to 12

Happened again, I believe with the latest fix applied.

ubuntu@teuthology:/a/teuthology-2015-01-25_23:10:02-knfs-next-testing-basic-multi/722944

Actions #24

Updated by Yuri Weinstein about 9 years ago

Also seen in run: http://pulpito.ceph.com/teuthology-2015-01-27_17:13:01-upgrade:firefly-x-next-distro-basic-multi/
Job: ['726068']
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-01-27_17:13:01-upgrade:firefly-x-next-distro-basic-multi/726068/teuthology.log

2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr:osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool, bool)' thread 7fd534fd4700 time 2015-01-28 07:11:26.502343
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr:osd/ReplicatedPG.cc: 5943: FAILED assert(soid < scrubber.start || soid >= scrubber.end)
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr: ceph version 0.91-388-g5064787 (50647876971a2fe65a02e4de3c0bc62fec4887c4)
2015-01-28T07:11:26.339 INFO:tasks.ceph.osd.0.burnupi16.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xadb3df]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 2: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool, bool)+0x1bd3) [0x826f93]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x16d) [0x845d6d]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 4: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xa13) [0x846963]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x2987) [0x8519a7]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x63f) [0x7ea05f]
2015-01-28T07:11:26.340 INFO:tasks.ceph.osd.0.burnupi16.stderr: 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x6647ef]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x66524f]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x65c) [0xacab4c]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xacd720]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 11: (()+0x7e9a) [0x7fd54fe2de9a]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: 12: (clone()+0x6d) [0x7fd54e5d8ccd]
2015-01-28T07:11:26.341 INFO:tasks.ceph.osd.0.burnupi16.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #25

Updated by Irek Fasikhov about 9 years ago

Yes, I have the same error.

[root@ceph04 ceph]# ceph -v
ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)

Will the fix be in version 0.80.9? Thanks.

2015-01-28 09:50:11.082466 7fbc1437b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fbc1437b700 time 2015-01-28 09:50:10.852829
osd/ReplicatedPG.cc: 5318: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x32f5) [0x889275]
 2: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x110f) [0x8903ef]
 3: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x78) [0x8e29b8]
 4: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x8b34f5]
 5: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x747) [0x885407]
 6: (C_Copyfrom::finish(int)+0xb7) [0x8e2777]
 7: (Context::complete(int)+0x9) [0x667209]
 8: (Finisher::finisher_thread_entry()+0x1d8) [0x9ed148]
 9: (()+0x79d1) [0x7fbc36e5f9d1]
 10: (clone()+0x6d) [0x7fbc35dd88fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
Actions #26

Updated by Samuel Just about 9 years ago

Irek, you are seeing the previous incarnation of this bug. The relevant fix has not yet been backported to firefly.

Actions #27

Updated by Samuel Just about 9 years ago

This most recent incarnation is due to oi digests: we block_writes until the end of COMPARE_MAPS. The assert should not fire if !block_writes. Previously, we would always change scrubber.start to be the same as scrubber.end as we changed block_writes.
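
As an illustration of "the assert should not fire if !block_writes", here is a sketch of a relaxed check that only enforces the range condition while writes are blocked. `ScrubStateSketch` and `write_allowed` are hypothetical names, object ids are modelled as strings, and the actual patch may be structured differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of the scrubber fields named in the assertion.
struct ScrubStateSketch {
  std::string start;  // start of the chunk being scrubbed
  std::string end;    // end of the chunk (exclusive)
  bool block_writes;  // true while writes into [start, end) must be held back
};

// The condition a relaxed assertion would check before completing a write.
bool write_allowed(const ScrubStateSketch& s, const std::string& soid) {
  if (!s.block_writes) {
    return true;  // range check is irrelevant once writes are no longer blocked
  }
  return soid < s.start || soid >= s.end;
}
```

Under this sketch, a write to an object inside the range is rejected only while `block_writes` is set. That corresponds to the earlier behaviour the comment describes, where start was moved to end at the same time block_writes changed.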

Actions #28

Updated by Samuel Just about 9 years ago

Making patch.

Actions #29

Updated by Samuel Just about 9 years ago

Also, this is an entirely distinct bug, so I'm making this Pending Backport again and opening a new one.

Actions #30

Updated by Samuel Just about 9 years ago

  • Status changed from 12 to Pending Backport
Actions #31

Updated by Irek Fasikhov about 9 years ago

Samuel,
yes, this is a different bug. http://tracker.ceph.com/issues/10433#note-3

Actions #32

Updated by Loïc Dachary about 9 years ago

  • Severity changed from 3 - minor to 2 - major
Actions #35

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to Resolved