Bug #8747

OSD crash on scrub: osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

Added by Dmitry Smirnov over 5 years ago. Updated over 5 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

On 0.80.1 one OSD crashed several times as follows (full log attached):

osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7fb0ee9d0700 time 2014-07-05 06:21:00.105868
osd/ReplicatedPG.cc: 5297: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0xad8) [0x7fb10e0c35a8]
 2: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x48e) [0x7fb10e0c868e]
 3: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x64) [0x7fb10e131364]
 4: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x4df) [0x7fb10e0c74cf]
 5: (C_Copyfrom::finish(int)+0x12a) [0x7fb10e13125a]
 6: (Context::complete(int)+0x9) [0x7fb10df2d559]
 7: (Finisher::finisher_thread_entry()+0x1b8) [0x7fb10e2accd8]
 8: (()+0x80ca) [0x7fb10d40b0ca]
 9: (clone()+0x6d) [0x7fb10b91fffd]

ceph-osd.0.log.xz (131 KB) Dmitry Smirnov, 07/04/2014 05:42 PM


Related issues

Duplicates Ceph - Bug #8011: osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end) Resolved 04/07/2014

History

#1 Updated by Dmitry Smirnov over 5 years ago

#2 Updated by Dmitry Smirnov over 5 years ago

May be a duplicate of #8011

#3 Updated by Dmitry Smirnov over 5 years ago

I am using my local build of 0.80.1 with 29ee6faecb9e16c63acae8318a7c8f6b14367af7 (from the "firefly" branch) applied, yet this problem still occurred.

#4 Updated by Dmitry Smirnov over 5 years ago

  • Status changed from New to Closed

I found that two of the 12 OSDs were running 0.80.1 without the backported patch from #8011.
Interestingly, the affected OSD was patched.
I rebuilt Ceph from the head of the "firefly" branch and upgraded the whole cluster.
Since then I have not been able to reproduce the problem.
This bug appears to be fixed, so I'm closing it for now.

#5 Updated by Dmitry Smirnov over 5 years ago

  • Status changed from Closed to New

Re-opening as I just reproduced the issue. Sorry.
This happened again, probably during an attempt to repair an inconsistent PG.
Please advise.

#6 Updated by Dmitry Smirnov over 5 years ago

No improvement with 0.80.3: I'm still getting these crashes frequently on "deep-scrub" and "repair".
Sometimes two OSDs crash simultaneously.

#7 Updated by Dmitry Smirnov over 5 years ago

Although it takes up to an hour to reproduce, I seem to have a reliable way to do so.
I shall be happy to capture detailed logs (e.g. `debug osd = 20, debug filestore = 20, debug ms = 1`) if necessary.
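
For reference, the debug levels mentioned above could be written in ceph.conf roughly as follows. This is a minimal sketch: placing the settings under the `[osd]` section (so they apply to every OSD on the host) is an assumption, not something stated in this report, and the OSDs would need to pick up the new configuration before the next reproduction attempt.

```
[osd]
    # Verbose OSD logging, including scrub and promote activity
    debug osd = 20
    # Verbose object store (FileStore) logging
    debug filestore = 20
    # Basic messenger logging
    debug ms = 1
```

Logs at these levels grow quickly, so with a reproduction time of up to an hour it may be worth checking free disk space on the log partition first.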

#8 Updated by Sage Weil over 5 years ago

  • Status changed from New to Duplicate

see #8011

#9 Updated by Samuel Just over 5 years ago

Yeah, 8011 seems to be less dead than we thought, reopening.

#10 Updated by Dmitry Smirnov over 5 years ago

I can't reproduce any more on 0.80.5 + Firefly HEAD as of 2014-09-16...
