Bug #22672

OSDs frequently segfault in PrimaryLogPG::find_object_context() with empty clone_snaps vector on tier pool

Added by David Disseldorp 12 days ago. Updated 6 days ago.

Status: Triaged
Priority: Normal
Assignee: -
Category: Tiering
Target version: -
Start date: 01/12/2018
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc: No
Component(RADOS):

Description

Environment is a Luminous cache-tiered deployment with some of the hot-tier OSDs converted to bluestore. The remaining OSDs remain filestore.

OSDs are segfaulting in:
int PrimaryLogPG::find_object_context(const hobject_t& oid,
                                      ObjectContextRef *pobc,
                                      bool can_create,
                                      bool map_snapid_to_clone,
                                      hobject_t *pmissing)
...
  // clone
  dout(20) << "find_object_context " << soid
           << " snapset " << obc->ssc->snapset
           << " legacy_snaps " << obc->obs.oi.legacy_snaps
           << dendl;
  snapid_t first, last;
  if (obc->ssc->snapset.is_legacy()) {
    first = obc->obs.oi.legacy_snaps.back();
    last = obc->obs.oi.legacy_snaps.front();
  } else {
    auto p = obc->ssc->snapset.clone_snaps.find(soid.snap);
    assert(p != obc->ssc->snapset.clone_snaps.end());
    first = p->second.back();  // <------------- here
    last = p->second.front();
  }

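The clone_snaps map has a single entry for soid.snap, so the assert passes, but the vector stored under that key is unexpectedly empty. Calling back() on an empty vector is undefined behavior, which leads to the segfault above.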

(gdb) p ((struct ObjectContext *)0x55678130e100)->ssc->snapset->clones
$3 = std::vector of length 1, capacity 1 = {{val = 65215}}
(gdb) p soid.snap
$19 = {val = 65215}
(gdb) p ((struct ObjectContext *)0x55678130e100)->ssc->snapset->clone_snaps
$2 = std::map with 1 element = {[{val = 65215}] = std::vector of length 0, capacity 0}
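
For reference, the inconsistency visible in the gdb output can be expressed as a small standalone check. This is only a sketch and not Ceph code: SnapSetSketch, find_inconsistent_clones() and self_check() are hypothetical names, and plain uint64_t stands in for snapid_t. It flags every clone listed in clones whose clone_snaps entry is either missing or empty.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Simplified stand-in for the two SnapSet fields printed in gdb.
// Hypothetical type; the real SnapSet and snapid_t are not used here.
struct SnapSetSketch {
  std::vector<uint64_t> clones;
  std::map<uint64_t, std::vector<uint64_t>> clone_snaps;
};

// Returns the clones whose clone_snaps entry is missing or empty.
std::vector<uint64_t> find_inconsistent_clones(const SnapSetSketch& ss) {
  std::vector<uint64_t> bad;
  for (uint64_t clone : ss.clones) {
    auto p = ss.clone_snaps.find(clone);
    if (p == ss.clone_snaps.end() || p->second.empty()) {
      bad.push_back(clone);
    }
  }
  return bad;
}

// Rebuilds the state seen in the core dump and checks that it is flagged.
void self_check() {
  SnapSetSketch ss;
  ss.clones = {65215};
  ss.clone_snaps[65215] = {};  // key present, vector empty
  std::vector<uint64_t> bad = find_inconsistent_clones(ss);
  assert(bad.size() == 1);
  assert(bad[0] == 65215);
}
```

A check of this shape only detects the bad state; it says nothing about which code path wrote the empty vector.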

I'm currently investigating PrimaryLogPG::make_writeable() as a possible culprit for the bogus clone_snaps content, but figured I'd ask the upstream brains trust before diving deeper.

This code path appears to have been recently changed via:
commit 7f90c723949eebba9b9233ffdf3ea54efaca46aa
Author: Sage Weil <>
Date: Wed Feb 22 14:35:20 2017 -0600

osd: store clone list in SnapSet

History

#2 Updated by Greg Farnum 11 days ago

That looks like a good way to investigate. We've seen a few reports of issues with cache tier snapshots since that rewrite, but there hasn't been anything conclusive yet and it hasn't appeared in our regular testing. :/

#3 Updated by Greg Farnum 11 days ago

  • Project changed from Ceph to RADOS
  • Subject changed from OSDs frequently segfault in PrimaryLogPG::find_object_context() with empty clone_snaps vector to OSDs frequently segfault in PrimaryLogPG::find_object_context() with empty clone_snaps vector on tier pool
  • Category set to Tiering

#4 Updated by David Disseldorp 9 days ago

To (relatively) stabilise the frequently crashing OSDs, we've added an early -ENOENT return to PrimaryLogPG::find_object_context() on detection of an empty snapset.clone_snaps list. We've also dropped the assert in SnapMapper::get_snaps(). Root cause analysis and snapset metadata repair investigations are still ongoing.
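
The shape of the early-return guard described above can be sketched in isolation. This is not the actual patch: lookup_clone_range(), CloneSnaps and self_check_lookup() are hypothetical names, uint64_t stands in for snapid_t, and treating a missing map entry as -ENOENT (rather than asserting, as the original code does) is an assumption made only to keep the sketch self-contained.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for SnapSet::clone_snaps.
using CloneSnaps = std::map<uint64_t, std::vector<uint64_t>>;

// Looks up the snap range recorded for a clone.
// Returns 0 and fills *first / *last on success.
// Returns -ENOENT if the entry is missing or its vector is empty,
// instead of calling back()/front() on an empty vector.
int lookup_clone_range(const CloneSnaps& clone_snaps, uint64_t snap,
                       uint64_t* first, uint64_t* last) {
  auto p = clone_snaps.find(snap);
  if (p == clone_snaps.end() || p->second.empty()) {
    return -ENOENT;
  }
  *first = p->second.back();
  *last = p->second.front();
  return 0;
}

// Exercises the guard with the empty-vector state from the core dump.
void self_check_lookup() {
  CloneSnaps cs;
  cs[65215] = {};  // key present, vector empty
  uint64_t first = 0, last = 0;
  assert(lookup_clone_range(cs, 65215, &first, &last) == -ENOENT);
}
```

A guard like this only avoids the crash. The object still carries inconsistent snapset metadata, so clients reading that clone would see -ENOENT until the metadata is repaired.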

#5 Updated by Joao Luis 6 days ago

  • Status changed from New to Triaged
