Feature #18052

open

Replace past_intervals with more compact structure

Added by Samuel Just over 7 years ago. Updated about 7 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Currently, we maintain one record for every interval back to the last interval in which the pg went clean. This is pretty wasteful. We only actually use past_intervals for two things:
1) Generating the PriorSet and determining whether the pg can go active given responses from the currently up osds (see PG::PriorSet)
2) Generating the set of osds which might have unfound objects (see build_might_have_unfound).

For 2), we can simply track the complete set of osds which have been in the PG's acting set since the last time it went active+clean. For 1), we need to track the set of unique acting sets across the maybe_went_rw intervals.
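A minimal sketch of what such a compact structure could look like, written as standalone C++ rather than against the Ceph tree. Every name here (CompactPastIntervals, Interval, add_interval, every_rw_set_has_up_member) is hypothetical, and the up-member check is only a simplified stand-in for the PriorSet logic: it ignores lost_at and any erasure-coding specifics.

```cpp
#include <cassert>  // used by example_usage() below
#include <set>
#include <vector>

// Hypothetical names throughout; these are not Ceph's actual types.
using osd_id = int;
using acting_set = std::vector<osd_id>;  // order kept as given

struct Interval {
  acting_set acting;   // acting set during this interval
  bool maybe_went_rw;  // true if writes may have been accepted
};

struct CompactPastIntervals {
  // Use 2): every OSD seen in an acting set since the last active+clean.
  std::set<osd_id> all_participants;
  // Use 1): unique acting sets over the maybe_went_rw intervals only.
  std::set<acting_set> rw_acting_sets;

  void add_interval(const Interval& i) {
    all_participants.insert(i.acting.begin(), i.acting.end());
    if (i.maybe_went_rw) rw_acting_sets.insert(i.acting);
  }

  // The PG went active+clean: earlier history is no longer needed.
  void clear() {
    all_participants.clear();
    rw_acting_sets.clear();
  }

  // Simplified stand-in for the PriorSet check (assumption: no lost_at
  // handling): every recorded rw acting set needs at least one up member.
  bool every_rw_set_has_up_member(const std::set<osd_id>& up) const {
    for (const auto& acting : rw_acting_sets) {
      bool found = false;
      for (osd_id o : acting) {
        if (up.count(o)) { found = true; break; }
      }
      if (!found) return false;
    }
    return true;
  }
};

// Usage: three intervals collapse to four participants and one rw set.
inline void example_usage() {
  CompactPastIntervals p;
  p.add_interval({{0, 1, 2}, true});
  p.add_interval({{0, 1, 3}, false});
  p.add_interval({{0, 1, 2}, true});
  assert(p.all_participants.size() == 4);
  assert(p.rw_acting_sets.size() == 1);
}
```

The point of the sketch is that repeated or non-rw intervals add no new records, so the size is bounded by the number of distinct acting sets rather than the number of intervals.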

Tasks:
1) Add a comment to the PriorSet constructor explaining precisely how the lost_at condition works (I'm not totally clear on this; we'll need to understand it to replicate the condition with the new structure).
2) Add a document to doc/dev/osd_internals explaining the role past_intervals currently plays.
-- Here is where the new PR starts
3) Update that document to explain the new structure, how it replaces past_intervals, and how we deal with mixed clusters.
4) Implement the new structure, while continuing to use the old structure for clusters without require_<target_version> set. When the OSDMap flag flips, OSDs will also need to convert the in-memory representation to the new version and start using it.

Some thoughts:
I think the easiest way to do this would be to create a new PastIntervals type which is internally either the current representation or the new one (boost::variant probably). Encoding the current variant produces the current on-disk/on-wire encoding. Encoding the new variant produces the new one. Decoding naturally chooses one or the other based on the struct_v value. When the PG's in-memory structures are initialized, we choose one or the other based on the flag in the OSDMap for the PG's current epoch. This handles on-disk upgrades as well, since we initialize the PG structure at the same map as the one used to write it out (exception: ceph-objectstore-tool, which will need a fix).
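The variant idea above could be sketched roughly as follows. This is an illustration under several assumptions, not a proposed implementation: it uses std::variant (C++17) in place of boost::variant, a plain vector of integers stands in for a bufferlist, the first word plays the role of struct_v, and the compact representation is reduced to the participant set for brevity. All type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <variant>
#include <vector>

// All names below are hypothetical; this is not Ceph's real encoding.
using osd_id = std::int32_t;
using Words = std::vector<std::int32_t>;  // toy stand-in for a bufferlist

struct LegacyRep {   // one record per interval
  std::vector<std::vector<osd_id>> acting_per_interval;
};
struct CompactRep {  // only the union of participants (reduced for brevity)
  std::set<osd_id> all_participants;
};

constexpr std::int32_t LEGACY_V = 1;
constexpr std::int32_t COMPACT_V = 2;

class PastIntervals {
  std::variant<LegacyRep, CompactRep> rep;  // holds LegacyRep by default

public:
  // Pick the representation from the (assumed) OSDMap flag.
  explicit PastIntervals(bool require_new) {
    if (require_new) rep = CompactRep{};
  }

  bool is_compact() const { return std::holds_alternative<CompactRep>(rep); }

  void add_interval(const std::vector<osd_id>& acting) {
    if (auto* c = std::get_if<CompactRep>(&rep))
      c->all_participants.insert(acting.begin(), acting.end());
    else
      std::get<LegacyRep>(rep).acting_per_interval.push_back(acting);
  }

  // Called when the flag flips: convert the in-memory state in place.
  void upgrade() {
    if (auto* old = std::get_if<LegacyRep>(&rep)) {
      CompactRep c;
      for (const auto& a : old->acting_per_interval)
        c.all_participants.insert(a.begin(), a.end());
      rep = std::move(c);
    }
  }

  std::set<osd_id> participants() const {
    if (auto* c = std::get_if<CompactRep>(&rep)) return c->all_participants;
    std::set<osd_id> out;
    for (const auto& a : std::get<LegacyRep>(rep).acting_per_interval)
      out.insert(a.begin(), a.end());
    return out;
  }

  // Each variant writes its own format; the first word acts as struct_v.
  Words encode() const {
    Words w;
    if (auto* c = std::get_if<CompactRep>(&rep)) {
      w.push_back(COMPACT_V);
      w.insert(w.end(), c->all_participants.begin(), c->all_participants.end());
    } else {
      w.push_back(LEGACY_V);
      for (const auto& a : std::get<LegacyRep>(rep).acting_per_interval) {
        w.push_back(static_cast<std::int32_t>(a.size()));
        w.insert(w.end(), a.begin(), a.end());
      }
    }
    return w;
  }

  // Decoding picks the variant from the version word (no bounds checks).
  static PastIntervals decode(const Words& w) {
    assert(!w.empty());
    PastIntervals p(w[0] == COMPACT_V);
    if (p.is_compact()) {
      p.add_interval(std::vector<osd_id>(w.begin() + 1, w.end()));
    } else {
      for (std::size_t i = 1; i < w.size();) {
        std::size_t n = static_cast<std::size_t>(w[i++]);
        p.add_interval(std::vector<osd_id>(w.begin() + i, w.begin() + i + n));
        i += n;
      }
    }
    return p;
  }
};
```

With this shape, callers only see PastIntervals; whether the legacy or compact form is in use is decided once at construction or decode time, and upgrade() covers the case where the flag flips while the PG is already in memory.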
