Feature #18052
Replace past_intervals with more compact structure
Status: open, 0% done
Description
Currently, we maintain one record for every interval back to the last interval in which the pg went clean. This is pretty wasteful. We only actually use past_intervals for two things:
1) Generating the PriorSet and determining whether the pg can go active given responses from the currently up osds (see PG::PriorSet)
2) Generating the set of osds which might have unfound objects (see build_might_have_unfound).
For 2), we can simply track the complete set of osds which have been in the PG's acting set since the last time it went active+clean. For 1), we basically need to track the set of unique acting sets over the set of maybe_went_rw intervals.
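To make the two pieces of state concrete, here is a minimal sketch of what such a summary could look like. All names (Interval, CompactPastIntervals, add_interval) are hypothetical, simplified stand-ins rather than the real Ceph types, and the sketch deliberately drops the epoch bounds, which the lost_at condition from task 1 may turn out to need.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real Ceph types.
using osd_id = int;
using epoch_t = std::uint32_t;

// One past interval as we record it today (heavily simplified).
struct Interval {
  epoch_t first;               // first epoch of the interval (inclusive)
  epoch_t last;                // last epoch of the interval (inclusive)
  std::vector<osd_id> acting;  // acting set, in order
  bool maybe_went_rw;          // could the PG have accepted writes?
};

// Compact summary: no per-interval records are kept.
struct CompactPastIntervals {
  // Use 2: every osd that was in any acting set since the last clean.
  std::set<osd_id> all_participants;
  // Use 1: distinct acting sets seen in maybe_went_rw intervals.
  std::set<std::vector<osd_id>> rw_acting_sets;

  // Fold one interval into the summary.
  void add_interval(const Interval& i) {
    assert(i.first <= i.last);
    all_participants.insert(i.acting.begin(), i.acting.end());
    if (i.maybe_went_rw) {
      rw_acting_sets.insert(i.acting);  // duplicates collapse in the set
    }
  }

  // Reset when the PG goes active+clean.
  void clear() {
    all_participants.clear();
    rw_acting_sets.clear();
  }
};
```

With this shape, a PG that flaps between the same few acting sets many times stores each set once, instead of one record per interval.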
Tasks:
1) Add a comment to the PriorSet constructor explaining precisely how the lost_at condition works (I'm not totally clear on this, we'll need to understand it to replicate the condition with the new structure)
2) Add a document to doc/dev/osd_internals explaining the role past_intervals currently plays.
-- Here is where the new PR starts
3) Update that document to explain the new structure, how it replaces past_intervals, and how we deal with mixed clusters.
4) Implement the new structure, while continuing to use the old structure for clusters without require_<target_version> set. When the OSDMap flag flips, OSDs will also need to be able to convert the in-memory representation to the new version and start using it.
Some thoughts:
I think the easiest way to do this would be to create a new PastIntervals type which is internally either the current representation or the new one (boost::variant probably). Encoding the current variant produces the current on-disk/on-wire encoding. Encoding the new variant produces the new one. Decoding naturally chooses one or the other based on the struct_v value. When the PG in-memory structures are initialized, we choose one or the other based on the flag in the OSDMap at the PG's current epoch. This handles on-disk upgrades as well, since we initialize the PG structure at the same map as the one used to write it out (exception: ceph-objectstore-tool, fix).
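A rough sketch of the wrapper idea follows. It uses std::variant instead of boost::variant so that it compiles on its own with C++17. The version constants, the one-byte struct_v prefix, and the opaque payload vectors are assumptions made for illustration; the real encoding would go through the usual versioned encode/decode machinery, and LegacyRep/CompactRep are placeholders for the two actual representations.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical version tags; the real struct_v values would differ.
constexpr std::uint8_t LEGACY_V = 1;
constexpr std::uint8_t COMPACT_V = 2;

// Placeholders: payload stands in for each representation's contents.
struct LegacyRep  { std::vector<std::uint8_t> payload; };
struct CompactRep { std::vector<std::uint8_t> payload; };

class PastIntervals {
public:
  // Pick the representation from the OSDMap flag at PG init time.
  explicit PastIntervals(bool require_new_flag) {
    if (require_new_flag) {
      rep_ = CompactRep{};
    }  // otherwise the variant default-constructs to LegacyRep
  }

  bool is_compact() const {
    return std::holds_alternative<CompactRep>(rep_);
  }

  // Each alternative writes its own version tag, then its contents.
  std::vector<std::uint8_t> encode() const {
    std::vector<std::uint8_t> out;
    std::visit([&out](const auto& r) {
      using T = std::decay_t<decltype(r)>;
      out.push_back(std::is_same_v<T, CompactRep> ? COMPACT_V : LEGACY_V);
      out.insert(out.end(), r.payload.begin(), r.payload.end());
    }, rep_);
    assert(!out.empty());  // at least the version byte is present
    return out;
  }

  // Decoding chooses the alternative from the leading version byte.
  static PastIntervals decode(const std::vector<std::uint8_t>& in) {
    if (in.empty()) {
      throw std::invalid_argument("empty buffer");
    }
    std::vector<std::uint8_t> body(in.begin() + 1, in.end());
    PastIntervals p(false);
    if (in[0] == LEGACY_V) {
      p.rep_ = LegacyRep{std::move(body)};
    } else if (in[0] == COMPACT_V) {
      p.rep_ = CompactRep{std::move(body)};
    } else {
      throw std::invalid_argument("unknown struct_v");
    }
    return p;
  }

private:
  std::variant<LegacyRep, CompactRep> rep_;
};
```

The sketch leaves out the in-memory conversion from the old representation to the new one when the flag flips, since that depends on the details of the new structure.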