Feature #8227

RFE: introduce “back in a bit” osd state

Added by Alexandre Oliva almost 10 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

Sometimes I want to bring an osd down for a bit, say because it is slowing the cluster down, because I want to run some commands on the disk that will often result in data loss if run while the cluster is running (I'm tracking what appears to be a btrfs bug along these lines), or just because I want to reboot the server that holds it.

If ceph is configured to mark osds out shortly after they fail, the PGs held by such a down OSD will start being fully replicated to other OSDs. That is most likely excessive, since the OSD will be back in a bit, and making the redundant copies will just slow things down.

Conversely, if ceph is configured not to mark down OSDs out automatically, or to do so only after a long delay, newly-created or modified objects will have a lower replication count than the PG is configured to hold, and there will be a window of exposure for as long as it takes the temporarily-down OSD to come back up and fully recover. Assuming the temporarily-down OSD doesn't bring the PG below min_size, that is.

My suggestion is some middle ground: a state for the OSD that causes its PGs to be remapped so that alternate OSDs hold replicas of modified objects only, but without starting a backfill of the rest. OSDs would go to the “back in a bit” state right after failing (or after some configurable time), which would cause their PGs to remap to other OSDs that would hold copies of newly-created or modified objects, and then (optionally) move to the state currently known as “out” after a longer period of time.
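For context, the trade-off described above is governed today by the monitors' down-to-out timer and by the noout flag; the proposal asks for a middle ground between the two. A minimal sketch of the existing knobs, assuming a cluster recent enough to have the ceph config command (the value is illustrative, not a recommendation):

    # Delay (seconds) before a down OSD is automatically marked out,
    # at which point full re-replication (backfill) of its PGs begins.
    ceph config set mon mon_osd_down_out_interval 600
    # Or: never mark down OSDs out automatically.
    ceph osd set noout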

History

#1 Updated by Greg Farnum almost 10 years ago

We've discussed this sort of log-only replica a few times in the past. It's conceptually simple, but unfortunately a fairly big effort to create and (especially) validate, since it creates a whole new kind of peer which doesn't fit into the normal peering and recovery framework at all. That means major changes to peering to differentiate between the peer kinds and figure out when you do and don't have all the data you need, major changes to recovery and backfill to make sure "log-only" replicas which get promoted to full actually see all the data they need and can deal with having log entries for data they don't have, etc etc etc.
None of that is to say it's impossible, but it's a major feature rather than something one could hack together in a few days, and the use case for it so far appears sufficiently specialized that it hasn't become worthwhile.

#2 Updated by Alexandre Oliva almost 10 years ago

Err... I'm confused; it looks like we already do pretty much everything it would take to implement this feature. Say, while we're recovering or backfilling an osd, we already push to it objects that are modified, don't we? So all it would take would be a PG state like those, but one that would NOT push other objects to it. Something like wait_biab, very much like wait_backfill or wait_recovery. What am I missing?

#3 Updated by Ian Colle almost 10 years ago

  • Tracker changed from Bug to Feature
  • Source changed from other to Community (dev)

#4 Updated by Greg Farnum almost 10 years ago

You're missing the ways in which this differs from existing recovery mechanisms:
1) For this state, we would want to keep a log which contains the changes to objects. In all other states, we keep a log that references changed objects (but not the contents of the change).
2) When recovering from an OSD that is partly up-to-date, we have a single line (backfill_thru or whatever it is) we can point to and say "everything prior to this line is up-to-date." With this feature, if we have an old copy of the PG which overlaps with this log, we instead need to recover whatever random objects are in the full data log.
3) Right now, OSDs that have newer data than their peers have a full copy of the data, and backfill works by copying the full object (not just the changed extents). We'd have to handle this too.

etc etc etc

None of this makes a feature like this a bad idea or an infeasible one to implement, but it is sufficiently large that it will require dedicated work and a use case compelling enough to make somebody pay for it. :)

#5 Updated by Alexandre Oliva almost 10 years ago

Greg, it looks like you're discussing difficulties related to the more elaborate plans for the log-only replica, which fulfills a similar goal but is quite different from what I suggested.

I've observed on my own cluster an osd (say osd0) start backfilling a PG (say 0.0) onto another osd (say osd1) while a third replica (say osd2) was down, and while the PG had objects being created, modified and removed. Part-way through the backfill, osd2 came back and osd0 went down, and osd2 could recover from osd1 the changes that had been made while osd2 was down, even though osd1 had been only partially backfilled. Once recovery completed, the PG advanced to the backfill state and continued backfilling osd1.

Now, while changes were being made to both osd0 and osd1, osd0 was also backfilling osd1. My suggestion amounts to nothing more than introducing a state in which osd0 “starts backfilling” osd1 but doesn't really backfill any objects, sending osd1 only the (full) objects it would send anyway because they were modified.

Heck, we even have a global nobackfill flag that might accomplish this by keeping PGs in the wait_backfill state; it's just not on a per-osd basis. Assuming modified objects are pushed to the to-be-backfilled replicas even in wait_backfill, that is.

Did I get any of the above wrong?
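For reference, the global flag mentioned above is toggled with the standard CLI (a sketch of the existing cluster-wide flag, not of the proposed per-osd variant):

    ceph osd set nobackfill     # pause backfill cluster-wide
    ceph osd unset nobackfill   # let backfill proceed again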

#6 Updated by Greg Farnum almost 10 years ago

I can't really follow the story you're telling, but even if you are backfilling only changed objects, you still have all of the "I hold a random subset of objects" difficulties later on. More than that, you now need to copy every object being touched to a new OSD before the write can go through.
And modified objects are not pushed to out-of-date OSDs when in wait_backfill. :)

#7 Updated by Patrick Donnelly about 5 years ago

  • Status changed from New to Resolved

We have a reasonable approximation of this with noout. Closing this.
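For readers arriving here, the noout approximation of “back in a bit” looks roughly like this (a sketch; the OSD id and the systemd unit name are placeholders for your deployment):

    ceph osd set noout            # down OSDs stay "in"; no backfill starts
    systemctl stop ceph-osd@0     # take the OSD down for maintenance
    # ... run disk commands, reboot the host, etc. ...
    systemctl start ceph-osd@0    # OSD rejoins and catches up via log-based recovery
    ceph osd unset noout          # restore the normal down -> out behaviour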
