Transcript - Erasure coded storage backend (step 2)

Transcript of the Erasure coded storage backend (step 2) presentation held August 6th, 2013 during the Emperor summit.

Patrick Mc Garry : The next one we have here is erasure coding backend and it looks like we have Loic and Christophe on so we should be able to do that. Loic you want to give a little bit of background on this for the ones who are watching this but do not know what it is about ? And then we'll jump in to where we're at now.
Loic Dachary : From the blueprint I will comment on the first part of the blueprint where I drew a comparison between replicas and erasure coding. So hopefull that will introduce people who are not already familiar with erasure coding. At the top of the drawing in the blueprint it shows what happens when a client requires to store a 10MB object with a replicated ceph pool, if you have a three replicas ceph pool amounting to 30MB. At the bottom of the drawing I sketched what happens with erasure coding : at the bottom you have the same 10MB object and erasure coding essentially slices this object in ten pieces of 1MB each and then it calculates three parity chunks ( for instance ) if you call your encoding function with parameters M=10 and K=3 meaning I want to slice my data in 10 and K=3 meaning I want 3 parity chunks. So it gets 13 chunks total and the way we think it will be implemented is that each chunk will go to a different OSD so I drew 13 OSD and each of them will get a chunk. So instead of storing 30MB you store 13MB so it's a huge win. Now what happens if you lose an OSD ? The situation is actually better than with the replicas because the erasure coding functions are so you can lose up to K chunks and still be able to recover the original object. If you loose, for instance, one parity chunk and two data chunks then it's enough that you read ten chunks, any chunks, to rebuild the original 10MB object. So it's very appealing, both in terms of space usage and resilience. That's the core of my understanding of erasure coding, it's also the motivation of my company to pay my salary and work on that. Now, it's not just my company : the blueprint was originally proposed by AnnaiSystems who works with genomics data and in general anyone with a large set of fairly cold data would be interested in such a win.
After this introduction, back to the PAD that is a short agenda of what I'm proposing to talk about and I'm already a bit behind because it's already 5 minutes in. Technically speaking it was fairly obvious from the beginning that erasure coding would be a property of a pool and as such implemented in placement groups. What happened during dumpling is that is became fairly obvious that we should limit the operations supported by the erasure coded pool to write, append and read of course. That is : all the fancy stuff that replicated pools can do might not be useful for erasure coding pool. And the reason is that with automatic tiering you can get the flexibility of a replicated pool in front of the erasure coded pool and when you want to save space you just demote the objects, automatically, to the erasure coded pool. So there is need to cope with the complexity of implementing the fancy operations on the erasure coded pool although that might be fun. What happened quite recently is that Samuel Just produced a high level design of the implementation which is going to be discussed next. It has been enlightening for me because in the past three months I did not no very much about the code and I discovered a lot of it but I was still unable to produce that high level design. Now we have it and hopefully for Emperor we will be able to complete this high level design : maybe there are a few things missing. And I will be working, hopefully with others, on implementing the core changes and the PG architecture that will allow the introduction of the PGBackend.h that will make the implementation of the erasure coded pool possible. I also thought ( but it was just this morning ) that it would be interesting for a few alpha tester to get ready and start trying things because it may take a few weeks for people to get ready with machines to run the tests. Maybe now is a good time to call for alpha testers. I'm now 8 minutes into the 5 minutes introduction and I propose that Sam gives a walkthru of his erasure coding design document.
Samuel Just : The main trick with erasure coded PG is managing log recovery during peering. For those who aren't familiar, ReplicatedPG is (currently the way the OSDs work) each placement group maintains an ordered log of all operations back to whatever the last time it was trimmed which is some number where the last time the PG was clean. And in the event that the acting set changes, the primary OSD will gather logs from everyone who may have touched the placement group recently and it will use those to determine an authoritative history for events. At that point it will either roll forward or back all members of the acting set to that log, using their logs to determine which objects need to be recovered. This works fine for replicated PG because whichever OSD had the authoritative log must have objects that match it so if you would need to rollback object foo because it was changed in the replicas log but not the authoritative log then we know that the authoritative log's OSD has a clean copy of that object, for the most part. That's not necessarily true for erasure coded PG though because you can imagine a code with ten chunks and and four parity chunks where you can find seven for the old version and seven for the old version where neither would be able to be reconstructed. So I think the key is to be able to ensure that using the log events alone we can always roll a placement group back though not necessarily forward. So to that end I suggest that we restrict the erasure coded ( this is under the client write section ) operations to CEPH_OSD_OP_APPEND, CEPH_OSD_OP_DELETE and CEPH_OSD_OP_(SET|RM)ATTR. Append is fairly straightforward, we simply keep the old size of the object in the log event and rolling it back is as simple as truncating it to the size listed in the log event. Delete is a bit of a problem because if we need to rollback we'll need the old copy of the object. So I think the way we do that is by writing a out versioned object and keeping around the old version of the object until the last_complete of the PG catches up with the version at which the object was deleted. At which point the replicas can independantly garbage collect behind the log as long as we never garbage collect past the last_complete point we can be sure that we won't have to inadvertendly have to rollback a delete that we now can't rollback. But we'll get to that in peering. The attribute part could be skipped except that I expect that it's necessary for managing metadata related to tiering and other things like that. So, as long as attributes are small we can simply include the information from the attribute, overwritten or removed in the log event and use that to roll it back. If anyone has a better idea that that, it would be good.

Sage Weil : I think that seems fine. They'll be small.
Samuel Just : Now to peering and PG logs. ??? into that we need to know how to chose an authoritative log. With replicated PG, as long as we have a log from the most recent interval in which the OSDs went active any such log will do, so we only need to talk to one replicas from the most recent interval in which the placement group went active. With erasure coded PG that's potentially dangerous because it's possible that we select ...

Loic Dachary : I have a question about that. I'm not sure to understand the rationale. When you have two replicas and two logs from the most recent interval, you chose the one that has the longest tail. Why is that ?
Samuel Just : it makes it more likely that we'll be able to do log based recovery. So skipping forward a bit, there are two kinds of recovery. There is log based recovery which we can do when two OSDs have overlapping PG logs that means we know the exact set of objects that needs to be recovered in order to bring the OSD up to date for that placement group. If the logs do not overlap, we don't know which objects have changed and we need to scan the PG collection to determine which objects are out of date : this is known as backfill. So a longer log tail means that we should be able to avoid backfill on more OSDs. Does that make sense ?
Loic Dachary : Yes it does.
Samuel Just : Ok. The trick with erasure coded PG is that if we only talk to one OSD from the last interval in which the OSD went active it's possible that we chose a log that is logger than anyone elses log which is a problem because while we can rollback, we can't necessarily roll forward. So it seems to me that we need to talk to at least to M ( where M is the number of chunks needed to reconstruct an object ) and chose the log among those with the oldest last_update and then roll everyone back to that point. This is safe because if a log event was not commited to all of the OSDs pg logs then it must not have been reported commited to the client so the client is going to replay it.
Loic Dachary : is true ? Because if we have M=10 + K=3 so you get ten OSDs back and out of these ten only nine recorded the log and one didn't. But let's say one of the OSDs that disapeared did record the log so that would make ten and that could be enough for the write operation to successfull be commited.
Samuel Just : so, that's a detail, I'm assuming we won't report it commited until we've commited to all of the OSDs. If it's the case that we chose to report it commited earlier, then the strategy will need to be modified.
Loic Dachary : so reporting success means all the OSDs recorded the log.
Sage Weil : the up ones, yes. And we'll know who that is I guess from the previous peering interval that we have.
Samuel Just : correct. So that's another area, if anyone has a better idea that would be good. So the rest of this is relatively straightforward. Well almost, ok :-)
Sage Weil : :-)
Samuel Just : There are two more pieces. With the crush rules you've all been writing if you lose the primary typically the second, the third etc. will shuffle forward one in the acting set. We don't want that to happen for erasure coded PGs because we want the acting set position to map onto the chunk position. This is fine crush has a crush chose independant mode ( indep ) that will chose each position in the acting set independantly of the others. So that's fine but we need to modify it slightly so that we'll leave holes if you can't fill a slot. So that's a relatively minor change but the acting set needs to be able to have empty positions. Similarly we need to be able to decouple primariness from the first element of the acting set. Currently whichever OSD ends up in the acting set position zero ends up as primary. And that's also fine for the most part for erasure coded PG but in certain scenarios when we are generating PG temp mappings we would need to generate a primary that is independant from the acting set position zero. And this is important for backfill. So currently the way backfill works is that when the primary identifies that it wants to bring an empty OSD into the placement group, it will generate a PG temp mapping with as many previous holders of the PG as possible so that it can continue serving reads and write with as much of the correct replica size as possible and then it will stick the backfill peer onto the end of that acting set but not actually replicate all I/Os to it. And it will backfill to it in the background. This is inconvenient for erasure coded PGs because whatever acting set slot we would have selected for the backfill peer, we probably wanted to select for a previous holder of the PG. For one thing. Secondly, if the backfill peer is eventually going to come in as an acting set slot zero we, for one, don't want it to be the primary and two we don't want to backfill it for something else. So, for those reasons, you should be able to declare in a PG temp that an OSD should be primary independant of its position in the acting set and backfill peer should not show up in the acting set. So these are the core changes related to the acting set.
Loic Dachary : I understand the logic for decoupling the primary from being in the first position. I'm not sure to understand why you would not want the backfill peer to show in the acting set.
Samuel Just : only because it's not clear ?? position ?? show up. So if the acting set changes from [0,1,2] to [3,1,2] for example and for some reason OSD 0 is still around then it's likely that we want probably OSD 1 to be primary and we want the acting set for the time being to be [0,1,2]. We want OSD 3 be backfilled but we want it to be backfilled for the acting set zero position, not for example, for the acting set four position.
Sage Weil : so it occurs to me that if we can specify the primary it can actually use the acting set ? once because because clients are doing all I/Os to the primary so it could have its own internal map. Regardless of what order crush is returning things. And then it's like : oh, I'll use the third one. The primary can do that translation unless the client needs to go directly. The initial implementation is going to have all reads / writes going to the primary?
Samuel Just : yes
Sage Weil : so it might not matter in that case. But later when we want to have the client read directly from the replicas or the shards or whatever then that would not work. But that might simplify things a little bit.
Samuel Just : so that would be analogous replicated PG not actually create a PG temp but creating a temporary primary and that primary is allowed to simply create several acting sets at once ?
Sage Weil : yes
Samuel Just : There are a few places where the replicas use the fact that they are in the acting set to make some minor decisions but those are all replaceable.
Sage Weil : That's going to be sort of the hard part of this change because it's getting dependencies of knowing when to refresh, reset the state machine and so forth. Catches in all these places.
Samuel Just : we probably shoulnd't
Sage Weil : it still seems to me that it would be a worthwhile change to decouple backfill from the acting set because then, for example, we could run multiple backfills in parallel for example. Who knows what.
Samuel Just : that one is reasonable. The question is : do we want to manipulate the acting set so that it maps on to where the I/O is actually going ? I think we want to actually keep doing that because any change in the acting set needs to be reflected as an OSDMap visible change. I think any change in the acting set does.
Sage Weil : yes. Ok.
Samuel Just : backfill peers don't matter for that though because we don't send them I/Os.
Sage Weil : Allright, from the clients perspective, right.
Samuel Just : And for subsequent recovery perspective.
Sage Weil : although, notably the client only resend I/O, at least for the replicated pgs, they only send recent I/Os if the primary changes, not the acting set. Samuel Just : I assume that from the OSDs point of view : if the interval has changed.
Sage Weil : ok, that makes sense to me.
Samuel Just : and we can also chose to never query backfill peers, during the GetInfo phase because we have all that irritating logic that sorts out which logs comes from backfill peers therefore are not as useful as they could be.
Sage Weil : yes.
Patrick McGarry : I see comments from IRC, who's doing erasure coding ? The client or another OSD ?
Loic Dachary : I think the question is : will the client know about the erasure coding logic. And initially : no. The OSD will do it and it will be transparent to the client.
Patrick McGarry : IRC, Vincent : If its another OSD, I can think of just replicating logs over 2/3 OSDs.
Samuel Just : we exploit the fact that the logs are written out atomically with the actual ? so decoupling the logs from the object would complicate things, a lot. If I understand that correctly.
Sage Weil : If you're doing an append. Oh no, you would have to write an entire stripe at a time. Ok. So the log is actually replicated on all members of the PG. There are a couple of questions that we should probably address in the PAD. There is the question about the coll_t
Samuel Just : that's actually a pretty boring answer. So we're suggesting that the OSD be able to hold multiple copies of the same placement group. Each copy of the placement group gets its own coll_t. Because they get their own collections. So the collection needs another _chunk_id or something
Loic Dachary : My question was rather that if you store that in the coll_t it will get parsed as pg_t so when you change the chunk_id you have to rename the directory. Could it be maybe stored in the pg_info_t that is loaded from leveldb ?
Samuel Just ; coll_t will probably parse into a cpg_t in the future, because that's what it is now.
Sage Weil : or maybe pg_shard_t or ...
Loic Dachary : I think what troubles me too is that taking this aproach you hardwire the fact that there is a chunk id although there might be other backends in the future. What if another backend does not need that chunk_id ? What happens then ?
Samuel Just : I think we'll set a special relevance flag or something. I mean that's what we'll do for replicated pgs.
Loic Dachary : what if you have ten backends, each needing different sets of flags of the same kind ? Will you grow indefinitely coll_t ?
Samuel Just : Actually coll_t is not the problem. At that point we would also need to grow the way in which ... so OSDs have a bunch of internal maps related to placement groups and those need to differentiate between ( in this case shards but ... ) I can see your point about other backends. Other flavors of the same placement group. So coll_t is relatively easy to use : you have coll_t take a whatever we make pg_t wrapper and and a string or something. It's the pg_t type that's the problem. So you're right : the direction we're going now would involve adding another tuple field for each backend, assuming the chunk_id it could not just be overloaded but I don't think that's worth worrying a lot about right now.
Loic Dachary : ok, I don't mind.
Samuel Just : unless you have a different backend in mind that we could work on ?Loic Dachary : no :-) I had two other questions about snapshots and watchers and I guess they are kind of outside of the scope and won't be supported by erasure coded pgs.
Samuel Just : they could be.
Loic Dachary : the watchers are going to be supported because it's not dependent on erasure coding or replicas, right ?
Samuel Just : no, it's equivalent to an xattr write.
Sage Weil : I think the snapshot would be easy because you can easily roll it back.
Samuel Just : yes.
Sage Weil : a clone event and then if you don't have enough you just ... yes. So I think snapshots are yes.
Loic Dachary : they would be supported ? Hum that's hum...
Samuel Just : I mean they could be. But I don't know if we want to bother with it. Unless your tiering proposal needs it.
Loic Dachary : ok, they could be.
Samuel Just : so watch would act a lot like an xattr write. Si you would simply log whatever piece of the object you need to recover. So the old set of watchers. Loic Dachary : another quesiton I had was related to the PGBackend and the PGBackendInterface. Essentially in a mail I advocated for having a PGBackendInterface which is the abstract interface which you described in PGBackend.h but it would then be derived into a PGBackend including PGLogs for instance or PGRegistry that contains the ObjectContext, SnapSetContext so that it can be used by any backend. The upside of having a PGBackend interface is it makes it easier for unit tests to create conditions and trick the PGBackend into thinking that something goes wrong in a way you can synthetise of you're in control of the PGBackend interface. Do I make any sense ?
Samuel Just : PGBackend could just easily be name PGBackendInterface it's an abstract interface. I did not add the "Interface" because it contains so many abstracts anyway :-)
Sage Weil : If we want to make a clean interface you don't want to put a lot of code in the parent class.
Loic Dachary : to conclude : one of the great things about the document you wrote Sam is that it makes it much easier for people to get involved because it divides the big work into much smaller and attainable parts. So, if anyone is willing to join at this point, I think Sam managed to create a situation where the most difficult design decisions have been made and I feel confortable jumping into coding a part of it without worrying about the general direction of the implementation of erasure coding. And this is great. I would not call it a low hanging fruit, but it's most definitely the easiest part to get involved with, is the erasure coding plugin API. I spoke to a number of people who want to try new erasure coding strategies, algorithms etc. and it appears that the logic for doing the erasure coding, the coding of the thirteen chunks I talked about during the introduction is kind of tricky and we have a library that does the job and it is efficient. But afterwards it may be interesting to use, for instance local recovery erasure codes, and in order to do that we defined an abstract API that can be implemented by plugins and be dynamically loaded. There is a blueprint and tickets for that and it's probably a matter of a few weeks work at most for someone who does not even know the Ceph code base. That would be the easiest task to tackle. There is a sort of home page for the erasure code work which is #4929 and from there you can find whatever you want to work on.
Sage Weil : Vincent had another question on IRC about how efficient this will be for small I/Os. I think the answer is at the bottom of the pad. Short answer : probably not very because you have one I/O that turns into 13 I/Os on 13 OSDs and this isn't ideal.
Samuel Just : RBD may not count as a small object per-se though. Especially if the process of moving an RBD image to cold storage also involves aggregating the 4MB chunks into bigger chunks.
Sage Weil : or you can change the 4MB chunks into 16MB chunks and then each I/O is about 1MB and you get the right balance. ( Reading question from IRC ) And what about erasure coding for RBD ? Yes and no. See : it's hard to do a random mutation to an existing object with erasure coding, it's not efficient to do that they are only really usefull for non ??? that's being modified. The tiering stuff will hopefully be able to take a cold block and push it to erasure coded pg but then in order to write it, it would have to pull it back again to do the mutation in the replicated pool.
Loic Dachary : in the developer notes that are linked in the PAD I described what it would mean to support partial writes with erasure coding and it's more complicated, not just less efficient, but it's going to be complicated to implement. Is it worth it if you have automatic tiering ? I'm not sure, we'll see.
Sage Weil : yes
Samuel Just : tiering may be another kind of latency.