Feature #7238
erasure code : implement LRC plugin
Associated revisions
erasure-code: locally repairable code plugin
Recursively apply erasure code techniques so that recovering from the
loss of some chunks only requires a subset of the available chunks, most
of the time.
http://tracker.ceph.com/issues/7238 Fixes: #7238
Signed-off-by: Loic Dachary <loic@dachary.org>
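The local repair idea in the commit message can be sketched with a toy XOR code (a hypothetical illustration with made-up helper names; the actual plugin recursively layers real erasure code techniques, not bare XOR):

```python
# Hypothetical sketch of local repair: each group of data chunks gets a
# local XOR parity, so a single lost chunk is rebuilt from its own group
# instead of reading every chunk in the stripe.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def make_local_parities(chunks, group_size):
    """Append one XOR parity per group of `group_size` data chunks."""
    parities = []
    for i in range(0, len(chunks), group_size):
        group = chunks[i:i + group_size]
        p = group[0]
        for c in group[1:]:
            p = xor(p, c)
        parities.append(p)
    return parities

def repair(chunks, parities, lost, group_size):
    """Rebuild chunk `lost` reading only its local group and its parity."""
    g = lost // group_size
    p = parities[g]
    for i in range(g * group_size, min((g + 1) * group_size, len(chunks))):
        if i != lost:
            p = xor(p, chunks[i])
    return p

data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]   # k = 4, two local groups of 2
par = make_local_parities(data, group_size=2)
assert repair(data, par, lost=1, group_size=2) == b"bbbb"
```

Repairing chunk 1 here reads only chunk 0 and the first parity, which is the "subset of the available chunks" property the description refers to.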
History
#1 Updated by Loïc Dachary about 9 years ago
- Parent task set to #4929
#2 Updated by Loïc Dachary about 9 years ago
- Status changed from In Progress to Fix Under Review
- % Done changed from 0 to 80
#3 Updated by Loïc Dachary about 9 years ago
- Parent task changed from #4929 to #7266
#4 Updated by Loïc Dachary about 9 years ago
- Status changed from Fix Under Review to In Progress
Refactoring so as not to introduce new API functions, as doing so turned out to be more complicated.
#5 Updated by Loïc Dachary about 9 years ago
- Description updated (diff)
#6 Updated by Loïc Dachary almost 9 years ago
- Parent task deleted (#7266)
#7 Updated by Loïc Dachary almost 9 years ago
- Tracker changed from Subtask to Feature
#8 Updated by Loïc Dachary almost 9 years ago
- Target version set to 0.83
#9 Updated by Loïc Dachary almost 9 years ago
- Subject changed from erasure code : implement pyramid plugin to erasure code : implement lrc plugin
#10 Updated by Loïc Dachary almost 9 years ago
- Subject changed from erasure code : implement lrc plugin to erasure code : implement LRC plugin
#11 Updated by Loïc Dachary almost 9 years ago
- Description updated (diff)
#12 Updated by Loïc Dachary almost 9 years ago
- Status changed from In Progress to Fix Under Review
#13 Updated by Loïc Dachary almost 9 years ago
- Status changed from Fix Under Review to 7
#14 Updated by Loïc Dachary almost 9 years ago
- File wip-7238-lrc.yaml added
One job died; running it again with wip-7238-lrc.yaml
#15 Updated by Loïc Dachary almost 9 years ago
None of the failed jobs are related to erasure coded pools (the config file does not contain the string ec_, hence no erasure coded pool was created).
#16 Updated by Loïc Dachary almost 9 years ago
The job was reported as dead because it could not reach paddles, not because the job itself failed.
2014-06-12T12:50:21.894 INFO:teuthology.run:Summary data: {description: 'rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml thrashers/mapgap.yaml workloads/ec-radosbench.yaml}', duration: 3391.9412620067596, flavor: basic, mon.a-kernel-sha1: 22001f619f29ddf66582d834223dcff4c0b74595, mon.b-kernel-sha1: 22001f619f29ddf66582d834223dcff4c0b74595, owner: scheduled_loic@fold, success: true}
#17 Updated by Loïc Dachary almost 9 years ago
- Description updated (diff)
- Status changed from 7 to Fix Under Review
#18 Updated by Ian Colle almost 9 years ago
- Target version changed from 0.83 to 0.83 cont.
#19 Updated by Samuel Just over 8 years ago
- Target version changed from 0.83 cont. to 0.84
#20 Updated by Samuel Just over 8 years ago
- Target version changed from 0.84 to 0.85 cont.
#21 Updated by Samuel Just over 8 years ago
- Status changed from Fix Under Review to In Progress
#22 Updated by Loïc Dachary over 8 years ago
- Status changed from In Progress to 7
Teuthology job description:
- rados:
    clients:
      - client.0
    ec_pool: true
    erasure_code_profile:
      name: "LRCprofile"
      plugin: "LRC"
      ruleset-steps: "[ [ \"chooseleaf\", \"osd\", 0 ] ]"
      layers: "[ [ \"DDc\", \"\" ] ]"
      mapping: "DD_"
    objects: 500
    op_weights:
      append: 45
      delete: 10
      read: 45
      write: 0
    ops: 4000
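In the profile above, each character of the mapping and of a layer string describes one chunk position: D for a data chunk, c for a coding chunk, _ for a position the layer does not use. A hypothetical parsing helper (not the plugin's actual parser) makes the convention concrete:

```python
def parse_layer(layer: str):
    """Return (data_positions, coding_positions) for one layer string:
    'D' marks a data chunk, 'c' a coding chunk, '_' an ignored slot."""
    data = [i for i, ch in enumerate(layer) if ch == 'D']
    coding = [i for i, ch in enumerate(layer) if ch == 'c']
    return data, coding

# The single layer "DDc" above: data chunks at positions 0 and 1, one
# coding chunk at position 2. The mapping "DD_" places the stripe's two
# data chunks at those same positions.
assert parse_layer("DDc") == ([0, 1], [2])
```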
#23 Updated by Loïc Dachary over 8 years ago
#24 Updated by Loïc Dachary over 8 years ago
Fixed a few problems and running a firefly upgrade suite
#25 Updated by Loïc Dachary over 8 years ago
Canceled the teuthology run that did not contain any LRC workload and ran another.
#26 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
#27 Updated by Loïc Dachary over 8 years ago
Canceled the previous job because it did not have enough OSDs to complete (the LRC rule requires a minimum of 8 for each PG); scheduled another job.
#28 Updated by Loïc Dachary over 8 years ago
There is no need to run upgrade tests for a plugin that does not exist in firefly.
#29 Updated by Loïc Dachary over 8 years ago
Reserved three machines and ran the following job on them:
os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
      - slow request
      - scrub mismatch
      - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
  - osd.13
  - osd.14
  - osd.15
  - osd.16
  - osd.17
  - osd.18
  - osd.19
  - osd.20
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10
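With the simple profile form above (k=4, m=2, l=3), the plugin groups the k+m chunks into sets of size l and adds one local parity per set, so each stripe occupies 8 chunks; that matches the minimum of 8 OSDs per PG noted earlier in this ticket. A quick sanity check (hypothetical helper name):

```python
def lrc_chunk_count(k: int, m: int, l: int) -> int:
    """Chunks per stripe for the simple LRC form: k data chunks, m coding
    chunks, plus one local parity for each group of l chunks."""
    assert (k + m) % l == 0, "k + m must be a multiple of l"
    return k + m + (k + m) // l

# The profile above: k=4, m=2, l=3 -> 6 chunks in 2 groups of 3, each
# group adding one local parity, for 8 chunks total.
assert lrc_chunk_count(k=4, m=2, l=3) == 8
```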
#30 Updated by Loïc Dachary over 8 years ago
Fixed a bug that made the plugin incorrectly claim it could not recover when the last OSD was out; running tests again.
#31 Updated by Loïc Dachary over 8 years ago
- Status changed from 7 to Fix Under Review
Although thrashing tests using an LRC pool fail, I believe this is due to the size of the pool rather than the plugin itself. See http://tracker.ceph.com/issues/9209 for instance.
#32 Updated by Loïc Dachary over 8 years ago
- Status changed from Fix Under Review to 7
#33 Updated by Loïc Dachary over 8 years ago
- Status changed from 7 to Fix Under Review
The rados tests work (no thrashing).
#34 Updated by Loïc Dachary over 8 years ago
The rados test is running with thrashing, after a rebase on master.
#35 Updated by Loïc Dachary over 8 years ago
It crashes the OSD in decode(), but the plugin is silent about why it refuses to decode. Adding debug information to analyze in the logs and running the tests again.
#36 Updated by Loïc Dachary over 8 years ago
IsRecoverablePredicate must not assume the object is recoverable if the first K chunks are available. This is no longer true since data chunks can be remapped.
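The point can be illustrated with a hypothetical predicate (not the actual IsRecoverablePredicate code): a layer-aware check has to ask whether every layer can rebuild its missing chunks, instead of assuming the first K positions hold the data:

```python
def naive_is_recoverable(available, k):
    # WRONG once data chunks can be remapped: assumes the first k
    # positions are the data chunks.
    return set(range(k)) <= set(available)

def layered_is_recoverable(available, layers):
    """Hypothetical layer-aware check: each layer is (chunk positions,
    number of coding chunks). A layer can rebuild its missing chunks as
    long as it misses no more chunks than it has coding chunks; repaired
    chunks then count as available for the layers that follow."""
    avail = set(available)
    for chunks, coding in layers:
        missing = [c for c in chunks if c not in avail]
        if len(missing) > coding:
            return False
        avail.update(missing)
    return True

# A single "DDc"-style layer over positions 0..2 with one coding chunk:
layers = [((0, 1, 2), 1)]
assert layered_is_recoverable({1, 2}, layers) is True   # chunk 0 repairable
assert layered_is_recoverable({2}, layers) is False     # two chunks lost
```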
#37 Updated by Loïc Dachary over 8 years ago
the above teuthology test ran successfully, with thrashosds
2014-08-28 07:36:32,336.336 INFO:teuthology.run:Summary data: {duration: 6015.952927112579, flavor: basic, owner: loic@dachary.org, success: true}
#38 Updated by Loïc Dachary over 8 years ago
thrashosds passed because it had enough OSDs to never be in a situation where mapping fails. When that happens, it triggers http://tracker.ceph.com/issues/9263, which comes from the fact that decode_chunks skips the layers which do not contain chunks that are to be read. This is fine most of the time, but when a layer is required to repair a chunk that will help an upper layer, it fails.
The problem has been fixed and the thrasher runs again with a number of OSDs that has been verified to create bad mappings.
os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
      - slow request
      - scrub mismatch
      - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10
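The decode_chunks issue described above can be mimicked with a hypothetical two-layer model (made-up structure, not the plugin's code): a local layer repairs a chunk that is not itself requested, but an upper layer needs that chunk as input, so skipping "irrelevant" layers loses the repair:

```python
def decode(layers, available, want, skip_unrequested=False):
    """Hypothetical layered decode: each layer is (chunk positions,
    number of coding chunks) and can repair its missing chunks when it
    misses no more than it has coding chunks. The buggy shortcut skips
    layers that contain none of the requested chunks."""
    avail = set(available)
    for chunks, coding in layers:
        if skip_unrequested and not (set(chunks) & want):
            continue  # the bug: this layer may still repair useful input
        missing = [c for c in chunks if c not in avail]
        if len(missing) <= coding:
            avail.update(missing)
    return want <= avail

layers = [((0, 1, 2), 1),   # local layer: can repair one of chunks 0..2
          ((1, 3, 4), 1)]   # upper layer: needs chunk 1 to repair chunk 3
available = {0, 2, 4}       # chunks 1 and 3 are lost; only 3 is requested
assert decode(layers, available, want={3}) is True
assert decode(layers, available, want={3}, skip_unrequested=True) is False
```

With the shortcut enabled, the local layer is skipped because chunk 3 is not among its chunks, so chunk 1 is never repaired and the upper layer is left with two missing chunks and a single coding chunk.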
#39 Updated by Loïc Dachary over 8 years ago
the above teuthology test ran successfully, with thrashosds
2014-08-29 07:13:31,922.922 INFO:teuthology.run:Summary data: {duration: 5186.044275045395, flavor: basic, owner: loic@dachary.org, success: true}
#40 Updated by Loïc Dachary over 8 years ago
- Status changed from Fix Under Review to Resolved
- % Done changed from 80 to 100