Feature #7238

erasure code : implement LRC plugin

Added by Loïc Dachary about 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: OSD
Target version:
% Done: 100%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

ready for review

previous draft implementations 1 2

An erasure code plugin providing an implementation of ErasureCodeInterface. The caller can specify how to recursively apply erasure coding to the chunks to control the placement of the erasure coded chunks.

Attachment: wip-7238-lrc.yaml (1.96 KB), Loïc Dachary, 06/13/2014 08:51 AM


Related issues

Related to Ceph - Subtask #7146: implement osd crush rule create-erasure Resolved 01/14/2014
Related to Ceph - Subtask #6478: ErasureCode : XOR plugin Rejected 10/05/2013
Related to Ceph - Feature #8496: erasure-code: ErasureCode base class Resolved 05/31/2014
Blocked by Ceph - Feature #9025: erasure-code: chunk remapping Resolved 08/06/2014
Blocks Ceph - Feature #9034: erasure-code: better LRC strategy New

Associated revisions

Revision b0fd4815 (diff)
Added by Loïc Dachary over 9 years ago

erasure-code: locally repairable code plugin

Recursively apply erasure code techniques so that recovering from the
loss of some chunks only requires a subset of the available chunks, most
of the time.

http://tracker.ceph.com/issues/7238 Fixes: #7238

Signed-off-by: Loic Dachary <>

History

#1 Updated by Loïc Dachary about 10 years ago

  • Parent task set to #4929

#2 Updated by Loïc Dachary about 10 years ago

  • Status changed from In Progress to Fix Under Review
  • % Done changed from 0 to 80

#3 Updated by Loïc Dachary about 10 years ago

  • Parent task changed from #4929 to #7266

#4 Updated by Loïc Dachary about 10 years ago

  • Status changed from Fix Under Review to In Progress

Refactoring to avoid introducing new API functions, since adding them turns out to be more complicated.

#5 Updated by Loïc Dachary about 10 years ago

  • Description updated (diff)

#6 Updated by Loïc Dachary almost 10 years ago

  • Parent task deleted (#7266)

#7 Updated by Loïc Dachary almost 10 years ago

  • Tracker changed from Subtask to Feature

#8 Updated by Loïc Dachary almost 10 years ago

  • Target version set to 0.83

#9 Updated by Loïc Dachary almost 10 years ago

  • Subject changed from erasure code : implement pyramid plugin to erasure code : implement lrc plugin

#10 Updated by Loïc Dachary almost 10 years ago

  • Subject changed from erasure code : implement lrc plugin to erasure code : implement LRC plugin

#11 Updated by Loïc Dachary almost 10 years ago

  • Description updated (diff)

#12 Updated by Loïc Dachary almost 10 years ago

  • Status changed from In Progress to Fix Under Review

#13 Updated by Loïc Dachary almost 10 years ago

  • Status changed from Fix Under Review to 7

#14 Updated by Loïc Dachary almost 10 years ago

One job died; running it again with wip-7238-lrc.yaml.

#15 Updated by Loïc Dachary almost 10 years ago

None of the failed jobs are related to erasure coded pools (the config file does not contain the string ec_, hence no erasure coded pool was created).

#16 Updated by Loïc Dachary almost 10 years ago

The job was reported as dead because it could not reach paddles, not because the job itself failed.

2014-06-12T12:50:21.894 INFO:teuthology.run:Summary data:
{description: 'rados/thrash/{clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml
    thrashers/mapgap.yaml workloads/ec-radosbench.yaml}', duration: 3391.9412620067596,
  flavor: basic, mon.a-kernel-sha1: 22001f619f29ddf66582d834223dcff4c0b74595, mon.b-kernel-sha1: 22001f619f29ddf66582d834223dcff4c0b74595,
  owner: scheduled_loic@fold, success: true}

#17 Updated by Loïc Dachary almost 10 years ago

  • Description updated (diff)
  • Status changed from 7 to Fix Under Review

#18 Updated by Ian Colle over 9 years ago

  • Target version changed from 0.83 to 0.83 cont.

#19 Updated by Samuel Just over 9 years ago

  • Target version changed from 0.83 cont. to 0.84

#20 Updated by Samuel Just over 9 years ago

  • Target version changed from 0.84 to 0.85 cont.

#21 Updated by Samuel Just over 9 years ago

  • Status changed from Fix Under Review to In Progress

#22 Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to 7

Teuthology job description:

- rados:
    clients:
    - client.0
    ec_pool: true
    erasure_code_profile:
      name: "LRCprofile" 
      plugin: "LRC" 
      ruleset-steps: "[ [ \"chooseleaf\", \"osd\", 0 ] ]" 
      layers: "[ [ \"DDc\", \"\" ] ]" 
      mapping: "DD_" 
    objects: 500
    op_weights:
      append: 45
      delete: 10
      read: 45
      write: 0
    ops: 4000

#24 Updated by Loïc Dachary over 9 years ago

Fixed a few problems; now running a firefly upgrade suite.

#26 Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)

#27 Updated by Loïc Dachary over 9 years ago

Canceled the previous job because it did not have enough OSDs to complete (the LRC rule requires a minimum of 8 for each PG). Scheduled another job.
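The minimum of 8 can be checked with a little arithmetic. The sketch below assumes that a k/m/l profile (such as the k=4, m=2, l=3 profile used in the jobs on this page) adds one local parity chunk for every group of l chunks, and that each chunk is placed on a distinct OSD. The function name is hypothetical and is not part of the plugin.

```python
def lrc_chunk_count(k: int, m: int, l: int) -> int:
    """Total chunks per object for a k/m/l profile.

    Assumption: the k + m data and coding chunks are split into groups
    of l, and each group receives one extra local parity chunk.
    """
    if (k + m) % l != 0:
        raise ValueError("k + m must be a multiple of l")
    local_parity = (k + m) // l
    return k + m + local_parity


# 4 data + 2 coding + 2 local parity chunks
print(lrc_chunk_count(4, 2, 3))  # 8
```

Under these assumptions, a PG needs at least as many OSDs as the value returned, which is why a cluster with fewer than 8 usable OSDs cannot complete the job.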

#28 Updated by Loïc Dachary over 9 years ago

There is no need to test upgrade on a plugin that does not exist in firefly.

#29 Updated by Loïc Dachary over 9 years ago

Reserved three machines and ran the following job on them:

os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
  - osd.13
  - osd.14
  - osd.15
  - osd.16
  - osd.17
  - osd.18
  - osd.19
  - osd.20
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10

#30 Updated by Loïc Dachary over 9 years ago

Fixed a bug that made the plugin incorrectly claim it could not recover when the last OSD was out; running the tests again.

#31 Updated by Loïc Dachary over 9 years ago

  • Status changed from 7 to Fix Under Review

Although thrashing tests using an LRC pool fail, I believe this is due to the size of the pool rather than the plugin itself. See http://tracker.ceph.com/issues/9209 for instance.

#32 Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to 7

#33 Updated by Loïc Dachary over 9 years ago

  • Status changed from 7 to Fix Under Review

The rados test works (no thrashing).

#34 Updated by Loïc Dachary over 9 years ago

The rados test is running with thrashing, after a rebase on master.

#35 Updated by Loïc Dachary over 9 years ago

It crashes the OSD in decode() but the plugin is silent on the reason why it refuses to decode. Adding debug information to analyze in the logs and running the tests again.

#36 Updated by Loïc Dachary over 9 years ago

IsRecoverablePredicate must not assume the object is recoverable if the first K chunks are available. This is no longer true since data chunks can be remapped.
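A minimal sketch of why the "first K chunks" shortcut breaks once data chunks are remapped. It assumes a mapping string in which `D` marks the positions holding data chunks, in the style of the `mapping` profile field; the `__DD__DD` layout and both function names are illustrative only and are not the plugin's API.

```python
def data_positions(mapping):
    """Positions that hold data chunks in a mapping string such as '__DD__DD'."""
    return {i for i, c in enumerate(mapping) if c == "D"}


def first_k_available(available, k):
    """Shortcut that is only valid when data chunks occupy positions 0..k-1."""
    return set(range(k)) <= set(available)


def all_data_available(available, mapping):
    """Mapping-aware check: are the positions that really hold data present?"""
    return data_positions(mapping) <= set(available)


mapping = "__DD__DD"          # illustrative layout with k = 4 data chunks
available = {0, 1, 2, 3}      # the first four chunks survived

print(first_k_available(available, 4))         # True
print(all_data_available(available, mapping))  # False
```

With this layout the data lives at positions 2, 3, 6 and 7, so having chunks 0 to 3 says nothing about whether the object can be read. A real predicate would also have to account for what the coding chunks can rebuild; this sketch only shows that position alone is not a safe criterion.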

#37 Updated by Loïc Dachary over 9 years ago

The above teuthology test ran successfully, with thrashosds.

2014-08-28 07:36:32,336.336 INFO:teuthology.run:Summary data:
{duration: 6015.952927112579, flavor: basic, owner: loic@dachary.org, success: true}

#38 Updated by Loïc Dachary over 9 years ago

thrashosds passed because it had enough OSDs to never be in a situation where mapping fails. When a mapping does fail, it triggers http://tracker.ceph.com/issues/9263, which comes from the fact that decode_chunks skips the layers that do not contain chunks that are to be read. This is fine most of the time, but when a layer is required to repair a chunk that will help an upper layer, it fails.
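The failure mode can be reproduced with a toy model. This is not the plugin's decode_chunks: it is a sketch that assumes each layer is a set of chunk positions plus the number of erasures it can repair (its coding chunk count), using an illustrative three-layer layout (one global layer, two local ones). All names are hypothetical.

```python
def repair(available, layers, wanted=None):
    """Repeatedly let each layer repair its missing chunks until nothing changes.

    layers: list of (member_positions, max_erasures) pairs.
    wanted: if given, layers that contain none of the wanted chunks are
            skipped, which mimics the behaviour described above.
    """
    have = set(available)
    progress = True
    while progress:
        progress = False
        for members, max_erasures in layers:
            if wanted is not None and not (members & wanted):
                continue  # layer holds no chunk to be read: skipped
            missing = members - have
            if missing and len(missing) <= max_erasures:
                have |= missing
                progress = True
    return have


# Illustrative layout: global layer, then two local layers.
LAYERS = [
    ({1, 2, 3, 5, 6, 7}, 2),  # global layer, 2 coding chunks
    ({0, 1, 2, 3}, 1),        # local layer, 1 coding chunk
    ({4, 5, 6, 7}, 1),        # local layer, 1 coding chunk
]
available = {0, 1, 3, 4, 5}   # chunks 2, 6 and 7 are lost

print(6 in repair(available, LAYERS, wanted={6}))  # False
print(6 in repair(available, LAYERS))              # True
```

In this model chunk 6 can only be rebuilt by the global layer after the first local layer has repaired chunk 2, even though that local layer does not contain chunk 6. Skipping it leaves the global layer with three erasures, one more than it can handle.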

The problem has been fixed and the thrasher is running again, this time with a number of OSDs that has been verified to create bad mappings:

os_type: ubuntu
os_version: '14.04'
overrides:
  ceph:
    conf:
      global:
        osd heartbeat grace: 40
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
        mon warn on legacy crush tunables: false
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
    log-whitelist:
    - slow request
    - scrub mismatch
    - ScrubResult
  ceph-deploy:
    branch:
      dev: next
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      branch: wip-7238-lrc-plugin
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
  - osd.3
- - mon.b
  - mon.c
  - osd.4
  - osd.5
  - osd.6
  - osd.7
- - client.0
  - osd.8
  - osd.9
  - osd.10
  - osd.11
  - osd.12
suite_path: /home/loic/software/ceph/ceph-qa-suite
tasks:
- install:
    branch: wip-7238-lrc-plugin
- ceph:
    fs: xfs
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    erasure_code_profile:
      name: LRCprofile
      plugin: LRC
      k: 4
      m: 2
      l: 3
      ruleset-failure-domain: osd
    op_weights:
      read: 45
      write: 0
      append: 45
      delete: 10

#39 Updated by Loïc Dachary over 9 years ago

The above teuthology test ran successfully, with thrashosds.

2014-08-29 07:13:31,922.922 INFO:teuthology.run:Summary data:
{duration: 5186.044275045395, flavor: basic, owner: loic@dachary.org, success: true}

#40 Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 80 to 100
