Feature #9343

erasure-code: allow upgrades for lrc and isa plugins

Added by Loïc Dachary about 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version:
% Done: 100%
Source: Development
Tags:
Backport:
Reviewed:
Affected Versions:
Pull request ID:

Description

When upgrading from Firefly to Giant, an erasure coded pool using either of the two newly supported plugins (lrc and isa) must only be created once the plugin is available cluster wide. The general solution is addressed in #7291; an interim solution is needed in the meantime.


Related issues

Related to Ceph - Feature #7291: EC: add mechanism for mon to detect and whitelist EC plugins which are globally available New
Copied to Ceph - Feature #10887: erasure-code: allow upgrades for shec plugins Resolved 09/04/2014

Associated revisions

Revision 25d25370 (diff)
Added by Loic Dachary about 9 years ago

erasure-code: CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 integration tests

http://tracker.ceph.com/issues/9343 Refs: #9343

Signed-off-by: Loic Dachary <>

Revision 9687150c (diff)
Added by Loic Dachary about 9 years ago

erasure-code: isa/lrc plugin feature

There are two new plugins (isa and lrc). When upgrading a cluster, there
must be a protection against the following scenario:

  • the mons are upgraded but not the OSDs
  • a new pool is created using the isa plugin
  • the OSDs fail to load the isa plugin because they have not been
    upgraded

A feature bit is added: PLUGINS_V2. The monitor will only agree to
create an erasure code profile for the isa or lrc plugin if all OSDs
support PLUGINS_V2. Once such an erasure code profile is stored in the
OSDMap, an OSD can only boot if it supports the PLUGINS_V2 feature,
which means it is able to load the isa and lrc plugins.

The monitors will only activate the PLUGINS_V2 feature if all monitors
in the quorum support it. This protects against the following scenario:

  • the leader is upgraded but the peons are not
  • the leader creates a pool with plugin=lrc because all OSDs have
    the PLUGINS_V2 feature
  • the leader goes down and a non-upgraded peon becomes the leader
  • an old OSD tries to join the cluster
  • the new leader lets the OSD boot because it does not contain
    the logic that would exclude it
  • the old OSD fails when required to load the lrc plugin

This is going to be needed each time new plugins are added, which is
impractical. More generic plugin upgrade support should be added
instead, as described in http://tracker.ceph.com/issues/7291.

http://tracker.ceph.com/issues/9343 Refs: #9343

Signed-off-by: Loic Dachary <>
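For illustration only, here is a minimal C++ sketch of the two checks described in the commit message above. It is not the actual Ceph code: the function names, their parameters, and the bit position assigned to CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 are all hypothetical; only the feature name itself comes from the revisions on this page.

#include <cstdint>
#include <iostream>
#include <set>
#include <string>

// Feature bit named in the revisions above; the bit position here is
// illustrative, not the value defined in ceph_features.h.
static const uint64_t CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 = 1ULL << 50;

// Monitor-side check (hypothetical helper, not the real OSDMonitor code):
// only allow an isa/lrc erasure code profile when every mon in the quorum
// and every OSD report the PLUGINS_V2 feature.
bool can_create_erasure_code_profile(const std::string& plugin,
                                     uint64_t quorum_features,
                                     uint64_t osd_min_features) {
  static const std::set<std::string> v2_plugins = {"isa", "lrc"};
  if (v2_plugins.count(plugin) == 0)
    return true;  // pre-existing plugins (e.g. jerasure) need no new feature
  const uint64_t f = CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2;
  return (quorum_features & f) && (osd_min_features & f);
}

// OSD boot gate (also hypothetical): once a PLUGINS_V2 profile is stored in
// the OSDMap, an OSD that lacks the feature must not be allowed to boot,
// otherwise it would later fail to load the isa/lrc plugin.
bool osd_may_boot(uint64_t osd_features, bool osdmap_has_v2_profile) {
  return !osdmap_has_v2_profile ||
         (osd_features & CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2);
}

int main() {
  uint64_t old_osd = 0;                                    // not upgraded
  uint64_t new_osd = CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2; // upgraded
  std::cout << can_create_erasure_code_profile("lrc", new_osd, old_osd)  // 0
            << can_create_erasure_code_profile("lrc", new_osd, new_osd)  // 1
            << osd_may_boot(old_osd, true)                               // 0
            << "\n";
}

The point of the two-part gate is that the restriction is enforced both when the profile is created (monitor side) and afterwards when an OSD tries to join (boot side), so a non-upgraded OSD can never end up serving a pool whose plugin it cannot load.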

Revision 75ee20da (diff)
Added by Loic Dachary almost 9 years ago

erasure-code: CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 integration tests

http://tracker.ceph.com/issues/9343 Refs: #9343

Signed-off-by: Loic Dachary <>

History

#1 Updated by Loïc Dachary about 9 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Loïc Dachary about 9 years ago

  • Description updated (diff)

#4 Updated by Loïc Dachary about 9 years ago

  • Status changed from In Progress to 7
  • % Done changed from 50 to 90

#5 Updated by Loïc Dachary about 9 years ago

Running monthrash against master to get a baseline and see if these errors are related to the changes in this branch. Got failures that are unrelated to the test suite; waiting until they are fixed.

#6 Updated by Loïc Dachary about 9 years ago

Scheduled a monthrash run as teuthology seems to be running fine at the moment. It can be compared with the results of another monthrash currently running to figure out what failures to expect.

#8 Updated by Loïc Dachary about 9 years ago

The monthrash against giant was 100% successful. Investigating and rebuilding the branch with the patch after a rebase (it was ~150 commits behind).

#9 Updated by Loïc Dachary about 9 years ago

The logs of the failed test show

2014-09-19T18:23:26.329 INFO:tasks.workunit.client.0.plana34.stderr:+ ceph_test_rados_api_io
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:Running main() from gtest_main.cc
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[==========] Running 43 tests from 4 test cases.
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[----------] Global test environment set-up.
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[----------] 11 tests from LibRadosIo
2014-09-19T18:23:27.508 INFO:tasks.workunit.client.0.plana34.stdout:[ RUN      ] LibRadosIo.SimpleWrite
2014-09-19T18:23:29.018 INFO:tasks.workunit.client.0.plana34.stdout:[       OK ] LibRadosIo.SimpleWrite (1509 ms)
2014-09-19T18:23:29.018 INFO:tasks.workunit.client.0.plana34.stdout:[ RUN      ] LibRadosIo.ReadTimeout
2014-09-19T18:23:29.142 INFO:tasks.workunit.client.0.plana34.stderr:Segmentation fault (core dumped)

The core file from ubuntu@teuthology:/a/ubuntu-2014-09-19_04:50:17-rados:monthrash-wip-9343-erasure-code-feature-testing-basic-multi/497498/remote/plana34/coredump/* shows the crash is from ceph_test_rados_api_io, which is #9508

#10 Updated by Loïc Dachary about 9 years ago

  • Status changed from 7 to Fix Under Review

Rebased the pull request against giant https://github.com/ceph/ceph/pull/2551

#11 Updated by Loïc Dachary about 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 90 to 100

#12 Updated by Loïc Dachary about 9 years ago

  • Target version changed from 0.88 to 0.86
