Feature #9343
erasure-code: allow upgrades for lrc and isa plugins
Description
When upgrading from Firefly to Giant, an erasure coded pool using one of the two newly supported plugins (lrc and isa) must only be created once the plugin is available cluster wide. The general solution is addressed in #7291; an interim solution is needed in the meantime.
Associated revisions
erasure-code: CEPH_FEATURE_ERASURE_CODE_PLUGINS_V2 integration tests
http://tracker.ceph.com/issues/9343 Refs: #9343
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
erasure-code: isa/lrc plugin feature
There are two new plugins (isa and lrc). When upgrading a cluster, there
must be a protection against the following scenario:
- the mons are upgraded but not the OSDs
- a new pool is created using the isa plugin
- the OSDs fail to load the isa plugin because they have not been
upgraded
A feature bit is added: PLUGINS_V2. The monitor will only agree to
create an erasure code profile for the isa or lrc plugin if all OSDs
support PLUGINS_V2. Once such an erasure code profile is stored in the
OSDMap, an OSD can only boot if it supports the PLUGINS_V2 feature,
which means it is able to load the isa and lrc plugins.
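To make the first check concrete, here is a minimal C++ sketch of the monitor-side gating, using assumed names (FEATURE_ERASURE_CODE_PLUGINS_V2 as the bit value, OsdInfo, can_create_profile); it illustrates the logic described above rather than reproducing the actual Ceph implementation:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the feature bit; the exact value is an assumption.
constexpr uint64_t FEATURE_ERASURE_CODE_PLUGINS_V2 = 1ULL << 50;

struct OsdInfo {
  uint64_t features;  // feature bits the OSD advertised when it booted
};

// The monitor refuses to create an isa/lrc erasure code profile unless
// every known OSD advertises the new feature bit.
bool can_create_profile(const std::string& plugin,
                        const std::vector<OsdInfo>& osds) {
  if (plugin != "isa" && plugin != "lrc")
    return true;  // pre-existing plugins need no special support
  for (const auto& osd : osds) {
    if (!(osd.features & FEATURE_ERASURE_CODE_PLUGINS_V2))
      return false;  // at least one OSD could not load the new plugins
  }
  return true;
}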
The monitors will only activate the PLUGINS_V2 feature if all monitors
in the quorum support it. This protects against the following scenario
(a sketch of the resulting boot-time check follows the list):
- the leader is upgraded but the peons are not
- the leader creates a pool with plugin=lrc because all OSDs have the PLUGINS_V2 feature
- the leader goes down and a non upgraded peon becomes the leader
- an old OSD tries to join the cluster
- the new leader lets the OSD boot because it does not contain the logic that would exclude it
- the old OSD fails when required to load the lrc plugin
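A matching sketch of the boot-time check, again with hypothetical names (OsdMapView, allow_boot): once an isa or lrc profile is stored in the OSDMap the map requires the feature bit, and an OSD that lacks it is refused at boot. Activating this logic only when every monitor in the quorum supports PLUGINS_V2 is what closes the leader-failover hole described above:

#include <cstdint>

// Same assumed bit value as in the previous sketch.
constexpr uint64_t FEATURE_ERASURE_CODE_PLUGINS_V2 = 1ULL << 50;

struct OsdMapView {
  bool has_isa_or_lrc_profile;  // true once such a profile is in the map
  uint64_t required_features() const {
    return has_isa_or_lrc_profile ? FEATURE_ERASURE_CODE_PLUGINS_V2 : 0;
  }
};

// Called when an OSD asks to boot; osd_features are the bits it advertised.
// An OSD missing any required bit is not allowed to join the cluster.
bool allow_boot(const OsdMapView& map, uint64_t osd_features) {
  const uint64_t need = map.required_features();
  return (osd_features & need) == need;
}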
This is going to be needed each time new plugins are added, which is
impractical. More generic plugin upgrade support should be added
instead, as described in http://tracker.ceph.com/issues/7291.
http://tracker.ceph.com/issues/9343 Refs: #9343
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
History
#1 Updated by Loïc Dachary about 9 years ago
- Priority changed from Normal to Urgent
#2 Updated by Loïc Dachary about 9 years ago
- Description updated (diff)
#3 Updated by Loïc Dachary about 9 years ago
#4 Updated by Loïc Dachary about 9 years ago
- Status changed from In Progress to 7
- % Done changed from 50 to 90
Running the monthrash suite.
#5 Updated by Loïc Dachary about 9 years ago
Running monthrash against master to get a baseline and see if these errors are related to the changes in this branch. Got failures that are unrelated to the test suite; waiting until they are fixed.
#6 Updated by Loïc Dachary about 9 years ago
Scheduled a monthrash as teuthology seems to be running fine at the moment. It can be compared with the results of another monthrash currently running to figure out what failures to expect.
#7 Updated by Loïc Dachary about 9 years ago
What was supposed to be the baseline failed more than the monthrash scheduled for this fix. Running a monthrash against giant.
#8 Updated by Loïc Dachary about 9 years ago
The monthrash against giant was 100% successful. Investigating, and rebuilding the branch with the patch after a rebase (it was ~150 commits behind).
#9 Updated by Loïc Dachary about 9 years ago
The logs of the failed test show:
2014-09-19T18:23:26.329 INFO:tasks.workunit.client.0.plana34.stderr:+ ceph_test_rados_api_io
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:Running main() from gtest_main.cc
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[==========] Running 43 tests from 4 test cases.
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[----------] Global test environment set-up.
2014-09-19T18:23:26.335 INFO:tasks.workunit.client.0.plana34.stdout:[----------] 11 tests from LibRadosIo
2014-09-19T18:23:27.508 INFO:tasks.workunit.client.0.plana34.stdout:[ RUN      ] LibRadosIo.SimpleWrite
2014-09-19T18:23:29.018 INFO:tasks.workunit.client.0.plana34.stdout:[       OK ] LibRadosIo.SimpleWrite (1509 ms)
2014-09-19T18:23:29.018 INFO:tasks.workunit.client.0.plana34.stdout:[ RUN      ] LibRadosIo.ReadTimeout
2014-09-19T18:23:29.142 INFO:tasks.workunit.client.0.plana34.stderr:Segmentation fault (core dumped)
And the core dump under ubuntu@teuthology:/a/ubuntu-2014-09-19_04:50:17-rados:monthrash-wip-9343-erasure-code-feature-testing-basic-multi/497498/remote/plana34/coredump/* shows it comes from ceph_test_rados_api_io, which is #9508.
#10 Updated by Loïc Dachary about 9 years ago
- Status changed from 7 to Fix Under Review
Rebased the pull request against giant https://github.com/ceph/ceph/pull/2551
#11 Updated by Loïc Dachary about 9 years ago
- Status changed from Fix Under Review to Resolved
- % Done changed from 90 to 100
#12 Updated by Loïc Dachary about 9 years ago
- Target version changed from 0.88 to 0.86