Fix #10363

closed

OSDMonitor setcrushmap tests take a long time on erasure coded rulesets

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Monitor
Target version: -
% Done: 100%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://workbench.dachary.org/ceph/ceph/blob/giant/src/mon/OSDMonitor.cc#L4007 runs tests that, for each ruleset, try to map every number of items from min_size to max_size. The default erasure code ruleset is:

rule erasure-code {
    ruleset 6
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step chooseleaf indep 0 type host
    step emit
}

In a cluster with too few OSDs, each attempt to map more OSDs than are available exhausts all retries (50), which turns out to be expensive. In a cluster with 9 OSDs, the test takes about 5 seconds:
$ time crushtool -i /tmp/crushhost --test --show-bad-mappings --rule 6
user    0m4.921s
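To illustrate why sizes above the number of available hosts are so costly, here is a toy Python model. It is not Ceph code and does not reproduce CRUSH. It assumes one OSD per host (so the 9-OSD cluster has 9 hosts), 1024 test inputs per size and 50 retry rounds. `toy_hash`, `toy_chooseleaf_indep` and `check_rule` are hypothetical names, and the hash is a generic 32-bit mixer rather than CRUSH's. The model maps every size from min_size to max_size and counts hash draws as a rough proxy for CPU time.

```python
# Toy model of the setcrushmap mapping check; NOT Ceph code.
# All names below are hypothetical; the hash is a generic mixer, not CRUSH's.


def toy_hash(x: int, r: int) -> int:
    """Deterministic 32-bit mixer standing in for CRUSH's hash."""
    h = (x * 0x9E3779B1 + r * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 16
    h = (h * 0x7FEB352D) & 0xFFFFFFFF
    h ^= h >> 15
    h = (h * 0x846CA68B) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def toy_chooseleaf_indep(x: int, num_rep: int, hosts: int, total_tries: int):
    """Fill num_rep slots with distinct hosts, retrying empty slots each round.

    Returns (slots, draws): slots holds None where no host could be placed,
    and draws counts hash evaluations (a rough proxy for CPU time).
    """
    slots = [None] * num_rep
    used = set()
    draws = 0
    for ftotal in range(total_tries):
        if None not in slots:
            break  # every slot is filled: stop retrying early
        for rep in range(num_rep):
            if slots[rep] is not None:
                continue
            draws += 1
            host = toy_hash(x, rep + ftotal * num_rep) % hosts
            if host not in used:  # each slot needs a distinct host
                used.add(host)
                slots[rep] = host
    return slots, draws


def check_rule(min_size: int, max_size: int, hosts: int,
               total_tries: int = 50, num_x: int = 1024):
    """Map num_x inputs for every size in [min_size, max_size].

    Returns {num_rep: (bad_mappings, draws)}.
    """
    report = {}
    for num_rep in range(min_size, max_size + 1):
        bad = draws = 0
        for x in range(num_x):
            slots, d = toy_chooseleaf_indep(x, num_rep, hosts, total_tries)
            draws += d
            if None in slots:
                bad += 1
        report[num_rep] = (bad, draws)
    return report


if __name__ == "__main__":
    # 9 hosts (assumed one OSD each) and the min_size/max_size of rule 6.
    for num_rep, (bad, draws) in check_rule(3, 20, hosts=9, num_x=128).items():
        print(f"num_rep={num_rep:2d} bad={bad:4d} draws={draws}")
```

In this model, sizes at or below the host count need only a handful of draws per input. Every size above it fails for every input and burns all 50 rounds, so it costs at least 50 × (num_rep - 9) draws per input, because each round re-draws the slots that can never be filled.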

Since the test blocks the MON leader, a few erasure coded rulesets are enough to block the monitor long enough to exceed the timeouts and trigger an election.

#1

Updated by Loïc Dachary over 9 years ago

  • Tracker changed from Bug to Fix
  • Assignee set to Loïc Dachary
#2

Updated by Loïc Dachary over 9 years ago

  • Backport set to firefly,giant
#3

Updated by Yann Dupont over 9 years ago

Confirmed. Things can get even worse when non-default retries are set on some rules (for example, step set_choose_tries 200). This can lead to an election storm between monitors.
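A rough way to see why raising the retries hurts is a lower bound on the wasted work. This is a sketch under stated assumptions, not a measurement: it assumes one OSD per host, 1024 test inputs per size, and a number of retry rounds equal to the configured tries. `min_wasted_draws` is a hypothetical helper. A mapping that can never succeed runs every round, so the wasted work grows linearly with the tries.

```python
def min_wasted_draws(min_size: int, max_size: int, hosts: int,
                     tries: int, num_x: int) -> int:
    """Lower bound on hash draws spent on mappings that can never succeed.

    Hypothetical helper: for every size above the number of hosts, each of
    the num_x inputs runs all `tries` rounds, and each round re-draws at least
    the (num_rep - hosts) slots that can never hold a distinct host.
    """
    return sum(tries * (num_rep - hosts) * num_x
               for num_rep in range(min_size, max_size + 1)
               if num_rep > hosts)


# Rule 6 limits (3..20) on 9 hosts; num_x = 1024 inputs is an assumption.
default = min_wasted_draws(3, 20, 9, tries=50, num_x=1024)
raised = min_wasted_draws(3, 20, 9, tries=200, num_x=1024)
print(default, raised, raised / default)  # 3379200 13516800 4.0
```

Under these assumptions, going from 50 to 200 tries multiplies the unavoidable work by four for a rule sized like rule 6. That is consistent with the monitor being blocked long enough to trigger repeated elections.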

#4

Updated by Loïc Dachary over 9 years ago

  • Status changed from 12 to Fix Under Review
#5

Updated by Loïc Dachary about 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 0 to 100
  • Backport deleted (firefly,giant)

Removing the backport: this is really an optimization that does not qualify for backports. It probably would if people were complaining about it, but that does not seem to be the case.