Fix #10363: OSDMonitor setcrushmap tests take a long time on erasure coded rulesets

Fix #10363

closed

OSDMonitor setcrushmap tests take a long time on erasure coded rulesets

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Loïc Dachary

Category:

Monitor

% Done:

100%

Source:

other

Description

When a new crushmap is set, http://workbench.dachary.org/ceph/ceph/blob/giant/src/mon/OSDMonitor.cc#L4007 tests it by trying to map every number of items from min_size to max_size for each ruleset. The default erasure code ruleset is:

rule erasure-code {
    ruleset 6
    type erasure
    min_size 3
    max_size 20
    step set_chooseleaf_tries 5
    step take default
    step chooseleaf indep 0 type host
    step emit
}

In a cluster with too few OSDs, each attempt to map more OSDs than are available exhausts all retries (50), which turns out to be expensive. In a cluster with 9 OSDs, the test takes about 5 seconds:

$ time crushtool -i /tmp/crushhost --test --show-bad-mappings --rule 6 
user    0m4.921s
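
The arithmetic behind this can be sketched roughly as follows. This is only a simplified model, not the CRUSH algorithm: it assumes 9 usable hosts (one OSD per host, which the report does not state) and that every slot that cannot be filled burns the full retry budget once per input. The function names are hypothetical.

```python
def unsatisfiable_sizes(min_size, max_size, available):
    """Sizes in [min_size, max_size] that ask for more items than exist."""
    return [n for n in range(min_size, max_size + 1) if n > available]


def rough_wasted_attempts(min_size, max_size, available, tries):
    """Simplified cost model (an assumption, not measured behaviour):
    each slot that cannot be filled consumes `tries` attempts, per input."""
    missing_slots = sum(
        n - available
        for n in unsatisfiable_sizes(min_size, max_size, available)
    )
    return missing_slots * tries


if __name__ == "__main__":
    # Values from the ruleset above: min_size 3, max_size 20, 9 items available.
    print(unsatisfiable_sizes(3, 20, 9))
    # Retry budget of 50, as quoted in the description.
    print(rough_wasted_attempts(3, 20, 9, 50))
```

Under these assumptions, 11 of the 18 tested sizes (10 through 20) can never be fully mapped, so most of the test time is spent on mappings that are bound to fail.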

Since the test blocks the MON leader, a few erasure coded rulesets will block the monitor long enough to exceed the timeouts and trigger an election.
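
One generic way to keep a slow check from blocking its caller indefinitely is to run it in a child process with a deadline. The sketch below only illustrates that idea and is not taken from the fix for this issue. `run_bounded` is a hypothetical helper, and a sleeping Python process stands in for the slow test.

```python
import subprocess
import sys


def run_bounded(cmd, timeout_s):
    """Run cmd; return (finished, returncode). finished is False on timeout."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False, None
    return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for a slow test: a child process that sleeps for 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_bounded(slow, 0.5))
```

The caller then decides what a timeout means (reject the map, accept it untested, or retry), which is a policy choice this sketch leaves open.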

Updated by Loïc Dachary over 9 years ago

Tracker changed from Bug to Fix
Assignee set to Loïc Dachary

Updated by Loïc Dachary over 9 years ago

Backport set to firefly,giant

Updated by Yann Dupont over 9 years ago

Confirmed. Things can get even worse when non-default retries are set on some rules (for example, step set_choose_tries 200). This can lead to an election storm between monitors.

Updated by Loïc Dachary over 9 years ago

Status changed from 12 to Fix Under Review

https://github.com/ceph/ceph/pull/3194

Updated by Loïc Dachary about 9 years ago

Status changed from Fix Under Review to Resolved
% Done changed from 0 to 100
Backport deleted (~~firefly,giant~~)

Removing the backport: this really is an optimization that does not qualify for backports. It probably would if people were complaining about it, but that does not seem to be the case.
