Bug #9492
closed
Crush Mapper crashes when number of replicas is less than total number of osds to be selected.
100%
Description
1. ./crushtool --outfn crushmap --build --num_osds 100 host straw 4 rack straw 10 default straw 0
2. ./crushtool -d crushmap -o crushmap.txt
3. Add to crushmap.txt
rule myrule {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 4 type host
step emit
}
4. ./crushtool -c crushmap.txt -o crushmap
5. ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep=3
(The same crash occurs with --num-rep=1 or --num-rep=2.)
Updated by Johnu George over 9 years ago
Seg fault log:
CRUSH*** Caught signal (Segmentation fault)
in thread 7f3dcb0007c0
ceph version 0.85-778-gb285788 (b285788c56f8be53eb51204ea3154c49c577d337)
1: ./crushtool() [0x4f74ca]
2: (()+0x10340) [0x7f3dca9cd340]
3: ./crushtool() [0x5ba2d0]
4: (crush_do_rule()+0x236) [0x5bad96]
5: (CrushTester::test()+0xcc7) [0x513b47]
6: (main()+0xda3) [0x4eccb3]
7: (__libc_start_main()+0xf5) [0x7f3dc92f3ec5]
8: ./crushtool() [0x4f13b7]
2014-09-16 16:46:37.310562 7f3dcb0007c0 -1 Caught signal (Segmentation fault) *
in thread 7f3dcb0007c0
ceph version 0.85-778-gb285788 (b285788c56f8be53eb51204ea3154c49c577d337)
1: ./crushtool() [0x4f74ca]
2: (()+0x10340) [0x7f3dca9cd340]
3: ./crushtool() [0x5ba2d0]
4: (crush_do_rule()+0x236) [0x5bad96]
5: (CrushTester::test()+0xcc7) [0x513b47]
6: (main()+0xda3) [0x4eccb3]
7: (__libc_start_main()+0xf5) [0x7f3dc92f3ec5]
8: ./crushtool() [0x4f13b7]
Updated by Johnu George over 9 years ago
The issue is that the crush temporary buffers (the scratch array) are sized according to the num_rep value configured by the user. When the rule selects more osds than there are replicas, the buffers overlap, which causes the crash.
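To make the mismatch concrete, here is a minimal sketch (a toy model, not Ceph code) that estimates how many items a chain of firstn steps can select and compares that with num_rep. The function names are hypothetical, and the handling of firstn 0 and negative firstn values follows my understanding of the documented CRUSH semantics (0 means num_rep, a negative value means num_rep plus that value). It only models a single take ... emit chain.

```python
# Illustrative model only; this is not Ceph code.

def max_items_selected(firstn_counts, num_rep):
    """Upper bound on the items a chain of firstn steps can select.

    firstn_counts holds the N of each 'choose/chooseleaf firstn N' step,
    in rule order. N == 0 means num_rep; N < 0 means num_rep + N
    (assumed semantics, see the lead-in).
    """
    width = 1  # 'step take default' starts from a single bucket
    for n in firstn_counts:
        per_input = n if n > 0 else num_rep + n
        width *= max(per_input, 0)
    return width


def report(firstn_counts, num_rep):
    """Describe whether the rule can select more items than num_rep."""
    need = max_items_selected(firstn_counts, num_rep)
    verdict = "exceeds num_rep" if need > num_rep else "fits"
    return f"num_rep={num_rep}: rule may select up to {need} osds ({verdict})"


# myrule from the description: firstn 2 racks, then firstn 4 hosts per rack.
for num_rep in (1, 2, 3, 10):
    print(report([2, 4], num_rep))
```

For myrule the bound is 2 * 4 = 8 regardless of num_rep, so the values 1, 2 and 3 used in the description all fall short of what the rule can select.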
Updated by Loïc Dachary over 9 years ago
- Category set to 10
- Status changed from New to Fix Under Review
Updated by Loïc Dachary over 9 years ago
Running in debug mode with https://github.com/ceph/ceph/pull/2568 (using the crushmap created as in the description):
$ valgrind --tool=memcheck ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 2
==8349== Memcheck, a memory error detector
==8349== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==8349== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==8349== Command: ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 2
==8349==
rule 1 (myrule), x = 1..10, numrep = 2..2
CRUSHCHOOSE bucket -29 x 1 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=1 r=0 item -28 type 2
CHOOSE got -28
crush_bucket_choose -29 x=1 r=1 item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 1 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -28 x=1 r=0 item -24 type 1
CHOOSE bucket -24 x 1 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -24 x=1 r=0 item 93 type 0
CHOOSE got 93
CHOOSE returns 1
CHOOSE got -24
crush_bucket_choose -28 x=1 r=1 item -21 type 1
CHOOSE bucket -21 x 1 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -21 x=1 r=1 item 82 type 0
CHOOSE got 82
CHOOSE returns 2
CHOOSE got -21
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 1 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 1 [93,82]
CRUSHCHOOSE bucket -29 x 2 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=2 r=0 item -28 type 2
CHOOSE got -28
crush_bucket_choose -29 x=2 r=1 item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 2 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -28 x=2 r=0 item -21 type 1
CHOOSE bucket -21 x 2 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -21 x=2 r=0 item 81 type 0
CHOOSE got 81
CHOOSE returns 1
CHOOSE got -21
crush_bucket_choose -28 x=2 r=1 item -22 type 1
CHOOSE bucket -22 x 2 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -22 x=2 r=1 item 86 type 0
CHOOSE got 86
CHOOSE returns 2
CHOOSE got -22
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 2 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 2 [81,86]
CRUSHCHOOSE bucket -29 x 3 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=3 r=0 item -27 type 2
CHOOSE got -27
crush_bucket_choose -29 x=3 r=1 item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 3 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -27 x=3 r=0 item -16 type 1
CHOOSE bucket -16 x 3 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -16 x=3 r=0 item 63 type 0
CHOOSE got 63
CHOOSE returns 1
CHOOSE got -16
crush_bucket_choose -27 x=3 r=1 item -16 type 1
reject 0 collide 1 ftotal 1 flocal 1
crush_bucket_choose -27 x=3 r=2 item -14 type 1
CHOOSE bucket -14 x 3 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -14 x=3 r=1 item 54 type 0
CHOOSE got 54
CHOOSE returns 2
CHOOSE got -14
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 3 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 3 [63,54]
CRUSHCHOOSE bucket -29 x 4 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=4 r=0 item -27 type 2
CHOOSE got -27
crush_bucket_choose -29 x=4 r=1 item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 4 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -27 x=4 r=0 item -12 type 1
CHOOSE bucket -12 x 4 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -12 x=4 r=0 item 45 type 0
CHOOSE got 45
CHOOSE returns 1
CHOOSE got -12
crush_bucket_choose -27 x=4 r=1 item -14 type 1
CHOOSE bucket -14 x 4 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -14 x=4 r=1 item 52 type 0
CHOOSE got 52
CHOOSE returns 2
CHOOSE got -14
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 4 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 4 [45,52]
CRUSHCHOOSE bucket -29 x 5 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=5 r=0 item -28 type 2
CHOOSE got -28
crush_bucket_choose -29 x=5 r=1 item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 5 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -28 x=5 r=0 item -25 type 1
CHOOSE bucket -25 x 5 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -25 x=5 r=0 item 96 type 0
CHOOSE got 96
CHOOSE returns 1
CHOOSE got -25
crush_bucket_choose -28 x=5 r=1 item -22 type 1
CHOOSE bucket -22 x 5 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -22 x=5 r=1 item 84 type 0
CHOOSE got 84
CHOOSE returns 2
CHOOSE got -22
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 5 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 5 [96,84]
CRUSHCHOOSE bucket -29 x 6 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=6 r=0 item -28 type 2
CHOOSE got -28
crush_bucket_choose -29 x=6 r=1 item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 6 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -28 x=6 r=0 item -22 type 1
CHOOSE bucket -22 x 6 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -22 x=6 r=0 item 86 type 0
CHOOSE got 86
CHOOSE returns 1
CHOOSE got -22
crush_bucket_choose -28 x=6 r=1 item -23 type 1
CHOOSE bucket -23 x 6 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -23 x=6 r=1 item 90 type 0
CHOOSE got 90
CHOOSE returns 2
CHOOSE got -23
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 6 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 6 [86,90]
CRUSHCHOOSE bucket -29 x 7 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=7 r=0 item -26 type 2
CHOOSE got -26
crush_bucket_choose -29 x=7 r=1 item -28 type 2
CHOOSE got -28
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 7 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -26 x=7 r=0 item -10 type 1
CHOOSE bucket -10 x 7 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -10 x=7 r=0 item 38 type 0
CHOOSE got 38
CHOOSE returns 1
CHOOSE got -10
crush_bucket_choose -26 x=7 r=1 item -2 type 1
CHOOSE bucket -2 x 7 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -2 x=7 r=1 item 4 type 0
CHOOSE got 4
CHOOSE returns 2
CHOOSE got -2
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 7 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 7 [38,4]
CRUSHCHOOSE bucket -29 x 8 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=8 r=0 item -26 type 2
CHOOSE got -26
crush_bucket_choose -29 x=8 r=1 item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 8 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -26 x=8 r=0 item -6 type 1
CHOOSE bucket -6 x 8 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -6 x=8 r=0 item 22 type 0
CHOOSE got 22
CHOOSE returns 1
CHOOSE got -6
crush_bucket_choose -26 x=8 r=1 item -2 type 1
CHOOSE bucket -2 x 8 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -2 x=8 r=1 item 6 type 0
CHOOSE got 6
CHOOSE returns 2
CHOOSE got -2
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 8 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 8 [22,6]
CRUSHCHOOSE bucket -29 x 9 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=9 r=0 item -26 type 2
CHOOSE got -26
crush_bucket_choose -29 x=9 r=1 item -28 type 2
CHOOSE got -28
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 9 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -26 x=9 r=0 item -4 type 1
CHOOSE bucket -4 x 9 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -4 x=9 r=0 item 12 type 0
CHOOSE got 12
CHOOSE returns 1
CHOOSE got -4
crush_bucket_choose -26 x=9 r=1 item -4 type 1
reject 0 collide 1 ftotal 1 flocal 1
crush_bucket_choose -26 x=9 r=2 item -1 type 1
CHOOSE bucket -1 x 9 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -1 x=9 r=1 item 1 type 0
CHOOSE got 1
CHOOSE returns 2
CHOOSE got -1
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 9 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 9 [12,1]
CRUSHCHOOSE bucket -29 x 10 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -29 x=10 r=0 item -26 type 2
CHOOSE got -26
crush_bucket_choose -29 x=10 r=1 item -26 type 2
reject 0 collide 1 ftotal 1 flocal 1
crush_bucket_choose -29 x=10 r=2 item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 10 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -26 x=10 r=0 item -6 type 1
CHOOSE bucket -6 x 10 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -6 x=10 r=0 item 23 type 0
CHOOSE got 23
CHOOSE returns 1
CHOOSE got -6
crush_bucket_choose -26 x=10 r=1 item -8 type 1
CHOOSE bucket -8 x 10 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
crush_bucket_choose -8 x=10 r=1 item 28 type 0
CHOOSE got 28
CHOOSE returns 2
CHOOSE got -8
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 10 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
rule 1 x 10 [23,28]
rule 1 (myrule) num_rep 2 result size == 2: 10/10
==8349==
==8349== HEAP SUMMARY:
==8349== in use at exit: 17,936 bytes in 219 blocks
==8349== total heap usage: 884 allocs, 665 frees, 57,490 bytes allocated
==8349==
==8349== LEAK SUMMARY:
==8349== definitely lost: 0 bytes in 0 blocks
==8349== indirectly lost: 0 bytes in 0 blocks
==8349== possibly lost: 6,376 bytes in 147 blocks
==8349== still reachable: 11,560 bytes in 72 blocks
==8349== suppressed: 0 bytes in 0 blocks
==8349== Rerun with --leak-check=full to see details of leaked memory
==8349==
==8349== For counts of detected and suppressed errors, rerun with: -v
==8349== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
Updated by Johnu George over 9 years ago
Ran valgrind with the patch; no errors were found across different combinations of num_rep and the number of osds selected by the rule. Rules tested:
rule myrule {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 4 type host
step emit
}
rule myrule2 {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 1 type host
step emit
}
rule myrule3 {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step choose firstn 3 type rack
step chooseleaf firstn 1 type host
step emit
}
rule myrule4 {
ruleset 4
type replicated
min_size 1
max_size 10
step take default
step choose firstn 4 type rack
step chooseleaf firstn 1 type host
step emit
}
rule myrule5 {
ruleset 5
type replicated
min_size 1
max_size 10
step take default
step choose firstn 4 type rack
step chooseleaf firstn 0 type host
step emit
}
rule myrule6 {
ruleset 6
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn -1 type host
step emit
}
rule myrule7 {
ruleset 7
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 0 type host
step emit
}
rule myrule8 {
ruleset 8
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 2 type host
step emit
}
rule myrule9 {
ruleset 9
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit
}
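A sweep like the one described above could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the num_rep range of 1 to 10 is an assumption taken from min_size and max_size in the rules, and it assumes the compiled crushmap containing rules 1 to 9 is in the current directory. The crushtool flags are the ones used earlier in this report; --error-exitcode is a standard valgrind option that makes memcheck errors show up as a non-zero exit status.

```python
import subprocess

# memcheck run that exits non-zero when valgrind reports errors
VALGRIND = ["valgrind", "--tool=memcheck", "--error-exitcode=1"]


def build_commands(rules=range(1, 10), num_reps=range(1, 11), crushmap="crushmap"):
    """Return one argv list per (rule, num_rep) combination."""
    commands = []
    for rule in rules:
        for num_rep in num_reps:
            commands.append(VALGRIND + [
                "./crushtool", "-i", crushmap, "--test", "--show-utilization",
                "--rule", str(rule), "--min-x", "1", "--max-x", "10",
                "--num-rep", str(num_rep),
            ])
    return commands


def run_all(commands):
    """Run each command; return the argv lists that exited non-zero."""
    failures = []
    for argv in commands:
        result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:  # memcheck error or crash (e.g. SIGSEGV)
            failures.append(argv)
    return failures
```

Calling run_all(build_commands()) from the build directory would then return the combinations that still fail, if any.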
Updated by Loïc Dachary over 9 years ago
- Status changed from Fix Under Review to Resolved
- % Done changed from 0 to 100
Updated by Loïc Dachary over 9 years ago
- Status changed from Resolved to Need More Info
What happens with indep rules?
Updated by Loïc Dachary over 9 years ago
- Status changed from Need More Info to Resolved
Updated by Loïc Dachary over 9 years ago
- Status changed from Resolved to Pending Backport
- Backport set to giant, firefly
I think both patches should be backported to giant and firefly. Would you like to do that? It essentially means you should run
git checkout -b wip-9492-crush-giant origin/giant
git cherry-pick -x <commit hash> <commit hash>
and submit a pull request against giant instead of the default, which is master. Then do the same for firefly. There should not be any conflicts because I don't think the code in this area has changed since firefly.
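Since the same two commands are repeated per release, here is a small sketch that prints them for both branches. The helper name is hypothetical, the firefly branch name is assumed by analogy with the giant one, and the commit hashes are left as placeholders to be filled in.

```python
def backport_commands(release, commit_hashes):
    """Return the git commands for backporting the given commits to a release."""
    branch = f"wip-9492-crush-{release}"  # firefly name assumed by analogy
    return [
        f"git checkout -b {branch} origin/{release}",
        "git cherry-pick -x " + " ".join(commit_hashes),
    ]


# Placeholders: substitute the real hashes of the two fixes.
for release in ("giant", "firefly"):
    for line in backport_commands(release, ["<commit hash>", "<commit hash>"]):
        print(line)
```

The pull request for each branch is then opened against that release branch rather than master.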
Updated by Johnu George over 9 years ago
Pull request info:
fix for firstn rules: https://github.com/ceph/ceph/pull/2568
fix for indep rules: https://github.com/ceph/ceph/pull/2599
Updated by Loïc Dachary over 9 years ago
- Status changed from Pending Backport to Fix Under Review
running on http://ceph.com/gitbuilder.cgi
Updated by Loïc Dachary over 9 years ago
- Status changed from Fix Under Review to Resolved
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to RADOS
- Category deleted (10)