Bug #9492

closed

Crush Mapper crashes when number of replicas is less than total number of osds to be selected.

Added by Johnu George over 9 years ago. Updated almost 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 100%
Source: other
Tags:
Backport: giant, firefly
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1. ./crushtool --outfn crushmap --build --num_osds 100 host straw 4 rack straw 10 default straw 0
2. ./crushtool -d crushmap -o crushmap.txt
3. Add to crushmap.txt
rule myrule {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 4 type host
step emit
}

4. ./crushtool -c crushmap.txt -o crushmap
5. ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep=3
(The same crash occurs with --num-rep=1 or --num-rep=2.)


Related issues 2 (0 open, 2 closed)

Related to Ceph - Bug #9490: crushtool crash if --num-rep is missing (Rejected, Loïc Dachary, 09/16/2014)
Related to Ceph - Bug #9485: Monitor crash due to wrong crush rule set (Resolved, Loïc Dachary, 09/15/2014)

Actions #1

Updated by Johnu George over 9 years ago

Seg fault log:
CRUSH*** Caught signal (Segmentation fault)
in thread 7f3dcb0007c0
ceph version 0.85-778-gb285788 (b285788c56f8be53eb51204ea3154c49c577d337)
1: ./crushtool() [0x4f74ca]
2: (()+0x10340) [0x7f3dca9cd340]
3: ./crushtool() [0x5ba2d0]
4: (crush_do_rule()+0x236) [0x5bad96]
5: (CrushTester::test()+0xcc7) [0x513b47]
6: (main()+0xda3) [0x4eccb3]
7: (__libc_start_main()+0xf5) [0x7f3dc92f3ec5]
8: ./crushtool() [0x4f13b7]
2014-09-16 16:46:37.310562 7f3dcb0007c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f3dcb0007c0

ceph version 0.85-778-gb285788 (b285788c56f8be53eb51204ea3154c49c577d337)
1: ./crushtool() [0x4f74ca]
2: (()+0x10340) [0x7f3dca9cd340]
3: ./crushtool() [0x5ba2d0]
4: (crush_do_rule()+0x236) [0x5bad96]
5: (CrushTester::test()+0xcc7) [0x513b47]
6: (main()+0xda3) [0x4eccb3]
7: (__libc_start_main()+0xf5) [0x7f3dc92f3ec5]
8: ./crushtool() [0x4f13b7]
Actions #2

Updated by Johnu George over 9 years ago

The issue is that the CRUSH temporary buffers (the scratch array) are sized according to the number of replicas (num_rep) configured by the user. When the rule selects more OSDs than there are replicas, writes run past the end of a buffer, the buffers overlap, and this causes the crash.
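
A rough way to see which rules are affected, assuming (as described above) that each temporary buffer holds num_rep entries, and that a `firstn N` step with N <= 0 resolves to num_rep + N. The function names are hypothetical, and this models only the counting, not the actual mapper code.

```python
def resolve_count(n, num_rep):
    """Resolve a 'firstn N' argument: N > 0 is literal, N <= 0 means num_rep + N."""
    return n if n > 0 else num_rep + n


def worst_case_outputs(firstn_counts, num_rep):
    """Largest number of items any step of one take..emit block may write."""
    total, worst = 1, 0
    for n in firstn_counts:
        count = resolve_count(n, num_rep)
        if count <= 0:
            continue  # modelled here as a skipped step (an assumption)
        total *= count
        worst = max(worst, total)
    return worst


def overruns_scratch(firstn_counts, num_rep):
    """True if some step may write more than num_rep items into one buffer."""
    return worst_case_outputs(firstn_counts, num_rep) > num_rep


# Rule from the description: choose firstn 2 type rack, chooseleaf firstn 4 type host.
for num_rep in (1, 2, 3, 8):
    print(num_rep, worst_case_outputs([2, 4], num_rep), overruns_scratch([2, 4], num_rep))
```

With the rule from the description, the two steps can select up to 2 * 4 = 8 OSDs, so in this model any num_rep below 8 overruns the buffers.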

Actions #3

Updated by Loïc Dachary over 9 years ago

  • Category set to 10
  • Status changed from New to Fix Under Review
Actions #4

Updated by Loïc Dachary over 9 years ago

Running in debug mode with https://github.com/ceph/ceph/pull/2568 (using the crushmap created as in the description):

$ valgrind --tool=memcheck ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 2
==8349== Memcheck, a memory error detector
==8349== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==8349== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==8349== Command: ./crushtool -i crushmap --test --show-utilization --rule 1 --min-x 1 --max-x 10 --num-rep 2
==8349== 
rule 1 (myrule), x = 1..10, numrep = 2..2
CRUSHCHOOSE bucket -29 x 1 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=1 r=0
  item -28 type 2
CHOOSE got -28
 crush_bucket_choose -29 x=1 r=1
  item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 1 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -28 x=1 r=0
  item -24 type 1
CHOOSE bucket -24 x 1 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -24 x=1 r=0
  item 93 type 0
CHOOSE got 93
CHOOSE returns 1
CHOOSE got -24
 crush_bucket_choose -28 x=1 r=1
  item -21 type 1
CHOOSE bucket -21 x 1 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -21 x=1 r=1
  item 82 type 0
CHOOSE got 82
CHOOSE returns 2
CHOOSE got -21
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 1 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 1 [93,82]
CRUSHCHOOSE bucket -29 x 2 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=2 r=0
  item -28 type 2
CHOOSE got -28
 crush_bucket_choose -29 x=2 r=1
  item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 2 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -28 x=2 r=0
  item -21 type 1
CHOOSE bucket -21 x 2 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -21 x=2 r=0
  item 81 type 0
CHOOSE got 81
CHOOSE returns 1
CHOOSE got -21
 crush_bucket_choose -28 x=2 r=1
  item -22 type 1
CHOOSE bucket -22 x 2 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -22 x=2 r=1
  item 86 type 0
CHOOSE got 86
CHOOSE returns 2
CHOOSE got -22
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 2 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 2 [81,86]
CRUSHCHOOSE bucket -29 x 3 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=3 r=0
  item -27 type 2
CHOOSE got -27
 crush_bucket_choose -29 x=3 r=1
  item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 3 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -27 x=3 r=0
  item -16 type 1
CHOOSE bucket -16 x 3 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -16 x=3 r=0
  item 63 type 0
CHOOSE got 63
CHOOSE returns 1
CHOOSE got -16
 crush_bucket_choose -27 x=3 r=1
  item -16 type 1
  reject 0  collide 1  ftotal 1  flocal 1
 crush_bucket_choose -27 x=3 r=2
  item -14 type 1
CHOOSE bucket -14 x 3 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -14 x=3 r=1
  item 54 type 0
CHOOSE got 54
CHOOSE returns 2
CHOOSE got -14
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 3 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 3 [63,54]
CRUSHCHOOSE bucket -29 x 4 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=4 r=0
  item -27 type 2
CHOOSE got -27
 crush_bucket_choose -29 x=4 r=1
  item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 4 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -27 x=4 r=0
  item -12 type 1
CHOOSE bucket -12 x 4 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -12 x=4 r=0
  item 45 type 0
CHOOSE got 45
CHOOSE returns 1
CHOOSE got -12
 crush_bucket_choose -27 x=4 r=1
  item -14 type 1
CHOOSE bucket -14 x 4 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -14 x=4 r=1
  item 52 type 0
CHOOSE got 52
CHOOSE returns 2
CHOOSE got -14
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 4 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 4 [45,52]
CRUSHCHOOSE bucket -29 x 5 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=5 r=0
  item -28 type 2
CHOOSE got -28
 crush_bucket_choose -29 x=5 r=1
  item -26 type 2
CHOOSE got -26
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 5 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -28 x=5 r=0
  item -25 type 1
CHOOSE bucket -25 x 5 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -25 x=5 r=0
  item 96 type 0
CHOOSE got 96
CHOOSE returns 1
CHOOSE got -25
 crush_bucket_choose -28 x=5 r=1
  item -22 type 1
CHOOSE bucket -22 x 5 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -22 x=5 r=1
  item 84 type 0
CHOOSE got 84
CHOOSE returns 2
CHOOSE got -22
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 5 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 5 [96,84]
CRUSHCHOOSE bucket -29 x 6 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=6 r=0
  item -28 type 2
CHOOSE got -28
 crush_bucket_choose -29 x=6 r=1
  item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 6 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -28 x=6 r=0
  item -22 type 1
CHOOSE bucket -22 x 6 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -22 x=6 r=0
  item 86 type 0
CHOOSE got 86
CHOOSE returns 1
CHOOSE got -22
 crush_bucket_choose -28 x=6 r=1
  item -23 type 1
CHOOSE bucket -23 x 6 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -23 x=6 r=1
  item 90 type 0
CHOOSE got 90
CHOOSE returns 2
CHOOSE got -23
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 6 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 6 [86,90]
CRUSHCHOOSE bucket -29 x 7 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=7 r=0
  item -26 type 2
CHOOSE got -26
 crush_bucket_choose -29 x=7 r=1
  item -28 type 2
CHOOSE got -28
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 7 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -26 x=7 r=0
  item -10 type 1
CHOOSE bucket -10 x 7 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -10 x=7 r=0
  item 38 type 0
CHOOSE got 38
CHOOSE returns 1
CHOOSE got -10
 crush_bucket_choose -26 x=7 r=1
  item -2 type 1
CHOOSE bucket -2 x 7 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -2 x=7 r=1
  item 4 type 0
CHOOSE got 4
CHOOSE returns 2
CHOOSE got -2
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 7 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 7 [38,4]
CRUSHCHOOSE bucket -29 x 8 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=8 r=0
  item -26 type 2
CHOOSE got -26
 crush_bucket_choose -29 x=8 r=1
  item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 8 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -26 x=8 r=0
  item -6 type 1
CHOOSE bucket -6 x 8 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -6 x=8 r=0
  item 22 type 0
CHOOSE got 22
CHOOSE returns 1
CHOOSE got -6
 crush_bucket_choose -26 x=8 r=1
  item -2 type 1
CHOOSE bucket -2 x 8 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -2 x=8 r=1
  item 6 type 0
CHOOSE got 6
CHOOSE returns 2
CHOOSE got -2
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 8 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 8 [22,6]
CRUSHCHOOSE bucket -29 x 9 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=9 r=0
  item -26 type 2
CHOOSE got -26
 crush_bucket_choose -29 x=9 r=1
  item -28 type 2
CHOOSE got -28
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 9 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -26 x=9 r=0
  item -4 type 1
CHOOSE bucket -4 x 9 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -4 x=9 r=0
  item 12 type 0
CHOOSE got 12
CHOOSE returns 1
CHOOSE got -4
 crush_bucket_choose -26 x=9 r=1
  item -4 type 1
  reject 0  collide 1  ftotal 1  flocal 1
 crush_bucket_choose -26 x=9 r=2
  item -1 type 1
CHOOSE bucket -1 x 9 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -1 x=9 r=1
  item 1 type 0
CHOOSE got 1
CHOOSE returns 2
CHOOSE got -1
CHOOSE returns 2
CHOOSE_LEAF bucket -28 x 9 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 9 [12,1]
CRUSHCHOOSE bucket -29 x 10 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -29 x=10 r=0
  item -26 type 2
CHOOSE got -26
 crush_bucket_choose -29 x=10 r=1
  item -26 type 2
  reject 0  collide 1  ftotal 1  flocal 1
 crush_bucket_choose -29 x=10 r=2
  item -27 type 2
CHOOSE got -27
CHOOSE returns 2
CHOOSE_LEAF bucket -26 x 10 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -26 x=10 r=0
  item -6 type 1
CHOOSE bucket -6 x 10 outpos 0 numrep 1 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -6 x=10 r=0
  item 23 type 0
CHOOSE got 23
CHOOSE returns 1
CHOOSE got -6
 crush_bucket_choose -26 x=10 r=1
  item -8 type 1
CHOOSE bucket -8 x 10 outpos 1 numrep 2 tries 1 recurse_tries 0 local_retries 0 local_fallback_retries 0 parent_r 0
 crush_bucket_choose -8 x=10 r=1
  item 28 type 0
CHOOSE got 28
CHOOSE returns 2
CHOOSE got -8
CHOOSE returns 2
CHOOSE_LEAF bucket -27 x 10 outpos 0 numrep 4 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
CHOOSE returns 0
 rule 1 x 10 [23,28]
rule 1 (myrule) num_rep 2 result size == 2:    10/10
==8349== 
==8349== HEAP SUMMARY:
==8349==     in use at exit: 17,936 bytes in 219 blocks
==8349==   total heap usage: 884 allocs, 665 frees, 57,490 bytes allocated
==8349== 
==8349== LEAK SUMMARY:
==8349==    definitely lost: 0 bytes in 0 blocks
==8349==    indirectly lost: 0 bytes in 0 blocks
==8349==      possibly lost: 6,376 bytes in 147 blocks
==8349==    still reachable: 11,560 bytes in 72 blocks
==8349==         suppressed: 0 bytes in 0 blocks
==8349== Rerun with --leak-check=full to see details of leaked memory
==8349== 
==8349== For counts of detected and suppressed errors, rerun with: -v
==8349== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
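
To check a long trace like this one without reading it line by line, the mapping lines can be extracted and compared against num_rep. This is a minimal sketch that assumes the ` rule 1 x 1 [93,82]` line format shown above; the helper names are hypothetical and not part of crushtool.

```python
import re

# Matches mapping lines such as " rule 1 x 1 [93,82]".
MAPPING_RE = re.compile(r"^\s*rule (\d+) x (\d+) \[([\d,]*)\]\s*$")


def parse_mappings(output):
    """Return {x: [osd, ...]} for every mapping line in the test output."""
    mappings = {}
    for line in output.splitlines():
        m = MAPPING_RE.match(line)
        if m:
            osds = [int(v) for v in m.group(3).split(",") if v]
            mappings[int(m.group(2))] = osds
    return mappings


def oversized(mappings, num_rep):
    """Return the inputs x whose mapping holds more than num_rep OSDs."""
    return sorted(x for x, osds in mappings.items() if len(osds) > num_rep)


# A few lines copied from the trace above.
sample = """\
 rule 1 x 1 [93,82]
CRUSHCHOOSE bucket -29 x 2 outpos 0 numrep 2 tries 51 recurse_tries 1 local_retries 0 local_fallback_retries 0 parent_r 0
 rule 1 x 2 [81,86]
rule 1 (myrule) num_rep 2 result size == 2:    10/10
"""
result = parse_mappings(sample)
print(result)
print(oversized(result, 2))
```

An empty list from `oversized` means no mapping exceeded num_rep, which is consistent with the "result size == 2: 10/10" summary line.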

Actions #5

Updated by Johnu George over 9 years ago

I ran valgrind with the patch and it reported no errors for the rules below, which cover different combinations of num_rep and number of OSDs to be selected.

rule myrule {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 4 type host
step emit
}

rule myrule2 {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type rack
step chooseleaf firstn 1 type host
step emit
}

rule myrule3 {
ruleset 3
type replicated
min_size 1
max_size 10
step take default
step choose firstn 3 type rack
step chooseleaf firstn 1 type host
step emit
}

rule myrule4 {
ruleset 4
type replicated
min_size 1
max_size 10
step take default
step choose firstn 4 type rack
step chooseleaf firstn 1 type host
step emit
}

rule myrule5 {
ruleset 5
type replicated
min_size 1
max_size 10
step take default
step choose firstn 4 type rack
step chooseleaf firstn 0 type host
step emit
}

rule myrule6 {
ruleset 6
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn -1 type host
step emit
}

rule myrule7 {
ruleset 7
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 0 type host
step emit
}

rule myrule8 {
ruleset 8
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 2 type host
step emit
}

rule myrule9 {
ruleset 9
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 1 type host
step emit
step take default
step chooseleaf firstn 1 type host
step emit
}
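
Rule variants like the ones above could also be generated rather than typed by hand. The sketch below is a hypothetical helper, not a Ceph tool; it assumes the fixed header fields (type replicated, min_size 1, max_size 10, take default) used in every rule listed here.

```python
def render_rule(name, ruleset, blocks):
    """Render one replicated rule; blocks is a list of take..emit step lists.

    Each step is a tuple (op, n, bucket_type), e.g. ("chooseleaf", 1, "host").
    """
    lines = [f"rule {name} {{", f"ruleset {ruleset}", "type replicated",
             "min_size 1", "max_size 10"]
    for steps in blocks:
        lines.append("step take default")
        for op, n, bucket_type in steps:
            lines.append(f"step {op} firstn {n} type {bucket_type}")
        lines.append("step emit")
    lines.append("}")
    return "\n".join(lines)


# myrule: a single take..emit block with two steps.
print(render_rule("myrule", 1, [[("choose", 2, "rack"), ("chooseleaf", 4, "host")]]))
print()
# myrule6: two take..emit blocks, the second using a negative firstn count.
print(render_rule("myrule6", 6, [[("chooseleaf", 1, "host")],
                                 [("chooseleaf", -1, "host")]]))
```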

Actions #6

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 0 to 100
Actions #7

Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to Need More Info

What happens with indep rules?

Actions #8

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to Resolved
Actions #9

Updated by Loïc Dachary over 9 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to giant, firefly

I think both patches should be backported to giant and firefly. Would you like to do that? It essentially means you should run:

git checkout -b wip-9492-crush-giant origin/giant
git cherry-pick -x <commit hash> <commit hash>

and submit a pull request against giant instead of master, which is the default. Then do the same for firefly. There should not be any conflicts, because I don't think the code in this area has changed since firefly.
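
Since the same two commands are repeated for each release branch, they can be listed with a small script. This is only a sketch: the function name is hypothetical, the branch naming follows the wip-9492-crush-giant example above, and the commit hashes stay as placeholders because they are not given here.

```python
def backport_commands(issue, branch, commits):
    """Return the git commands (as argv lists) for one backport branch."""
    work_branch = f"wip-{issue}-crush-{branch}"
    return [
        ["git", "checkout", "-b", work_branch, f"origin/{branch}"],
        ["git", "cherry-pick", "-x", *commits],
    ]


# The commit hashes are placeholders, exactly as in the comment above.
for branch in ("giant", "firefly"):
    for argv in backport_commands(9492, branch, ["<commit hash>", "<commit hash>"]):
        print(" ".join(argv))
```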

Actions #10

Updated by Johnu George over 9 years ago

Pull request info:

Fix for firstn rules: https://github.com/ceph/ceph/pull/2568
Fix for indep rules: https://github.com/ceph/ceph/pull/2599

Actions #11

Updated by Loïc Dachary over 9 years ago

  • Status changed from Pending Backport to Fix Under Review
Actions #12

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions #13

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (10)