Bug #39174

crushtool crash on Fedora 28 and newer

Added by Ken Dreyer 4 months ago. Updated 4 months ago.

Status:
Pending Backport
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
04/10/2019
Due date:
% Done:
0%
Source:
Q/A
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

On Fedora 28, Fedora 29, Fedora 30, and RHEL 8, /usr/bin/crushtool crashes when trying to compile the map that Rook uses.

#0  0x00007fffeecf053f in raise () from /lib64/libc.so.6
#1  0x00007fffeecda895 in abort () from /lib64/libc.so.6
#2  0x00007fffef71e7a8 in std::__replacement_assert(char const*, int, char const*, char const*) () from /usr/lib64/ceph/libceph-common.so.0
#3  0x00007fffef93a063 in std::vector<int, std::allocator<int> >::operator[](unsigned long) () from /usr/lib64/ceph/libceph-common.so.0
#4  0x00007fffefb882a5 in CrushCompiler::parse_bucket(__gnu_cxx::__normal_iterator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >*, std::vector<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >, std::allocator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> > > > > const&)
    () from /usr/lib64/ceph/libceph-common.so.0
#5  0x00007fffefb88ab0 in CrushCompiler::parse_crush(__gnu_cxx::__normal_iterator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >*, std::vector<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >, std::allocator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> > > > > const&)
    () from /usr/lib64/ceph/libceph-common.so.0
#6  0x00007fffefb8aee8 in CrushCompiler::compile(std::istream&, char const*) ()
   from /usr/lib64/ceph/libceph-common.so.0
#7  0x0000555555562e13 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/include/c++/8/bits/basic_string.h:2290

The crushmap.txt is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 0
tunable straw_calc_version 1
tunable allowed_bucket_algs 22

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# default bucket
root default {
    id -1   # do not change unnecessarily
    alg straw
    hash 0  # rjenkins1
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

This crash occurs on the following platforms:

  • ceph-base-12.2.7-1.fc28
  • ceph-base-12.2.11-1.fc28
  • ceph-base-12.2.11-1.fc29
  • ceph-base-14.2.0-1.fc30
  • tip of nautilus on RHEL 8

It does not crash on:

  • ceph-base-12.2.8-1.fc27

One difference I see between Fedora 27 and 28 is that Fedora 27 has libstdc++-7.3.1 and Fedora 28 has libstdc++-8.3.1, but that this is the cause is just a guess.


Related issues

Copied to RADOS - Backport #39309: luminous: crushtool crash on Fedora 28 and newer New
Copied to RADOS - Backport #39310: nautilus: crushtool crash on Fedora 28 and newer Resolved
Copied to RADOS - Backport #39311: mimic: crushtool crash on Fedora 28 and newer Resolved

History

#1 Updated by Ken Dreyer 4 months ago

  • Description updated (diff)

#2 Updated by Ken Dreyer 4 months ago

  • Subject changed from crushmap crash on Fedora 29 and newer to crushmap crash on Fedora 28 and newer
  • Description updated (diff)

#3 Updated by Ken Dreyer 4 months ago

  • Subject changed from crushmap crash on Fedora 28 and newer to crushtool crash on Fedora 28 and newer
  • Description updated (diff)

#4 Updated by Ken Dreyer 4 months ago

  • Description updated (diff)

#5 Updated by Ken Dreyer 4 months ago

  • Priority changed from Normal to Urgent

#6 Updated by Vasu Kulkarni 4 months ago

This is a very good reason to drop one distro in teuthology and replace it with Fedora 28. I think Brad brought this up a long time back too in #sepia.

#7 Updated by Brad Hubbard 4 months ago

Vasu Kulkarni wrote:

This is a very good reason to drop one distro in teuthology and replace it with Fedora 28. I think Brad brought this up a long time back too in #sepia.

Many times, long ago, yes.

I'm looking into this crash.

#8 Updated by Brad Hubbard 4 months ago

  • Project changed from mgr to RADOS
  • Status changed from New to Verified
  • Assignee set to Brad Hubbard
  • Source set to Q/A

#9 Updated by Brad Hubbard 4 months ago

Turning up verbosity gives clues to what might be the problem.

<mock-chroot> sh-4.4# ./crushtool -v -c crushmap.txt 2>&1|head -25
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 0
tunable straw_calc_version 1
tunable allowed_bucket_algs 22
type 0 'osd'
type 1 'host'
type 2 'chassis'
type 3 'rack'
type 4 'row'
type 5 'pdu'
type 6 'pod'
type 7 'room'
type 8 'datacenter'
type 9 'region'
type 10 'root'
bucket default id -1
bucket default (-1) 0 items and weight 0
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = int; _Alloc = std::allocator<int>; std::vector<_Tp, _Alloc>::reference = int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
*** Caught signal (Aborted) **
 in thread 7f64629f6540 thread_name:crushtool
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

The problem here is that we have the following code in src/crush/CrushCompiler.cc:

 561 int CrushCompiler::parse_bucket(iter_t const& i)
 562 {
...
 651   vector<int> items(size);
 652   vector<int> weights(size);
...
 746   int r = crush.add_bucket(id, alg, hash, type, size,
 747                            &items[0], &weights[0], &idout);

Looking at a coredump.

(gdb) f
#4  0x00007fffefb86e85 in CrushCompiler::parse_bucket (this=0x7fffffffcfe0, i=...) at /builddir/build/BUILD/ceph-12.2.11/src/crush/CrushCompiler.cc:746
746       int r = crush.add_bucket(id, alg, hash, type, size,
(gdb) l
741       item_id[name] = id;
742       item_weight[id] = bucketweight;
743       
744       assert(id != 0);
745       int idout;
746       int r = crush.add_bucket(id, alg, hash, type, size,
747                                &items[0], &weights[0], &idout);
748       if (r < 0) {
749         if (r == -EEXIST)
750           err << "Duplicate bucket id " << id << std::endl;
(gdb) p items
$1 = std::vector of length 0, capacity 0
(gdb) p weights
$2 = std::vector of length 0, capacity 0
(gdb) down
#3  0x00007fffef936783 in std::vector<int, std::allocator<int> >::operator[] (this=this@entry=0x7fffffffc5a0, __n=__n@entry=0) at /usr/include/c++/8/bits/stl_vector.h:805
805           size() const _GLIBCXX_NOEXCEPT
(gdb) 
#2  0x00007fffef716168 in std::__replacement_assert (__file=__file@entry=0x7fffefc134c0 "/usr/include/c++/8/bits/stl_vector.h", __line=__line@entry=932, 
    __function=__function@entry=0x7fffefc47ca0 <_ZZNSt6vectorIiSaIiEEixEmE19__PRETTY_FUNCTION__> "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = int; _Alloc = std::allocator<int>; std::vector<_Tp, _Alloc>::reference = int&;"..., __condition=__condition@entry=0x7fffefc13490 "__builtin_expect(__n < this->size(), true)") at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2391
2391        __builtin_abort();
(gdb) l
2386      __replacement_assert(const char* __file, int __line,
2387                           const char* __function, const char* __condition)
2388      {
2389        __builtin_printf("%s:%d: %s: Assertion '%s' failed.\n", __file, __line,
2390                         __function, __condition);
2391        __builtin_abort();
2392      }
2393    }
2394    #define __glibcxx_assert_impl(_Condition)                                \
2395      do
(gdb) printf "Assertion '%s' failed.\n", __condition
Assertion '__builtin_expect(__n < this->size(), true)' failed.
(gdb) up
#3  0x00007fffef936783 in std::vector<int, std::allocator<int> >::operator[] (this=this@entry=0x7fffffffc5a0, __n=__n@entry=0) at /usr/include/c++/8/bits/stl_vector.h:805
805           size() const _GLIBCXX_NOEXCEPT
(gdb) p __n
$3 = 0
(gdb) p  this->size()
$4 = 0

Well fair enough. So why here? Why now?

This happens because _GLIBCXX_ASSERTIONS is included in the CXXFLAGS, which enables the bounds check in std::vector::operator[]. Taking the address of element 0 of an empty vector is considered unsafe, although historically it has done what you want; here, however, we are being pulled up on it. I suspect we need to pass the result of the data() [1] member function of vector instead, but I'll need to do some testing.

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#464 first line under "Rationale:"
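To illustrate the suspected fix, here is a minimal sketch of passing data() instead of &v[0] to a pointer-plus-length function. It is not the Ceph code: sum_items() and sum_vector() are hypothetical names, with sum_items() standing in for a C-style API such as crush.add_bucket(). The point is only that data() can be called on an empty vector without indexing it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a C-style API that takes a pointer and a count
// (similar in shape to crush.add_bucket(), but not Ceph code).
int sum_items(const int* items, std::size_t size) {
    // A null pointer is acceptable only when there is nothing to read.
    assert(items != nullptr || size == 0);
    int total = 0;
    for (std::size_t i = 0; i < size; ++i) {
        total += items[i];  // never reached when size == 0
    }
    return total;
}

// Hands the vector's storage to the C-style function.
int sum_vector(const std::vector<int>& v) {
    // &v[0] would index element 0, which trips the libstdc++ assertion
    // when v is empty and _GLIBCXX_ASSERTIONS is defined.
    // v.data() does not index the vector, so it is safe to call here;
    // the returned pointer must simply not be dereferenced when v is empty.
    return sum_items(v.data(), v.size());
}
```

Calling sum_vector() with an empty vector mirrors the "0 items" bucket from the crush map above: the callee receives a size of 0 and never touches the pointer.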

#11 Updated by Brad Hubbard 4 months ago

  • Status changed from Verified to Need Review

#12 Updated by Brad Hubbard 4 months ago

  • Pull request ID set to 27506

#13 Updated by Brad Hubbard 4 months ago

  • Status changed from Need Review to In Progress

#14 Updated by Brad Hubbard 4 months ago

  • Backport set to nautilus, mimic, luminous

#15 Updated by Sage Weil 4 months ago

  • Status changed from In Progress to Pending Backport

#16 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39309: luminous: crushtool crash on Fedora 28 and newer added

#17 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39310: nautilus: crushtool crash on Fedora 28 and newer added

#18 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #39311: mimic: crushtool crash on Fedora 28 and newer added
