Project

General

Profile

Actions

Bug #39174

closed

crushtool crash on Fedora 28 and newer

Added by Ken Dreyer about 5 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
nautilus, mimic, luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On Fedora 29, Fedora 30, and RHEL 8, /usr/bin/crushtool crashes when trying to compile the map that Rook uses.

#0  0x00007fffeecf053f in raise () from /lib64/libc.so.6
#1  0x00007fffeecda895 in abort () from /lib64/libc.so.6
#2  0x00007fffef71e7a8 in std::__replacement_assert(char const*, int, char const*, char const*) () from /usr/lib64/ceph/libceph-common.so.0
#3  0x00007fffef93a063 in std::vector<int, std::allocator<int> >::operator[](unsigned long) () from /usr/lib64/ceph/libceph-common.so.0
#4  0x00007fffefb882a5 in CrushCompiler::parse_bucket(__gnu_cxx::__normal_iterator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >*, std::vector<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >, std::allocator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> > > > > const&)
    () from /usr/lib64/ceph/libceph-common.so.0
#5  0x00007fffefb88ab0 in CrushCompiler::parse_crush(__gnu_cxx::__normal_iterator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >*, std::vector<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> >, std::allocator<boost::spirit::tree_node<boost::spirit::node_val_data<char const*, boost::spirit::nil_t> > > > > const&)
    () from /usr/lib64/ceph/libceph-common.so.0
#6  0x00007fffefb8aee8 in CrushCompiler::compile(std::istream&, char const*) ()
   from /usr/lib64/ceph/libceph-common.so.0
#7  0x0000555555562e13 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/include/c++/8/bits/basic_string.h:2290

The crushmap.txt is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 0
tunable straw_calc_version 1
tunable allowed_bucket_algs 22

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# default bucket
root default {
    id -1   # do not change unnecessarily
    alg straw
    hash 0  # rjenkins1
}

# rules
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map

This crash occurs on the following platforms:

  • ceph-base-12.2.7-1.fc28
  • ceph-base-12.2.11-1.fc28
  • ceph-base-12.2.11-1.fc29
  • ceph-base-14.2.0-1.fc30
  • tip of nautilus on RHEL 8

It does not crash on:

  • ceph-base-12.2.8-1.fc27

One difference I see between Fedora 27 and 28 is that Fedora 27 has libstdc++-7.3.1 and Fedora 28 has libstdc++-8.3.1 , but that is just a guess.


Related issues 3 (0 open3 closed)

Copied to RADOS - Backport #39309: luminous: crushtool crash on Fedora 28 and newerRejectedActions
Copied to RADOS - Backport #39310: nautilus: crushtool crash on Fedora 28 and newerResolvedNeha OjhaActions
Copied to RADOS - Backport #39311: mimic: crushtool crash on Fedora 28 and newerResolvedPrashant DActions
Actions #1

Updated by Ken Dreyer about 5 years ago

  • Description updated (diff)
Actions #2

Updated by Ken Dreyer about 5 years ago

  • Subject changed from crushmap crash on Fedora 29 and newer to crushmap crash on Fedora 28 and newer
  • Description updated (diff)
Actions #3

Updated by Ken Dreyer about 5 years ago

  • Subject changed from crushmap crash on Fedora 28 and newer to crushtool crash on Fedora 28 and newer
  • Description updated (diff)
Actions #4

Updated by Ken Dreyer about 5 years ago

  • Description updated (diff)
Actions #5

Updated by Ken Dreyer about 5 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by Vasu Kulkarni about 5 years ago

very good reason to drop one distro in teuthology and replace it with fedora 28, I think Brad brought this up long time back too in #sepia.

Actions #7

Updated by Brad Hubbard about 5 years ago

Vasu Kulkarni wrote:

very good reason to drop one distro in teuthology and replace it with fedora 28, I think Brad brought this up long time back too in #sepia.

Many times, long ago, yes.

I'm looking into this crash.

Actions #8

Updated by Brad Hubbard about 5 years ago

  • Project changed from mgr to RADOS
  • Status changed from New to 12
  • Assignee set to Brad Hubbard
  • Source set to Q/A
Actions #9

Updated by Brad Hubbard about 5 years ago

Turning up verbosity gives clues to what might be the problem.

<mock-chroot> sh-4.4# ./crushtool -v -c crushmap.txt 2>&1|head -25
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 0
tunable straw_calc_version 1
tunable allowed_bucket_algs 22
type 0 'osd'
type 1 'host'
type 2 'chassis'
type 3 'rack'
type 4 'row'
type 5 'pdu'
type 6 'pod'
type 7 'room'
type 8 'datacenter'
type 9 'region'
type 10 'root'
bucket default id -1
bucket default (-1) 0 items and weight 0
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = int; _Alloc = std::allocator<int>; std::vector<_Tp, _Alloc>::reference = int&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
*** Caught signal (Aborted) **
 in thread 7f64629f6540 thread_name:crushtool
 ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)

The problem here is we have the following code in src/crush/CrushCompiler.cc

 561 int CrushCompiler::parse_bucket(iter_t const& i)                                                                                                                                                                                        
 562 {
...
 651   vector<int> items(size);                                                                                                                                                                                                              
 652   vector<int> weights(size);
...
 746   int r = crush.add_bucket(id, alg, hash, type, size,                                                                                                                                                                                   
 747                            &items[0], &weights[0], &idout);

Looking at a coredump.

(gdb) f
#4  0x00007fffefb86e85 in CrushCompiler::parse_bucket (this=0x7fffffffcfe0, i=...) at /builddir/build/BUILD/ceph-12.2.11/src/crush/CrushCompiler.cc:746
746       int r = crush.add_bucket(id, alg, hash, type, size,
(gdb) l
741       item_id[name] = id;
742       item_weight[id] = bucketweight;
743       
744       assert(id != 0);
745       int idout;
746       int r = crush.add_bucket(id, alg, hash, type, size,
747                                &items[0], &weights[0], &idout);
748       if (r < 0) {
749         if (r == -EEXIST)
750           err << "Duplicate bucket id " << id << std::endl;
(gdb) p items
$1 = std::vector of length 0, capacity 0
(gdb) p weights
$2 = std::vector of length 0, capacity 0
(gdb) down
#3  0x00007fffef936783 in std::vector<int, std::allocator<int> >::operator[] (this=this@entry=0x7fffffffc5a0, __n=__n@entry=0) at /usr/include/c++/8/bits/stl_vector.h:805
805           size() const _GLIBCXX_NOEXCEPT
(gdb) 
#2  0x00007fffef716168 in std::__replacement_assert (__file=__file@entry=0x7fffefc134c0 "/usr/include/c++/8/bits/stl_vector.h", __line=__line@entry=932, 
    __function=__function@entry=0x7fffefc47ca0 <_ZZNSt6vectorIiSaIiEEixEmE19__PRETTY_FUNCTION__> "std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = int; _Alloc = std::allocator<int>; std::vector<_Tp, _Alloc>::reference = int&;"..., __condition=__condition@entry=0x7fffefc13490 "__builtin_expect(__n < this->size(), true)") at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2391
2391        __builtin_abort();
(gdb) l
2386      __replacement_assert(const char* __file, int __line,
2387                           const char* __function, const char* __condition)
2388      {
2389        __builtin_printf("%s:%d: %s: Assertion '%s' failed.\n", __file, __line,
2390                         __function, __condition);
2391        __builtin_abort();
2392      }
2393    }
2394    #define __glibcxx_assert_impl(_Condition)                                \
2395      do
(gdb) printf "Assertion '%s' failed.\n", __condition
Assertion '__builtin_expect(__n < this->size(), true)' failed.
(gdb) up
#3  0x00007fffef936783 in std::vector<int, std::allocator<int> >::operator[] (this=this@entry=0x7fffffffc5a0, __n=__n@entry=0) at /usr/include/c++/8/bits/stl_vector.h:805
805           size() const _GLIBCXX_NOEXCEPT
(gdb) p __n
$3 = 0
(gdb) p  this->size()
$4 = 0

Well fair enough. So why here? Why now?

Due to the inclusion of _GLIBCXX_ASSERTIONS in the CXXFLAGS. The use of the address of element 0 of an empty vector is considered unsafe although it will historically do what you want. However, here we are being pulled up on it. I suspect we need to pass the data() [1] member function of vector here but I'll need to do some testing.

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#464 first line under "Rationale:"

Actions #11

Updated by Brad Hubbard about 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #12

Updated by Brad Hubbard about 5 years ago

  • Pull request ID set to 27506
Actions #13

Updated by Brad Hubbard about 5 years ago

  • Status changed from Fix Under Review to In Progress
Actions #14

Updated by Brad Hubbard about 5 years ago

  • Backport set to nautilus, mimic, luminous
Actions #15

Updated by Sage Weil about 5 years ago

  • Status changed from In Progress to Pending Backport
Actions #16

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39309: luminous: crushtool crash on Fedora 28 and newer added
Actions #17

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39310: nautilus: crushtool crash on Fedora 28 and newer added
Actions #18

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #39311: mimic: crushtool crash on Fedora 28 and newer added
Actions #19

Updated by Nathan Cutler about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Also available in: Atom PDF