Bug #57206

ceph_test_libcephfs_reclaim crashes during test

Added by Venky Shankar 3 months ago. Updated about 2 months ago.

Status: Triaged
Priority: Normal
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: pacific,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS): crash, task(easy)
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2022-08-18_04:30:42-fs-wip-vshankar-testing1-20220818-082047-testing-default-smithi/6978421

Core is at: ./remote/smithi061/coredump/1660821251.63191.core

file ./remote/smithi061/coredump/1660821251.63191.core
./remote/smithi061/coredump/1660821251.63191.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'ceph_test_libcephfs_reclaim', real uid: 1000, effective uid: 1000, real gid: 1267, effective gid: 1267, execfn: '/usr/bin/ceph_test_libcephfs_reclaim', platform: 'x86_64'

(I did not fetch the backtrace from the core.)

History

#1 Updated by Venky Shankar 3 months ago

  • Status changed from New to Triaged
  • Assignee set to Tamar Shacked
  • Labels (FS) task(easy) added

#2 Updated by Tamar Shacked 3 months ago

I've used https://github.com/ceph/ceph/blob/main/src/script/ceph-debug-docker.sh to deploy the build in a container.
Below is the backtrace of the crash. It happens at startup and seems to be related to the parameters passed to 'rbd_features_from_string(const std::string& orig_value, std::ostream *err)'.
I still need to figure out how to get symbols for /usr/lib/ceph/libceph-common.so.2 so that I can inspect the rbd_features_from_string parameters.

(gdb) bt full
#0  0x00007fc83f6053ee in std::locale::operator==(std::locale const&) const () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#1  0x00007fc83f4c6798 in boost::detail::lcast_ret_unsigned<std::char_traits<char>, unsigned long, char>::convert() () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#2  0x00007fc83f4c59e4 in librbd::rbd_features_from_string(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::ostream*) ()
   from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#3  0x00007fc83f150e0f in ?? () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#4  0x00007fc83f0acfb6 in Option::pre_validate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#5  0x00007fc83f0af5cb in Option::parse_value(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::variant<std::monostate, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned long, long, double, bool, entity_addr_t, entity_addrvec_t, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000l> >, Option::size_t, uuid_d>*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#6  0x00007fc83f07efa2 in md_config_t::_set_val(ConfigValues&, ConfigTracker const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Option const&, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#7  0x00007fc83f07f417 in md_config_t::set_val_default(ConfigValues&, ConfigTracker const&, std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#8  0x00007fc83f08ef0a in md_config_t::md_config_t(ConfigValues&, ConfigTracker const&, bool) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#9  0x00007fc83f035bdc in ceph::common::CephContext::CephContext(unsigned int, ceph::common::CephContext::create_options const&) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#10 0x00007fc83f036c61 in ceph::common::CephContext::CephContext(unsigned int, code_environment_t, int) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#11 0x00007fc83f075675 in common_preinit(CephInitParameters const&, code_environment_t, int) () from /usr/lib/ceph/libceph-common.so.2
No symbol table info available.
#12 0x00007fc83fe4a7f9 in ceph_create () from /lib/libcephfs.so.2
No symbol table info available.
#13 0x0000555a90b7882a in update_root_mode () at ./src/test/libcephfs/reclaim.cc:149
        admin = 0xe90000007f
        r = <optimized out>
        admin = <optimized out>
        r = <optimized out>
#14 main (argc=<optimized out>, argv=0x7ffc0ba75948) at ./src/test/libcephfs/reclaim.cc:149
        r = <optimized out>

#3 Updated by Tamar Shacked 3 months ago

The same crash reported for rgw - https://tracker.ceph.com/issues/57050
I'll go over it to get the details.

#4 Updated by Venky Shankar 3 months ago

Tamar,

Were you able to go through the changes for the rgw fix here: https://github.com/ceph/ceph/pull/47504 to see if we'd need to do something similar?

Cheers,
Venky

#5 Updated by Venky Shankar 2 months ago

  • Assignee changed from Tamar Shacked to Milind Changire

Milind, PTAL. FWIW - https://github.com/ceph/ceph/pull/47504 fixes a similar issue for RGW.

#6 Updated by Milind Changire 2 months ago

This doesn't crash on my local Ubuntu Focal vstart cluster.
The stack trace points to a boost::lexical_cast<> call.

Hypothesis:
boost::lexical_cast<> may have runtime requirements that the test environment does not meet in the same way as the build environment. I'm not sure about this, but it would explain the crash.
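
To illustrate why std::locale::operator== sits directly below the lexical_cast frame in the backtrace, here is a minimal stdlib-only sketch. It assumes, without having confirmed it against the Boost sources used in this build, that the unsigned conversion first compares the current global locale with the classic "C" locale before it reads digits. `parse_unsigned` is a hypothetical helper written for this note; it is not Boost's or Ceph's code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <locale>
#include <string>

// Hypothetical stand-in for an unsigned string-to-integer conversion that
// consults the global locale before parsing. Not Boost's implementation.
bool parse_unsigned(const std::string& text, std::uint64_t* out) {
  if (text.empty() || out == nullptr) {
    return false;
  }

  // Default construction copies the current global locale. This relies on
  // the C++ runtime's locale state being initialized already.
  const std::locale current;
  // Same kind of comparison as frame #0 (std::locale::operator==).
  const bool is_classic = (current == std::locale::classic());
  (void)is_classic;  // this sketch only handles plain digits either way

  const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t value = 0;
  for (const char c : text) {
    if (c < '0' || c > '9') {
      return false;  // not a plain decimal number
    }
    const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
    if (value > (max - digit) / 10) {
      return false;  // next step would overflow
    }
    value = value * 10 + digit;
  }
  *out = value;
  return true;
}
```

If the assumption holds, the crash would be in the locale comparison itself rather than in the digit parsing, which would fit a problem with the C++ runtime state rather than with the string passed to rbd_features_from_string.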

#7 Updated by Venky Shankar 2 months ago

It's not always reproducible. Did you check what changes were made in https://github.com/ceph/ceph/pull/47504 to mitigate this crash in rgw?

#8 Updated by Milind Changire about 2 months ago

The rgw solution is to stop building radosgw as a shared library and to build it as an executable binary instead.
I also found some references online that recommend against linking a shared library against a static libstdc++.
Nothing conclusive so far. If the problem were indeed caused by linking a shared library against static libstdc++, it should have been 100% reproducible in a vstart cluster as well.

#9 Updated by Venky Shankar about 2 months ago

It probably depends on the version of libstdc++, I guess. Do you see the version in the logs, or maybe a second linked copy of the library?
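
As a rough aid for the version question, the sketch below reports which libstdc++ headers a binary was compiled against. It uses the `__GLIBCXX__` and `_GLIBCXX_RELEASE` macros that libstdc++ defines; `libstdcxx_build_info` is a hypothetical helper name. Note that this is a compile-time value only: it says nothing about which libstdc++ is loaded at runtime, nor whether a static copy is embedded in a shared library. That would need inspection of the loaded objects, for example with ldd.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: describes the libstdc++ headers used at compile time.
std::string libstdcxx_build_info() {
#if defined(__GLIBCXX__)
  // __GLIBCXX__ is a date stamp (YYYYMMDD) of the libstdc++ headers.
  std::string info = "libstdc++ headers dated " + std::to_string(__GLIBCXX__);
#if defined(_GLIBCXX_RELEASE)
  // _GLIBCXX_RELEASE is the GCC major version (defined by newer releases).
  info += ", GCC release " + std::to_string(_GLIBCXX_RELEASE);
#endif
  return info;
#else
  return "not built against libstdc++";
#endif
}
```

Comparing this string between the environment where the crash shows up and one where it does not could help narrow down whether the libstdc++ version matters.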
