Bug #49387


several crashes from bad_alloc exceptions

Added by Casey Bodley about 3 years ago. Updated about 2 months ago.

Status: Resolved
Priority: Urgent
Assignee: -
Target version: -
% Done: 0%

Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

from http://qa-proxy.ceph.com/teuthology/yuvalif-2021-02-18_15:55:47-rgw:verify-wip-yuval-fix-49322-distro-basic-gibba/5892724/teuthology.log

2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:  what():  Bad allocation
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:*** Caught signal (Aborted) **
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout: in thread 7efe0aa21700 thread_name:civetweb-worker
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: ceph version 17.0.0-828-g915963ec (915963ecb9effcb1f2d38f444c1bb9307f8ffbe1) quincy (dev)
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7efe4211fdc0]
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 2: gsignal()
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 3: abort()
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7efe4111e06b]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7efe4112450c]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7efe41124567]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7efe411247c8]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7efe4c697bd6]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7efe4c6eccb6]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7efe4c6ecd06]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7efe4c6ecd5e]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7efe4d358b2f]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7efe4d3843d2]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7efe4d2cbb31]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xbde) [0x7efe4cf619fe]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 16: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int*)+0x27bd) [0x7efe4cf657ad]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x280) [0x7efe4ce9c2d0]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 18: /lib64/libradosgw.so.2(+0x621056) [0x7efe4d02b056]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 19: /lib64/libradosgw.so.2(+0x622cc7) [0x7efe4d02ccc7]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 20: /lib64/libradosgw.so.2(+0x623188) [0x7efe4d02d188]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7efe421152de]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 22: clone()

from http://qa-proxy.ceph.com/teuthology/yuvalif-2021-02-18_15:55:47-rgw:verify-wip-yuval-fix-49322-distro-basic-gibba/5892714/teuthology.log

2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout:  what():  Bad allocation
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout:*** Caught signal (Aborted) **
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout: in thread 7fe9ed93c700 thread_name:radosgw
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: ceph version 17.0.0-828-g915963ec (915963ecb9effcb1f2d38f444c1bb9307f8ffbe1) quincy (dev)
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7fea780e0dc0]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 2: gsignal()
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 3: abort()
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7fea770df06b]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7fea770e550c]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 6: /lib64/libstdc++.so.6(+0x95529) [0x7fea770e4529]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 7: __gxx_personality_v0()
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 8: /lib64/libgcc_s.so.1(+0x10b13) [0x7fea76ac5b13]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 9: _Unwind_Resume()
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 10: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x744) [0x56112803e374]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 11: /lib64/libradosgw.so.2(+0x4a2709) [0x7fea82e6d709]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 12: /lib64/libstdc++.so.6(+0xc2b73) [0x7fea77111b73]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 13: /lib64/libpthread.so.0(+0x82de) [0x7fea780d62de]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 14: clone()

from http://qa-proxy.ceph.com/teuthology/dang-2021-02-18_17:04:23-rgw-wip-dang-zipper-10-distro-basic-smithi/5892959/teuthology.log

2021-02-18T23:01:49.315 INFO:tasks.rgw.client.0.smithi162.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T23:01:49.315 INFO:tasks.rgw.client.0.smithi162.stdout:  what():  Bad allocation
2021-02-18T23:01:49.316 INFO:tasks.rgw.client.0.smithi162.stdout:*** Caught signal (Aborted) **
2021-02-18T23:01:49.316 INFO:tasks.rgw.client.0.smithi162.stdout: in thread 7f53d64b1700 thread_name:civetweb-worker
2021-02-18T23:01:49.317 INFO:tasks.rgw.client.0.smithi162.stdout: ceph version 17.0.0-856-gc06d457d (c06d457d5f977b07fb431b7c5f5f51a7467e045d) quincy (dev)
2021-02-18T23:01:49.317 INFO:tasks.rgw.client.0.smithi162.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f540bbabdc0]
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 2: gsignal()
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 3: abort()
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7f540abaa06b]
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7f540abb050c]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7f540abb0567]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7f540abb07c8]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7f5416123bd6]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7f5416178cb6]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7f5416178d06]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7f5416178d5e]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7f5416dddc2f]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7f5416e08fa2]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe0c) [0x7f5416d4e2dc]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xc0e) [0x7f54169f189e]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 16: (process_request(rgw::sal::RGWStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int*)+0x295d) [0x7f54169f5f0d]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x297) [0x7f541692dbc7]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 18: /lib64/libradosgw.so.2(+0x626b46) [0x7f5416abcb46]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 19: /lib64/libradosgw.so.2(+0x6287b7) [0x7f5416abe7b7]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 20: /lib64/libradosgw.so.2(+0x628c78) [0x7f5416abec78]
2021-02-18T23:01:49.323 INFO:tasks.rgw.client.0.smithi162.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7f540bba12de]
2021-02-18T23:01:49.323 INFO:tasks.rgw.client.0.smithi162.stdout: 22: clone()

These three crashes were in the rgw/verify suite, in jobs with validater/lockdep.yaml (so not under valgrind with notcmalloc builds).


Related issues 1 (0 open, 1 closed)

Related to mgr - Bug #49240: terminate called after throwing an instance of 'std::bad_alloc' (Resolved)

Actions #1

Updated by Casey Bodley about 3 years ago

from pacific (also tcmalloc) http://qa-proxy.ceph.com/teuthology/teuthology-2021-02-15_05:07:03-rgw-pacific-distro-basic-smithi/5882770/teuthology.log

2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout:  what():  Bad allocation
2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout:*** Caught signal (Aborted) **
2021-02-15T16:56:16.069 INFO:tasks.rgw.client.0.smithi041.stdout: in thread 7f9412f7b700 thread_name:radosgw
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: ceph version 16.1.0-287-g1c2f7607 (1c2f7607ce50b13c895483f8ce2ed014b62eec26) pacific (rc)
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f94d2f05dc0]
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 2: gsignal()
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 3: abort()
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 4: /lib64/libstdc++.so.6(+0x8ff8b) [0x7f94d1f71f8b]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 5: /lib64/libstdc++.so.6(+0x963ec) [0x7f94d1f783ec]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 6: /lib64/libstdc++.so.6(+0x96447) [0x7f94d1f78447]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 7: /lib64/libstdc++.so.6(+0x963de) [0x7f94d1f783de]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 8: (spawn::detail::continuation_context::resume()+0x87) [0x7f94ddc81747]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 9: (boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> >::operator()(boost::system::error_code, unsigned long, bool)+0x6a) [0x7f94ddcee13a]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 10: (boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >::operator()(boost::system::error_code, unsigned long)+0xd8) [0x7f94ddcee3f8]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 11: (boost::asio::detail::executor_function<boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>, std::allocator<void> >::do_complete(boost::asio::detail::executor_function_base*, bool)+0xaf) [0x7f94ddceed7f]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 12: (boost::asio::detail::executor_op<boost::asio::executor::function, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x4d) [0x55942b8b886d]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 13: (boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>::operator()()+0x8d) [0x7f94ddc9593d]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 14: (void boost::asio::io_context::executor_type::dispatch<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>, std::allocator<void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>&&, std::allocator<void> const&) const+0x9c) [0x7f94ddc95a5c]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 15: (void boost::asio::detail::strand_executor_service::dispatch<boost::asio::io_context::executor_type const, boost::asio::executor::function, std::allocator<void> >(std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> const&, boost::asio::io_context::executor_type const&, boost::asio::executor::function&&, std::allocator<void> const&)+0x1ac) [0x7f94ddce87ac]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 16: (void boost::asio::executor::dispatch<boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>, std::allocator<void> >(boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&) const+0x156) [0x7f94ddceef26]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 17: (boost::asio::detail::reactive_socket_recv_op<boost::beast::buffers_prefix_view<boost::asio::mutable_buffer>, boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::asio::detail::io_object_executor<boost::asio::executor> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x112) [0x7f94ddcef092]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 18: (boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1c2) [0x7f94ddc84a92]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 19: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x4e2) [0x55942b8bc112]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 20: /lib64/libradosgw.so.2(+0x47a476) [0x7f94ddc67476]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 21: /lib64/libstdc++.so.6(+0xc2a33) [0x7f94d1fa4a33]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 22: /lib64/libpthread.so.0(+0x82de) [0x7f94d2efb2de]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 23: clone()

Actions #2

Updated by Casey Bodley about 3 years ago

  • Priority changed from High to Urgent
Actions #3

Updated by Casey Bodley about 3 years ago

  • Related to Bug #49240: terminate called after throwing an instance of 'std::bad_alloc' added
Actions #4

Updated by Neha Ojha about 3 years ago

Casey: is this very easily reproducible in the rgw suite?

Actions #5

Updated by Casey Bodley about 3 years ago

Yes Neha, it looks like each recent suite run has a couple of these crashes. They show up as failures in our java test suite like this:

Command failed on smithi162 with status 1: 'cd /home/ubuntu/cephtest/s3-tests-java && /opt/gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache --tests BucketTest'

So far I've only seen it in our non-valgrind runs, so it may be specific to tcmalloc builds.

Actions #6

Updated by Neha Ojha about 3 years ago

Casey Bodley wrote:

Yes Neha, it looks like each recent suite run has a couple of these crashes. They show up as failures in our java test suite like this:

Command failed on smithi162 with status 1: 'cd /home/ubuntu/cephtest/s3-tests-java && /opt/gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache --tests BucketTest'

So far I've only seen it in our non-valgrind runs, so it may be specific to tcmalloc builds.

Hmm, any idea when this started showing up in Pacific?

Actions #8

Updated by Yuval Lifshitz about 3 years ago

One of the posix_memalign() crashes reproduced locally when running java_s3test against rgw with the civetweb frontend (backtrace with line numbers):

#0  0x00007f5323ee29d5 in raise () from /lib64/libc.so.6
#1  0x00007f5323ecb8a4 in abort () from /lib64/libc.so.6
#2  0x00007f5324293926 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3  0x00007f532429f1ac in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007f532429f217 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007f532429f4c9 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x00007f532f3af85a in ceph::buffer::v15_2_0::raw_posix_aligned::raw_posix_aligned (this=0x5573e4d0dbc0, l=1179648, _align=8) at /root/projects/ceph/src/common/buffer.cc:171
#7  0x00007f532f3a5ae8 in ceph::buffer::v15_2_0::create_aligned_in_mempool (len=1179648, align=8, mempool=18) at /root/projects/ceph/src/common/buffer.cc:335
#8  0x00007f532f3a5b7f in ceph::buffer::v15_2_0::create_aligned (len=1179648, align=8) at /root/projects/ceph/src/common/buffer.cc:345
#9  0x00007f532f3a5718 in ceph::buffer::v15_2_0::create (len=1179648) at /root/projects/ceph/src/common/buffer.cc:287
#10 0x00007f532f3a5d98 in ceph::buffer::v15_2_0::ptr::ptr (this=0x7f52f74f94e0, l=1179648) at /root/projects/ceph/src/common/buffer.cc:369
#11 0x00007f53319365c7 in RGWPutObj_ObjStore::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest.cc:1049
#12 0x00007f5331988d15 in RGWPutObj_ObjStore_S3::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest_s3.cc:2464
#13 0x00007f53318397b4 in RGWPutObj::execute (this=0x5573e4470c00, y=...) at /root/projects/ceph/src/rgw/rgw_op.cc:3903
#14 0x00007f53312f3628 in rgw_process_authenticated (handler=0x5573e4d5d080, op=@0x7f52f74fb7f0: 0x5573e4470c00, req=0x7f52f74fc630, s=0x7f52f74fbc40, y=..., skip_retarget=false) at /root/projects/ceph/src/rgw/rgw_process.cc:167
#15 0x00007f53312f5ee4 in process_request (store=0x5573e340fc00, rest=0x7ffd5a54aa80, req=0x7f52f74fc630, frontend_prefix="", auth_registry=..., client_io=0x7f52f74fc680, olog=0x0, yield=..., scheduler=0x0, user=0x0, http_ret=0x7f52f74fc608) at /root/projects/ceph/src/rgw/rgw_process.cc:301
#16 0x00007f53311288b3 in RGWCivetWebFrontend::process (this=0x5573e336fdc0, conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:70
#17 0x00007f5331128682 in civetweb_callback (conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:51
#18 0x00007f5331438abd in handle_request (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:12514
#19 0x00007f533143e39d in process_new_connection (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:15924
#20 0x00007f533143eb9b in worker_thread_run (thread_args=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16253
#21 0x00007f533143ec5a in worker_thread (thread_func_param=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16296
#22 0x00007f53240793f9 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f5323fa6b53 in clone () from /lib64/libc.so.6

The call to posix_memalign() was with len=1179648 and align=8.
The return value was 12 (ENOMEM).
Also, the overall memory consumption of RGW at the time was (per pmap):

total          5997716K

Actions #9

Updated by Neha Ojha about 3 years ago

Seems like the issue is more prevalent in the rgw suite than in rados (yet to find a reliable reproducer).
Is https://tracker.ceph.com/issues/49387#note-8 easily reproducible? We could use this or any rgw teuthology test to do a bisect in pacific.

Actions #10

Updated by Casey Bodley about 3 years ago

Neha Ojha wrote:

Seems like the issue is more prevalent in the rgw suite than in rados (yet to find a reliable reproducer).
Is https://tracker.ceph.com/issues/49387#note-8 easily reproducible? We could use this or any rgw teuthology test to do a bisect in pacific.

I've so far been unable to reproduce this locally on fedora 33. I've tried with libc/tcmalloc and beast/civetweb.

@Yuval, would you be willing to bisect this with your notification reproducer on the pacific branch? The issue appeared there sometime between:

Jan. 29 commit 74275226ac79999bfd40e683dc9a1309e76033bf https://pulpito.ceph.com/teuthology-2021-01-29_05:07:02-rgw-pacific-distro-basic-smithi/
Feb. 5 commit fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c https://pulpito.ceph.com/teuthology-2021-02-05_05:07:03-rgw-pacific-distro-basic-smithi/

Actions #11

Updated by Casey Bodley about 3 years ago

I reran that 'good' commit 74275226ac79999bfd40e683dc9a1309e76033bf through the rgw/verify suite and saw the crash there too, so we may need to go further back than Jan. 29.

http://qa-proxy.ceph.com/teuthology/cbodley-2021-02-24_18:26:21-rgw:verify-wip-cbodley-testing-distro-basic-smithi/5911998/teuthology.log

2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout:  what():  Bad allocation
2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout:*** Caught signal (Aborted) **
2021-02-24T18:56:24.116 INFO:tasks.rgw.client.0.smithi026.stdout: in thread 7fe8b8dbe700 thread_name:civetweb-worker
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: ceph version 16.1.0-100-g4dfe1c66 (4dfe1c6693519967a4cee7b34ba75f38f77f8018) pacific (rc)
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7fe8ee4b8dc0]
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 2: gsignal()
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 3: abort()
2021-02-24T18:56:24.121 INFO:tasks.rgw.client.0.smithi026.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7fe8ed4b706b]
2021-02-24T18:56:24.121 INFO:tasks.rgw.client.0.smithi026.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7fe8ed4bd50c]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7fe8ed4bd567]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7fe8ed4bd7c8]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7fe8f8a2dbd6]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7fe8f8a82d06]
2021-02-24T18:56:24.123 INFO:tasks.rgw.client.0.smithi026.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7fe8f8a82d56]
2021-02-24T18:56:24.123 INFO:tasks.rgw.client.0.smithi026.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7fe8f8a82dae]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7fe8f96c252f]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7fe8f96ecfc2]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7fe8f9637a61]
Actions #12

Updated by Yuval Lifshitz about 3 years ago

This issue is unrelated to bucket notifications. The bad_alloc that happened in bucket notifications was fixed here: https://github.com/ceph/ceph/pull/39521 (a dangling reference issue).

The issue here is an ENOMEM reply from posix_memalign() that we translate to a bad_alloc exception (details are in comment #8).
Could not reproduce that on the "pacific" branch (v16.1.0 with latest commit on Jan 22). Will try to bisect on the master branch.

Actions #13

Updated by Yuval Lifshitz about 3 years ago

The bisect log does not seem to make sense - the "bad" commit is a Python-only change.
Will try a more manual search:

# git bisect log
git bisect start
# good: [74275226ac79999bfd40e683dc9a1309e76033bf] Merge pull request #39021 from ceph/pacific-backport
git bisect good 74275226ac79999bfd40e683dc9a1309e76033bf
# bad: [fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c] Merge pull request #39045 from smithfarm/wip-48970-pacific
git bisect bad fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c
# good: [586ca46890a2e67db84267b49f077213013e627c] pybind/mgr/cephadm: rename variables to avoid name reuse
git bisect good 586ca46890a2e67db84267b49f077213013e627c
# bad: [dcebce1d114257b9676bb23e2be57778ee39fd8e] mgr/cephadm: force flag for ok-to-stop and ok-to-stop for monitoring stack
git bisect bad dcebce1d114257b9676bb23e2be57778ee39fd8e
# good: [e000d22a633e495afee6ddfb461997f708f35305] pybind/mgr/cephadm: return -ENOENT if self.ssh_user is None
git bisect good e000d22a633e495afee6ddfb461997f708f35305
# good: [e26335447e95a2b346eb52d6649772d04bcb965e] qa/suites/rados/cephadm: stop testing 18.04 + podman
git bisect good e26335447e95a2b346eb52d6649772d04bcb965e
# good: [77b4e535c9e7e5f979c7ba72bc782eafdbf03e3c] mgr/cephadm: Properly handle JSON Decode error
git bisect good 77b4e535c9e7e5f979c7ba72bc782eafdbf03e3c
# good: [4f5ce67fc60a0a8f44221368341d0fca36ff1075] mgr/cephadm: minor pep8 fixes
git bisect good 4f5ce67fc60a0a8f44221368341d0fca36ff1075
# first bad commit: [dcebce1d114257b9676bb23e2be57778ee39fd8e] mgr/cephadm: force flag for ok-to-stop and ok-to-stop for monitoring stack

Actions #14

Updated by Mark Kogan about 3 years ago

Adding addr2line analysis of the call stack from teuthology:

…
2021-02-24T19:51:45.327 INFO:teuthology.orchestra.run.smithi040.stdout:suite > AWS4 tests > AWS4Test.testMultipartDownloadHLAPIAWS4 STARTED                                                                                                     

2021-02-24T19:51:45.443 INFO:tasks.rgw.client.0.smithi040.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'                                                                                             
2021-02-24T19:51:45.443 INFO:tasks.rgw.client.0.smithi040.stdout:  what():  Bad allocation
2021-02-24T19:51:45.444 INFO:tasks.rgw.client.0.smithi040.stdout:*** Caught signal (Aborted) **                                                                                                                                                
2021-02-24T19:51:45.444 INFO:tasks.rgw.client.0.smithi040.stdout: in thread 7f0c850fe700 thread_name:civetweb-worker                                                                                                                           
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: ceph version 17.0.0-974-gc1bcac9f (c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea) quincy (dev)                                                                                    
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f0cb8f70dc0]                                                                                                                         
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 2: gsignal()                                                                                                                                                                 
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 3: abort()                                                                                                                                                                   
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 4: /lib64/libstdc++.so.6(+0x8ff8b) [0x7f0cb7fdcf8b]                                                                                                                          
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 5: /lib64/libstdc++.so.6(+0x963ec) [0x7f0cb7fe33ec]                                                                                                                          
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 6: /lib64/libstdc++.so.6(+0x96447) [0x7f0cb7fe3447]                                                                                                                          
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 7: /lib64/libstdc++.so.6(+0x966a8) [0x7f0cb7fe36a8]                                                                                                                          
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7f0cc34e8bd6]                                                                                                                           
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7f0cc353dcb6]                                                                                      
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7f0cc353dd06]                                                                                                      
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7f0cc353dd5e]                                                                                                    
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7f0cc41b7a8f]                                                                                       
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7f0cc41e3332]                                                                                    
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7f0cc41288c1]                                                                                                              
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xbde) [0x7f0cc3dbd3de]                                                  
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 16: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyR
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x280) [0x7f0cc3cf7c90]                                                                                                    
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 18: /lib64/libradosgw.so.2(+0x62c7c6) [0x7f0cc3e877c6]                                                                                                                       
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 19: /lib64/libradosgw.so.2(+0x62e437) [0x7f0cc3e89437]                                                                                                                       
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 20: /lib64/libradosgw.so.2(+0x62e8f8) [0x7f0cc3e898f8]                                                                                                                       
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7f0cb8f662de]                                                                                                                         
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout: 22: clone()                                                                                                                                                                  
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout:2021-02-24T19:51:45.444+0000 7f0c850fe700 -1 *** Caught signal (Aborted) **                                                                                                   
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout: in thread 7f0c850fe700 thread_name:civetweb-worker     
… 

# debuginfo rpms:
...
2021-02-24T19:40:11.512 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea  
2021-02-24T19:40:11.551 INFO:teuthology.orchestra.run.smithi093.stdout:CentOS                                                                                                                                  
2021-02-24T19:40:11.552 DEBUG:teuthology.misc:System to be installed: CentOS                                                                                                                                  
2021-02-24T19:40:11.553 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea  
2021-02-24T19:40:11.680 INFO:teuthology.task.install.rpm:Pulling from https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/
2021-02-24T19:40:11.681 INFO:teuthology.task.install.rpm:Package version is 17.0.0-974.gc1bcac9f
...

axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm 
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm 
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm 
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm 

rpm2archive librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm

tar -xvf librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz

axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/SRPMS/ceph-17.0.0-974.gc1bcac9f.el8.src.rpm 

rpm2archive ceph-17.0.0-974.gc1bcac9f.el8.src.rpm

tar -xvf ceph-17.0.0-974.gc1bcac9f.el8.src.rpm.tgz
tar -xvf ceph-17.0.0-974-gc1bcac9f.tar.bz2

eu-addr2line --debuginfo-path=$(realpath ./usr/lib/debug/) --pretty-print -a -f -i -C -e ./usr/lib64/librados.so.2 0x3abd6
  0x000000000003abd6: ceph::buffer::v15_2_0::raw_combined::create(unsigned int, unsigned int, int) at /usr/src/debug/ceph-17.0.0-974.gc1bcac9f.el8.x86_64/src/common/buffer.cc:117:18
  (inlined by) ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int) at /usr/src/debug/ceph-17.0.0-974.gc1bcac9f.el8.x86_64/src/common/buffer.cc:340:5

vim ./ceph-17.0.0-974-gc1bcac9f/src/common/buffer.cc +117
…
 97   static ceph::unique_leakable_ptr<buffer::raw>
 98   create(unsigned len,
 99          unsigned align,
100          int mempool = mempool::mempool_buffer_anon)
101   {
102     if (!align)
103       align = sizeof(size_t);
104     size_t rawlen = round_up_to(sizeof(buffer::raw_combined),
105                                 alignof(buffer::raw_combined));
106     size_t datalen = round_up_to(len, alignof(buffer::raw_combined));
107
108 #ifdef DARWIN
109     char *ptr = (char *) valloc(rawlen + datalen);
110 #else
111     char *ptr = 0;
112     int r = ::posix_memalign((void**)(void*)&ptr, align, rawlen + datalen);
113     if (r)
114       throw bad_alloc();
115 #endif /* DARWIN */
116     if (!ptr)
→ 117     throw bad_alloc();
Actions #15

Updated by Casey Bodley about 3 years ago

Mark Kogan wrote:

> 108 #ifdef DARWIN
> 109     char *ptr = (char *) valloc(rawlen + datalen);
> 110 #else
> 111     char *ptr = 0;
> 112     int r = ::posix_memalign((void**)(void*)&ptr, align, rawlen + datalen);
> 113     if (r)
> 114       throw bad_alloc();
> 115 #endif /* DARWIN */
> 116     if (!ptr)
> → 117     throw bad_alloc();
> 

I'm curious whether that line number is accurate, since it seems to contradict Yuval's statement, which points at line 114:

Yuval Lifshitz wrote:

the call to posix_memalign() was with `len=1179648` and `align=8`
the return value was 12 (ENOMEM, i.e. OOM)

Actions #16

Updated by Mark Kogan about 3 years ago

with Yuval's assistance, this reproduces on an F33 machine (after dnf update),
but it is not consistent (it reproduces sporadically for several consecutive runs, then stops reproducing for tens of runs)

ceph version 16.0.0-10232-gc1bcac9fe1 (c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea) quincy (dev)                                                                                                
 1: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x2736096) [0x7f8574b0e096]                                                                          
 2: /lib64/libpthread.so.0(+0x141e0) [0x7f85671d31e0]                                                                                                                                         
 3: gsignal()                                                                                                                                                                                 
 4: abort()
 5: /lib64/libstdc++.so.6(+0x9e926) [0x7f8566fdc926]
 6: /lib64/libstdc++.so.6(+0xaa1ac) [0x7f8566fe81ac]
 7: /lib64/libstdc++.so.6(+0xa9239) [0x7f8566fe7239]
 8: __gxx_personality_v0()
 9: /lib64/libgcc_s.so.1(+0x10c2f) [0x7f8566dedc2f]
 10: _Unwind_Resume()
 11: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x19d) [0x55bc34051f9b]
 12: (boost::asio::io_context::run()+0x48) [0x7f8573f051f8]
 13: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b134f5) [0x7f8573eeb4f5]
 14: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28450) [0x7f8573f00450]
 15: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28405) [0x7f8573f00405]
 16: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b283b2) [0x7f8573f003b2]
 17: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28222) [0x7f8573f00222]
 18: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b2439a) [0x7f8573efc39a]
 19: /lib64/libstdc++.so.6(+0xd65f4) [0x7f85670145f4]
 20: /lib64/libpthread.so.0(+0x93f9) [0x7f85671c83f9]
 21: clone()

(don't mind that this is the d3n branch; the cache is disabled)

built with ./do_cmake.sh without any defines/overrides
and started with: env MON=1 OSD=1 MDS=0 MGR=1 RGW=1 NFS=0 ../src/vstart.sh -n

the tcmalloc (gperftools) version was indeed upgraded to 2.8 in F33 (from 2.7):

rpm -qa | grep gperftools
gperftools-libs-2.8-3.fc33.x86_64
gperftools-devel-2.8-3.fc33.x86_64
                   ^     ^^

vs F32:
rpm -qa | grep gperftools
gperftools-libs-2.7-7.fc32.x86_64
gperftools-devel-2.7-7.fc32.x86_64
                   ^     ^^

and the same happened in teuthology: a teuthology log from the end of 2020 shows tcmalloc version 2.7:

2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout:================================================================================
2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout: Package              Arch   Version                     Repository        Size 
2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout:================================================================================
...
2020-12-23T19:12:44.342 INFO:teuthology.orchestra.run.smithi129.stdout:Installing dependencies:
...
2020-12-23T19:12:44.343 INFO:teuthology.orchestra.run.smithi129.stdout: gperftools-libs      x86_64 2.7-6.el8                   epel             315 k 

and version 2.8 in a recent teuthology log:

2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout:================================================================================    
2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout: Package               Arch   Version                    Repository        Size     
2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout:================================================================================    
...
2021-02-24T19:40:19.042 INFO:teuthology.orchestra.run.smithi093.stdout:Installing dependencies:                                                            
...
2021-02-24T19:40:19.043 INFO:teuthology.orchestra.run.smithi093.stdout: gperftools-libs       x86_64 2.8-1.el8                  epel             306 k     

Actions #17

Updated by Mark Kogan about 3 years ago

(addendum to previous post)

the java_s3tests flow for the reproduction:

git clone git@github.com:ceph/java_s3tests.git
cd java_s3tests

cp -v config.properties.sample config.properties

./bootstrap.sh --path=$(realpath ./gradle)

./gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache
Actions #18

Updated by Ken Dreyer about 3 years ago

I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1933792 to track removing gperftools 2.8 from EPEL 8 and going back to the last 2.7 build.

Actions #19

Updated by Ken Dreyer about 3 years ago

Mark and Yuval, tangential to EPEL 8, Fedora 34 has gperftools 2.9. For EPEL 9, it would be helpful to see if this same bug reproduces on Fedora 34.

Actions #20

Updated by Ken Dreyer about 3 years ago

For Fedora 33, I've pushed gperftools-2.8.1 to updates-testing today, https://bodhi.fedoraproject.org/updates/FEDORA-2021-67257daa36

Actions #21

Updated by Dan van der Ster about 3 years ago

It seems this impacted the 15.2.9 build: https://tracker.ceph.com/issues/49618

Actions #22

Updated by David Orman about 3 years ago

Dan van der Ster wrote:

It seems this impacted the 15.2.9 build: https://tracker.ceph.com/issues/49618

Confirmed as per my update in that ticket; it's part of the Docker images for Ceph, and we are seeing frequent OSD crashes on all of our clusters running 15.2.9 via containers. The containers will need a rebuild/push to Docker Hub so that those impacted (who used cephadm to deploy via containers) can "upgrade" the clusters.

Actions #23

Updated by Josh Durgin about 3 years ago

  • Status changed from New to Resolved

15.2.10 containers have tcmalloc 2.7 again, and it's in EPEL now.

Actions #24

Updated by Casey Bodley about 2 months ago

I'm hearing reports that RHEL 8.6 is shipping a gperftools 2.8.1 that has these same crashes?

edit: oops, this isn't the case. A user is seeing crashes with a custom repo that provides 2.8.1.
