Bug #49387
several crashes from bad_alloc exceptions
Status: Closed
% Done: 0%
Description
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout: what(): Bad allocation
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout:*** Caught signal (Aborted) **
2021-02-18T17:49:45.348 INFO:tasks.rgw.client.0.gibba014.stdout: in thread 7efe0aa21700 thread_name:civetweb-worker
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: ceph version 17.0.0-828-g915963ec (915963ecb9effcb1f2d38f444c1bb9307f8ffbe1) quincy (dev)
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7efe4211fdc0]
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 2: gsignal()
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 3: abort()
2021-02-18T17:49:45.350 INFO:tasks.rgw.client.0.gibba014.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7efe4111e06b]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7efe4112450c]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7efe41124567]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7efe411247c8]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7efe4c697bd6]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7efe4c6eccb6]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7efe4c6ecd06]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7efe4c6ecd5e]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7efe4d358b2f]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7efe4d3843d2]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7efe4d2cbb31]
2021-02-18T17:49:45.351 INFO:tasks.rgw.client.0.gibba014.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xbde) [0x7efe4cf619fe]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 16: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int*)+0x27bd) [0x7efe4cf657ad]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x280) [0x7efe4ce9c2d0]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 18: /lib64/libradosgw.so.2(+0x621056) [0x7efe4d02b056]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 19: /lib64/libradosgw.so.2(+0x622cc7) [0x7efe4d02ccc7]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 20: /lib64/libradosgw.so.2(+0x623188) [0x7efe4d02d188]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7efe421152de]
2021-02-18T17:49:45.352 INFO:tasks.rgw.client.0.gibba014.stdout: 22: clone()
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout: what(): Bad allocation
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout:*** Caught signal (Aborted) **
2021-02-18T17:31:41.342 INFO:tasks.rgw.client.0.gibba004.stdout: in thread 7fe9ed93c700 thread_name:radosgw
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: ceph version 17.0.0-828-g915963ec (915963ecb9effcb1f2d38f444c1bb9307f8ffbe1) quincy (dev)
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7fea780e0dc0]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 2: gsignal()
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 3: abort()
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7fea770df06b]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7fea770e550c]
2021-02-18T17:31:41.343 INFO:tasks.rgw.client.0.gibba004.stdout: 6: /lib64/libstdc++.so.6(+0x95529) [0x7fea770e4529]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 7: __gxx_personality_v0()
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 8: /lib64/libgcc_s.so.1(+0x10b13) [0x7fea76ac5b13]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 9: _Unwind_Resume()
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 10: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x744) [0x56112803e374]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 11: /lib64/libradosgw.so.2(+0x4a2709) [0x7fea82e6d709]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 12: /lib64/libstdc++.so.6(+0xc2b73) [0x7fea77111b73]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 13: /lib64/libpthread.so.0(+0x82de) [0x7fea780d62de]
2021-02-18T17:31:41.344 INFO:tasks.rgw.client.0.gibba004.stdout: 14: clone()
2021-02-18T23:01:49.315 INFO:tasks.rgw.client.0.smithi162.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-18T23:01:49.315 INFO:tasks.rgw.client.0.smithi162.stdout: what(): Bad allocation
2021-02-18T23:01:49.316 INFO:tasks.rgw.client.0.smithi162.stdout:*** Caught signal (Aborted) **
2021-02-18T23:01:49.316 INFO:tasks.rgw.client.0.smithi162.stdout: in thread 7f53d64b1700 thread_name:civetweb-worker
2021-02-18T23:01:49.317 INFO:tasks.rgw.client.0.smithi162.stdout: ceph version 17.0.0-856-gc06d457d (c06d457d5f977b07fb431b7c5f5f51a7467e045d) quincy (dev)
2021-02-18T23:01:49.317 INFO:tasks.rgw.client.0.smithi162.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f540bbabdc0]
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 2: gsignal()
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 3: abort()
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7f540abaa06b]
2021-02-18T23:01:49.318 INFO:tasks.rgw.client.0.smithi162.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7f540abb050c]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7f540abb0567]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7f540abb07c8]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7f5416123bd6]
2021-02-18T23:01:49.319 INFO:tasks.rgw.client.0.smithi162.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7f5416178cb6]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7f5416178d06]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7f5416178d5e]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7f5416dddc2f]
2021-02-18T23:01:49.320 INFO:tasks.rgw.client.0.smithi162.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7f5416e08fa2]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe0c) [0x7f5416d4e2dc]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xc0e) [0x7f54169f189e]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 16: (process_request(rgw::sal::RGWStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, int*)+0x295d) [0x7f54169f5f0d]
2021-02-18T23:01:49.321 INFO:tasks.rgw.client.0.smithi162.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x297) [0x7f541692dbc7]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 18: /lib64/libradosgw.so.2(+0x626b46) [0x7f5416abcb46]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 19: /lib64/libradosgw.so.2(+0x6287b7) [0x7f5416abe7b7]
2021-02-18T23:01:49.322 INFO:tasks.rgw.client.0.smithi162.stdout: 20: /lib64/libradosgw.so.2(+0x628c78) [0x7f5416abec78]
2021-02-18T23:01:49.323 INFO:tasks.rgw.client.0.smithi162.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7f540bba12de]
2021-02-18T23:01:49.323 INFO:tasks.rgw.client.0.smithi162.stdout: 22: clone()
These three crashes were in the rgw/verify suite, in jobs with validater/lockdep.yaml (i.e. not under valgrind with notcmalloc builds).
Updated by Casey Bodley about 3 years ago
From pacific (also tcmalloc): http://qa-proxy.ceph.com/teuthology/teuthology-2021-02-15_05:07:03-rgw-pacific-distro-basic-smithi/5882770/teuthology.log
2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout: what(): Bad allocation
2021-02-15T16:56:16.068 INFO:tasks.rgw.client.0.smithi041.stdout:*** Caught signal (Aborted) **
2021-02-15T16:56:16.069 INFO:tasks.rgw.client.0.smithi041.stdout: in thread 7f9412f7b700 thread_name:radosgw
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: ceph version 16.1.0-287-g1c2f7607 (1c2f7607ce50b13c895483f8ce2ed014b62eec26) pacific (rc)
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f94d2f05dc0]
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 2: gsignal()
2021-02-15T16:56:16.070 INFO:tasks.rgw.client.0.smithi041.stdout: 3: abort()
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 4: /lib64/libstdc++.so.6(+0x8ff8b) [0x7f94d1f71f8b]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 5: /lib64/libstdc++.so.6(+0x963ec) [0x7f94d1f783ec]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 6: /lib64/libstdc++.so.6(+0x96447) [0x7f94d1f78447]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 7: /lib64/libstdc++.so.6(+0x963de) [0x7f94d1f783de]
2021-02-15T16:56:16.071 INFO:tasks.rgw.client.0.smithi041.stdout: 8: (spawn::detail::continuation_context::resume()+0x87) [0x7f94ddc81747]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 9: (boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> >::operator()(boost::system::error_code, unsigned long, bool)+0x6a) [0x7f94ddcee13a]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 10: (boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >::operator()(boost::system::error_code, unsigned long)+0xd8) [0x7f94ddcee3f8]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 11: (boost::asio::detail::executor_function<boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>, std::allocator<void> >::do_complete(boost::asio::detail::executor_function_base*, bool)+0xaf) [0x7f94ddceed7f]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 12: (boost::asio::detail::executor_op<boost::asio::executor::function, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x4d) [0x55942b8b886d]
2021-02-15T16:56:16.072 INFO:tasks.rgw.client.0.smithi041.stdout: 13: (boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>::operator()()+0x8d) [0x7f94ddc9593d]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 14: (void boost::asio::io_context::executor_type::dispatch<boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>, std::allocator<void> >(boost::asio::detail::strand_executor_service::invoker<boost::asio::io_context::executor_type const>&&, std::allocator<void> const&) const+0x9c) [0x7f94ddc95a5c]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 15: (void boost::asio::detail::strand_executor_service::dispatch<boost::asio::io_context::executor_type const, boost::asio::executor::function, std::allocator<void> >(std::shared_ptr<boost::asio::detail::strand_executor_service::strand_impl> const&, boost::asio::io_context::executor_type const&, boost::asio::executor::function&&, std::allocator<void> const&)+0x1ac) [0x7f94ddce87ac]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 16: (void boost::asio::executor::dispatch<boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>, std::allocator<void> >(boost::asio::detail::binder2<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::system::error_code, unsigned long>&&, std::allocator<void> const&) const+0x156) [0x7f94ddceef26]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 17: (boost::asio::detail::reactive_socket_recv_op<boost::beast::buffers_prefix_view<boost::asio::mutable_buffer>, boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>::ops::transfer_op<true, boost::asio::mutable_buffer, boost::beast::detail::dynamic_read_ops::read_op<boost::beast::basic_stream<boost::asio::ip::tcp, boost::asio::executor, boost::beast::unlimited_rate_policy>, boost::beast::flat_static_buffer<65536ul>, boost::beast::http::detail::read_some_condition<true>, spawn::detail::coro_handler<boost::asio::executor_binder<void (*)(), boost::asio::executor>, unsigned long> > >, boost::asio::detail::io_object_executor<boost::asio::executor> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x112) [0x7f94ddcef092]
2021-02-15T16:56:16.073 INFO:tasks.rgw.client.0.smithi041.stdout: 18: (boost::asio::detail::epoll_reactor::descriptor_state::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)+0x1c2) [0x7f94ddc84a92]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 19: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x4e2) [0x55942b8bc112]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 20: /lib64/libradosgw.so.2(+0x47a476) [0x7f94ddc67476]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 21: /lib64/libstdc++.so.6(+0xc2a33) [0x7f94d1fa4a33]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 22: /lib64/libpthread.so.0(+0x82de) [0x7f94d2efb2de]
2021-02-15T16:56:16.074 INFO:tasks.rgw.client.0.smithi041.stdout: 23: clone()
Updated by Casey Bodley about 3 years ago
- Related to Bug #49240: terminate called after throwing an instance of 'std::bad_alloc' added
Updated by Neha Ojha about 3 years ago
Casey: is this very easily reproducible in the rgw suite?
Updated by Casey Bodley about 3 years ago
Yes Neha, it looks like each recent suite run has a couple of these crashes. They show up as failures in our java test suite like this:
Command failed on smithi162 with status 1: 'cd /home/ubuntu/cephtest/s3-tests-java && /opt/gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache --tests BucketTest'
So far I've only seen it in our non-valgrind runs, so it may be specific to tcmalloc builds.
Updated by Neha Ojha about 3 years ago
Casey Bodley wrote:
Yes Neha, it looks like each recent suite run has a couple of these crashes. They show up as failures in our java test suite like this:
Command failed on smithi162 with status 1: 'cd /home/ubuntu/cephtest/s3-tests-java && /opt/gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache --tests BucketTest'
So far I've only seen it in our non-valgrind runs, so it may be specific to tcmalloc builds.
Hmm, any idea when this started showing up in Pacific?
Updated by Casey Bodley about 3 years ago
The first pacific run I see it in is https://pulpito.ceph.com/teuthology-2021-02-05_05:07:03-rgw-pacific-distro-basic-smithi from Feb. 6.
Updated by Yuval Lifshitz about 3 years ago
One of the posix_memalign() crashes reproduced locally when running java_s3test against rgw with the civetweb frontend (backtrace with line numbers):
#0 0x00007f5323ee29d5 in raise () from /lib64/libc.so.6
#1 0x00007f5323ecb8a4 in abort () from /lib64/libc.so.6
#2 0x00007f5324293926 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3 0x00007f532429f1ac in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4 0x00007f532429f217 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007f532429f4c9 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007f532f3af85a in ceph::buffer::v15_2_0::raw_posix_aligned::raw_posix_aligned (this=0x5573e4d0dbc0, l=1179648, _align=8) at /root/projects/ceph/src/common/buffer.cc:171
#7 0x00007f532f3a5ae8 in ceph::buffer::v15_2_0::create_aligned_in_mempool (len=1179648, align=8, mempool=18) at /root/projects/ceph/src/common/buffer.cc:335
#8 0x00007f532f3a5b7f in ceph::buffer::v15_2_0::create_aligned (len=1179648, align=8) at /root/projects/ceph/src/common/buffer.cc:345
#9 0x00007f532f3a5718 in ceph::buffer::v15_2_0::create (len=1179648) at /root/projects/ceph/src/common/buffer.cc:287
#10 0x00007f532f3a5d98 in ceph::buffer::v15_2_0::ptr::ptr (this=0x7f52f74f94e0, l=1179648) at /root/projects/ceph/src/common/buffer.cc:369
#11 0x00007f53319365c7 in RGWPutObj_ObjStore::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest.cc:1049
#12 0x00007f5331988d15 in RGWPutObj_ObjStore_S3::get_data (this=0x5573e4470c00, bl=...) at /root/projects/ceph/src/rgw/rgw_rest_s3.cc:2464
#13 0x00007f53318397b4 in RGWPutObj::execute (this=0x5573e4470c00, y=...) at /root/projects/ceph/src/rgw/rgw_op.cc:3903
#14 0x00007f53312f3628 in rgw_process_authenticated (handler=0x5573e4d5d080, op=@0x7f52f74fb7f0: 0x5573e4470c00, req=0x7f52f74fc630, s=0x7f52f74fbc40, y=..., skip_retarget=false) at /root/projects/ceph/src/rgw/rgw_process.cc:167
#15 0x00007f53312f5ee4 in process_request (store=0x5573e340fc00, rest=0x7ffd5a54aa80, req=0x7f52f74fc630, frontend_prefix="", auth_registry=..., client_io=0x7f52f74fc680, olog=0x0, yield=..., scheduler=0x0, user=0x0, http_ret=0x7f52f74fc608) at /root/projects/ceph/src/rgw/rgw_process.cc:301
#16 0x00007f53311288b3 in RGWCivetWebFrontend::process (this=0x5573e336fdc0, conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:70
#17 0x00007f5331128682 in civetweb_callback (conn=0x5573e43109d8) at /root/projects/ceph/src/rgw/rgw_civetweb_frontend.cc:51
#18 0x00007f5331438abd in handle_request (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:12514
#19 0x00007f533143e39d in process_new_connection (conn=0x5573e43109d8) at /root/projects/ceph/src/civetweb/src/civetweb.c:15924
#20 0x00007f533143eb9b in worker_thread_run (thread_args=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16253
#21 0x00007f533143ec5a in worker_thread (thread_func_param=0x5573e33445e0) at /root/projects/ceph/src/civetweb/src/civetweb.c:16296
#22 0x00007f53240793f9 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f5323fa6b53 in clone () from /lib64/libc.so.6
The call to posix_memalign() was with len=1179648 and align=8; the return value was 12 (ENOMEM).
Overall memory consumption of RGW at the time (per pmap):
total 5997716K
Updated by Neha Ojha about 3 years ago
Seems like the issue is more prevalent in the rgw suite than in rados (yet to find a reliable reproducer).
Is https://tracker.ceph.com/issues/49387#note-8 easily reproducible? We could use it, or any rgw teuthology test, to do a bisect in pacific.
Updated by Casey Bodley about 3 years ago
Neha Ojha wrote:
Seems like the issue is more prevalent in the rgw suite than in rados (yet to find a reliable reproducer).
Is https://tracker.ceph.com/issues/49387#note-8 easily reproducible? We could use it, or any rgw teuthology test, to do a bisect in pacific.
I've so far been unable to reproduce this locally on Fedora 33. I've tried with libc/tcmalloc and beast/civetweb.
@Yuval, would you be willing to bisect this with your notification reproducer on the pacific branch? The issue appeared there sometime between:
Jan. 29 commit 74275226ac79999bfd40e683dc9a1309e76033bf https://pulpito.ceph.com/teuthology-2021-01-29_05:07:02-rgw-pacific-distro-basic-smithi/
Feb. 5 commit fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c https://pulpito.ceph.com/teuthology-2021-02-05_05:07:03-rgw-pacific-distro-basic-smithi/
Updated by Casey Bodley about 3 years ago
I reran that 'good' commit 74275226ac79999bfd40e683dc9a1309e76033bf through the rgw/verify suite and saw the crash there too, so we may need to go further back than Jan. 29.
2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout: what(): Bad allocation
2021-02-24T18:56:24.115 INFO:tasks.rgw.client.0.smithi026.stdout:*** Caught signal (Aborted) **
2021-02-24T18:56:24.116 INFO:tasks.rgw.client.0.smithi026.stdout: in thread 7fe8b8dbe700 thread_name:civetweb-worker
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: ceph version 16.1.0-100-g4dfe1c66 (4dfe1c6693519967a4cee7b34ba75f38f77f8018) pacific (rc)
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7fe8ee4b8dc0]
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 2: gsignal()
2021-02-24T18:56:24.120 INFO:tasks.rgw.client.0.smithi026.stdout: 3: abort()
2021-02-24T18:56:24.121 INFO:tasks.rgw.client.0.smithi026.stdout: 4: /lib64/libstdc++.so.6(+0x9006b) [0x7fe8ed4b706b]
2021-02-24T18:56:24.121 INFO:tasks.rgw.client.0.smithi026.stdout: 5: /lib64/libstdc++.so.6(+0x9650c) [0x7fe8ed4bd50c]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 6: /lib64/libstdc++.so.6(+0x96567) [0x7fe8ed4bd567]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 7: /lib64/libstdc++.so.6(+0x967c8) [0x7fe8ed4bd7c8]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7fe8f8a2dbd6]
2021-02-24T18:56:24.122 INFO:tasks.rgw.client.0.smithi026.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7fe8f8a82d06]
2021-02-24T18:56:24.123 INFO:tasks.rgw.client.0.smithi026.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7fe8f8a82d56]
2021-02-24T18:56:24.123 INFO:tasks.rgw.client.0.smithi026.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7fe8f8a82dae]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7fe8f96c252f]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7fe8f96ecfc2]
2021-02-24T18:56:24.124 INFO:tasks.rgw.client.0.smithi026.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7fe8f9637a61]
Updated by Yuval Lifshitz about 3 years ago
This issue is unrelated to bucket notifications. The bad_alloc that happened in bucket notifications was fixed here: https://github.com/ceph/ceph/pull/39521 (a dangling-reference issue).
The issue here is an out-of-memory reply from posix_memalign() that we translate into a bad_alloc exception (details are in comment #8).
Could not reproduce that on the pacific branch (v16.1.0 with latest commit on Jan 22). Will try to bisect on the master branch.
Updated by Yuval Lifshitz about 3 years ago
The bisect log does not seem to make sense: the "bad" commit is a Python-only change.
Will try a more manual search:
# git bisect log
git bisect start
# good: [74275226ac79999bfd40e683dc9a1309e76033bf] Merge pull request #39021 from ceph/pacific-backport
git bisect good 74275226ac79999bfd40e683dc9a1309e76033bf
# bad: [fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c] Merge pull request #39045 from smithfarm/wip-48970-pacific
git bisect bad fc1e361ac9e05b37d8f9c5ac0b433ca305ec485c
# good: [586ca46890a2e67db84267b49f077213013e627c] pybind/mgr/cephadm: rename variables to avoid name reuse
git bisect good 586ca46890a2e67db84267b49f077213013e627c
# bad: [dcebce1d114257b9676bb23e2be57778ee39fd8e] mgr/cephadm: force flag for ok-to-stop and ok-to-stop for monitoring stack
git bisect bad dcebce1d114257b9676bb23e2be57778ee39fd8e
# good: [e000d22a633e495afee6ddfb461997f708f35305] pybind/mgr/cephadm: return -ENOENT if self.ssh_user is None
git bisect good e000d22a633e495afee6ddfb461997f708f35305
# good: [e26335447e95a2b346eb52d6649772d04bcb965e] qa/suites/rados/cephadm: stop testing 18.04 + podman
git bisect good e26335447e95a2b346eb52d6649772d04bcb965e
# good: [77b4e535c9e7e5f979c7ba72bc782eafdbf03e3c] mgr/cephadm: Properly handle JSON Decode error
git bisect good 77b4e535c9e7e5f979c7ba72bc782eafdbf03e3c
# good: [4f5ce67fc60a0a8f44221368341d0fca36ff1075] mgr/cephadm: minor pep8 fixes
git bisect good 4f5ce67fc60a0a8f44221368341d0fca36ff1075
# first bad commit: [dcebce1d114257b9676bb23e2be57778ee39fd8e] mgr/cephadm: force flag for ok-to-stop and ok-to-stop for monitoring stack
Updated by Mark Kogan about 3 years ago
Adding addr2line analysis of the call stack from teuthology:
…
2021-02-24T19:51:45.327 INFO:teuthology.orchestra.run.smithi040.stdout:suite > AWS4 tests > AWS4Test.testMultipartDownloadHLAPIAWS4 STARTED
2021-02-24T19:51:45.443 INFO:tasks.rgw.client.0.smithi040.stdout:terminate called after throwing an instance of 'ceph::buffer::v15_2_0::bad_alloc'
2021-02-24T19:51:45.443 INFO:tasks.rgw.client.0.smithi040.stdout: what(): Bad allocation
2021-02-24T19:51:45.444 INFO:tasks.rgw.client.0.smithi040.stdout:*** Caught signal (Aborted) **
2021-02-24T19:51:45.444 INFO:tasks.rgw.client.0.smithi040.stdout: in thread 7f0c850fe700 thread_name:civetweb-worker
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: ceph version 17.0.0-974-gc1bcac9f (c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea) quincy (dev)
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7f0cb8f70dc0]
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 2: gsignal()
2021-02-24T19:51:45.446 INFO:tasks.rgw.client.0.smithi040.stdout: 3: abort()
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 4: /lib64/libstdc++.so.6(+0x8ff8b) [0x7f0cb7fdcf8b]
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 5: /lib64/libstdc++.so.6(+0x963ec) [0x7f0cb7fe33ec]
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 6: /lib64/libstdc++.so.6(+0x96447) [0x7f0cb7fe3447]
2021-02-24T19:51:45.447 INFO:tasks.rgw.client.0.smithi040.stdout: 7: /lib64/libstdc++.so.6(+0x966a8) [0x7f0cb7fe36a8]
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 8: /lib64/librados.so.2(+0x3abd6) [0x7f0cc34e8bd6]
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 9: (ceph::buffer::v15_2_0::create_aligned(unsigned int, unsigned int)+0x26) [0x7f0cc353dcb6]
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 10: (ceph::buffer::v15_2_0::create(unsigned int)+0x26) [0x7f0cc353dd06]
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 11: (ceph::buffer::v15_2_0::ptr::ptr(unsigned int)+0x2e) [0x7f0cc353dd5e]
2021-02-24T19:51:45.448 INFO:tasks.rgw.client.0.smithi040.stdout: 12: (RGWPutObj_ObjStore::get_data(ceph::buffer::v15_2_0::list&)+0xaf) [0x7f0cc41b7a8f]
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 13: (RGWPutObj_ObjStore_S3::get_data(ceph::buffer::v15_2_0::list&)+0x12) [0x7f0cc41e3332]
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 14: (RGWPutObj::execute(optional_yield)+0xe21) [0x7f0cc41288c1]
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 15: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, optional_yield, bool)+0xbde) [0x7f0cc3dbd3de]
2021-02-24T19:51:45.449 INFO:tasks.rgw.client.0.smithi040.stdout: 16: (process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rgw::auth::StrategyR
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 17: (RGWCivetWebFrontend::process(mg_connection*)+0x280) [0x7f0cc3cf7c90]
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 18: /lib64/libradosgw.so.2(+0x62c7c6) [0x7f0cc3e877c6]
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 19: /lib64/libradosgw.so.2(+0x62e437) [0x7f0cc3e89437]
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 20: /lib64/libradosgw.so.2(+0x62e8f8) [0x7f0cc3e898f8]
2021-02-24T19:51:45.450 INFO:tasks.rgw.client.0.smithi040.stdout: 21: /lib64/libpthread.so.0(+0x82de) [0x7f0cb8f662de]
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout: 22: clone()
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout:2021-02-24T19:51:45.444+0000 7f0c850fe700 -1 *** Caught signal (Aborted) **
2021-02-24T19:51:45.451 INFO:tasks.rgw.client.0.smithi040.stdout: in thread 7f0c850fe700 thread_name:civetweb-worker
…
# debuginfo rpms:
...
2021-02-24T19:40:11.512 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea
2021-02-24T19:40:11.551 INFO:teuthology.orchestra.run.smithi093.stdout:CentOS
2021-02-24T19:40:11.552 DEBUG:teuthology.misc:System to be installed: CentOS
2021-02-24T19:40:11.553 DEBUG:teuthology.packaging:Querying https://shaman.ceph.com/api/search?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea
2021-02-24T19:40:11.680 INFO:teuthology.task.install.rpm:Pulling from https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/
2021-02-24T19:40:11.681 INFO:teuthology.task.install.rpm:Package version is 17.0.0-974.gc1bcac9f
...
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/x86_64/ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
rpm2archive ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm
tar -xvf librados2-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf librados2-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-radosgw-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-radosgw-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-base-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-base-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-common-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
tar -xvf ceph-common-debuginfo-17.0.0-974.gc1bcac9f.el8.x86_64.rpm.tgz
axel -a https://1.chacra.ceph.com/r/ceph/wip-rgw-d3n-cache-upstreaming_i47/c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea/centos/8/flavors/default/SRPMS/ceph-17.0.0-974.gc1bcac9f.el8.src.rpm
rpm2archive ceph-17.0.0-974.gc1bcac9f.el8.src.rpm
tar -xvf ceph-17.0.0-974.gc1bcac9f.el8.src.rpm.tgz
tar -xvf ceph-17.0.0-974-gc1bcac9f.tar.bz2
eu-addr2line --debuginfo-path=$(realpath ./usr/lib/debug/) --pretty-print -a -f -i -C -e ./usr/lib64/librados.so.2 0x3abd6
0x000000000003abd6: ceph::buffer::v15_2_0::raw_combined::create(unsigned int, unsigned int, int) at /usr/src/debug/ceph-17.0.0-974.gc1bcac9f.el8.x86_64/src/common/buffer.cc:117:18
 (inlined by) ceph::buffer::v15_2_0::create_aligned_in_mempool(unsigned int, unsigned int, int) at /usr/src/debug/ceph-17.0.0-974.gc1bcac9f.el8.x86_64/src/common/buffer.cc:340:5
vim ./ceph-17.0.0-974-gc1bcac9f/src/common/buffer.cc +117
…
  97   static ceph::unique_leakable_ptr<buffer::raw>
  98   create(unsigned len,
  99          unsigned align,
 100          int mempool = mempool::mempool_buffer_anon)
 101   {
 102     if (!align)
 103       align = sizeof(size_t);
 104     size_t rawlen = round_up_to(sizeof(buffer::raw_combined),
 105                                 alignof(buffer::raw_combined));
 106     size_t datalen = round_up_to(len, alignof(buffer::raw_combined));
 107
 108 #ifdef DARWIN
 109     char *ptr = (char *) valloc(rawlen + datalen);
 110 #else
 111     char *ptr = 0;
 112     int r = ::posix_memalign((void**)(void*)&ptr, align, rawlen + datalen);
 113     if (r)
 114       throw bad_alloc();
 115 #endif /* DARWIN */
 116     if (!ptr)
→117       throw bad_alloc();
Updated by Casey Bodley about 3 years ago
Mark Kogan wrote:
> 108 #ifdef DARWIN
> 109     char *ptr = (char *) valloc(rawlen + datalen);
> 110 #else
> 111     char *ptr = 0;
> 112     int r = ::posix_memalign((void**)(void*)&ptr, align, rawlen + datalen);
> 113     if (r)
> 114       throw bad_alloc();
> 115 #endif /* DARWIN */
> 116     if (!ptr)
> →117       throw bad_alloc();
I'm curious whether that line number (117) is accurate, since it seems to contradict Yuval's statement, which points at line 114:
Yuval Lifshitz wrote:
the call to posix_memalign() was with len=1179648 and align=8
the return value was 12 (= ENOMEM)
Updated by Mark Kogan about 3 years ago
With Yuval's assistance, this reproduces on an F33 machine (after a dnf update),
but it is not consistent (it reproduces sporadically for several consecutive runs, then stops reproducing for tens of runs).
ceph version 16.0.0-10232-gc1bcac9fe1 (c1bcac9fe1b99f005d296f6ba4d652b4ba7181ea) quincy (dev)
 1: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x2736096) [0x7f8574b0e096]
 2: /lib64/libpthread.so.0(+0x141e0) [0x7f85671d31e0]
 3: gsignal()
 4: abort()
 5: /lib64/libstdc++.so.6(+0x9e926) [0x7f8566fdc926]
 6: /lib64/libstdc++.so.6(+0xaa1ac) [0x7f8566fe81ac]
 7: /lib64/libstdc++.so.6(+0xa9239) [0x7f8566fe7239]
 8: __gxx_personality_v0()
 9: /lib64/libgcc_s.so.1(+0x10c2f) [0x7f8566dedc2f]
 10: _Unwind_Resume()
 11: (boost::asio::detail::scheduler::run(boost::system::error_code&)+0x19d) [0x55bc34051f9b]
 12: (boost::asio::io_context::run()+0x48) [0x7f8573f051f8]
 13: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b134f5) [0x7f8573eeb4f5]
 14: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28450) [0x7f8573f00450]
 15: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28405) [0x7f8573f00405]
 16: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b283b2) [0x7f8573f003b2]
 17: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b28222) [0x7f8573f00222]
 18: /mnt/nvme0/src-git/ceph--amathuria__d3n-cache-upstreaming/build/lib/libradosgw.so.2(+0x1b2439a) [0x7f8573efc39a]
 19: /lib64/libstdc++.so.6(+0xd65f4) [0x7f85670145f4]
 20: /lib64/libpthread.so.0(+0x93f9) [0x7f85671c83f9]
 21: clone()
(Don't mind that it is the d3n branch; the cache is disabled.)
Built with ./do_cmake.sh without any defines/overrides,
and started with: env MON=1 OSD=1 MDS=0 MGR=1 RGW=1 NFS=0 ../src/vstart.sh -n
The tcmalloc version was indeed upgraded to 2.8 in F33 (from 2.7):
rpm -qa | grep gperftools
gperftools-libs-2.8-3.fc33.x86_64
gperftools-devel-2.8-3.fc33.x86_64
vs F32:
rpm -qa | grep gperftools
gperftools-libs-2.7-7.fc32.x86_64
gperftools-devel-2.7-7.fc32.x86_64
The same happened in teuthology: a teuthology log from the end of 2020 shows tcmalloc version 2.7:
2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout:================================================================================
2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout: Package          Arch    Version    Repository  Size
2020-12-23T19:12:44.340 INFO:teuthology.orchestra.run.smithi129.stdout:================================================================================
...
2020-12-23T19:12:44.342 INFO:teuthology.orchestra.run.smithi129.stdout:Installing dependencies:
...
2020-12-23T19:12:44.343 INFO:teuthology.orchestra.run.smithi129.stdout: gperftools-libs  x86_64  2.7-6.el8  epel        315 k
and version 2.8 in recent teuthology logs:
2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout:================================================================================
2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout: Package          Arch    Version    Repository  Size
2021-02-24T19:40:19.040 INFO:teuthology.orchestra.run.smithi093.stdout:================================================================================
...
2021-02-24T19:40:19.042 INFO:teuthology.orchestra.run.smithi093.stdout:Installing dependencies:
...
2021-02-24T19:40:19.043 INFO:teuthology.orchestra.run.smithi093.stdout: gperftools-libs  x86_64  2.8-1.el8  epel        306 k
Updated by Mark Kogan about 3 years ago
(addendum to previous post)
The java_s3tests flow for the reproduction:
git clone git@github.com:ceph/java_s3tests.git
cd java_s3tests
cp -v config.properties.sample config.properties
./bootstrap.sh --path=$(realpath ./gradle)
./gradle/gradle/bin/gradle clean test --rerun-tasks --no-build-cache
Updated by Ken Dreyer about 3 years ago
I've opened https://bugzilla.redhat.com/show_bug.cgi?id=1933792 to track removing gperftools 2.8 from EPEL 8 and going back to the last 2.7 build.
Updated by Ken Dreyer about 3 years ago
Mark and Yuval, tangential to EPEL 8, Fedora 34 has gperftools 2.9. For EPEL 9, it would be helpful to see if this same bug reproduces on Fedora 34.
Updated by Ken Dreyer about 3 years ago
For Fedora 33, I've pushed gperftools-2.8.1 to updates-testing today, https://bodhi.fedoraproject.org/updates/FEDORA-2021-67257daa36
Updated by Dan van der Ster about 3 years ago
It seems this impacted the 15.2.9 build: https://tracker.ceph.com/issues/49618
Updated by David Orman about 3 years ago
Dan van der Ster wrote:
> It seems this impacted the 15.2.9 build: https://tracker.ceph.com/issues/49618
Confirmed as per my update in that ticket; it's part of the Docker images for Ceph, and we are seeing frequent OSD crashes on all of our clusters running 15.2.9 via containers. The containers will need a rebuild/push to Docker Hub so that those impacted (who used cephadm to deploy via containers) can "upgrade" the clusters.
Updated by Josh Durgin about 3 years ago
- Status changed from New to Resolved
15.2.10 containers have tcmalloc 2.7 again, and it's in EPEL now.
Updated by Casey Bodley 2 months ago
I'm hearing reports that RHEL 8.6 is shipping a gperftools 2.8.1 that has these same crashes?
Edit: oops, this isn't the case. A user is seeing crashes with a custom repo that provides 2.8.1.