Bug #43562
Error in tcmalloc (closed)
Description
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 1: (()+0x2d80c1) [0x5650249170c1]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 2: (()+0xf5d0) [0x7f28a99f55d0]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 3: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x4d) [0x7f28a9c2605d]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0xd6) [0x7f28a9c26436]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int)+0x67) [0x7f28a9c292a7]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 6: (()+0xd30c6) [0x7f28aa9b60c6]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 7: (()+0xda18f) [0x7f28aa9bd18f]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 8: (()+0xe6df8) [0x7f28aa9c9df8]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 9: (()+0xe707a) [0x7f28aa9ca07a]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 10: (librados::IoCtxImpl::operate_read(object_t const&, ObjectOperation*, ceph::buffer::list*, int)+0x1d6) [0x7f28aa979506]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 11: (librados::IoCtx::operate(std::string const&, librados::ObjectReadOperation*, ceph::buffer::list*)+0x46) [0x7f28aa93cc56]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 12: (RGWRados::raw_obj_stat(rgw_raw_obj&, unsigned long*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, unsigned long*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, ceph::buffer::list*, RGWObjVersionTracker*)+0x19a) [0x565024a5b5aa]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 13: (RGWCache<RGWRados>::raw_obj_stat(rgw_raw_obj&, unsigned long*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, unsigned long*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, ceph::buffer::list*, RGWObjVersionTracker*)+0x3e4) [0x565024ad3824]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 14: (RGWRados::get_system_obj_state_impl(RGWObjectCtx*, rgw_raw_obj&, RGWRawObjState**, RGWObjVersionTracker*)+0x225) [0x565024a71c95]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 15: (RGWRados::get_system_obj_state(RGWObjectCtx*, rgw_raw_obj&, RGWRawObjState**, RGWObjVersionTracker*)+0x34) [0x565024a72144]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 16: (RGWRados::stat_system_obj(RGWObjectCtx&, RGWRados::SystemObject::Read::GetObjState&, rgw_raw_obj&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, unsigned long*, RGWObjVersionTracker*)+0x5a) [0x565024a721da]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 17: (RGWRados::SystemObject::Read::stat(RGWObjVersionTracker*)+0x38) [0x565024a43028]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 18: (rgw_get_system_obj(RGWRados*, RGWObjectCtx&, rgw_pool const&, std::string const&, ceph::buffer::list&, RGWObjVersionTracker*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x1e5) [0x565024b6f215]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 19: (RGWRados::get_bucket_entrypoint_info(RGWObjectCtx&, std::string const&, std::string const&, RGWBucketEntryPoint&, RGWObjVersionTracker*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, rgw_cache_entry_info*, boost::optional<obj_version>)+0x143) [0x565024a7e6a3]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 20: (RGWRados::_get_bucket_info(RGWObjectCtx&, std::string const&, std::string const&, RGWBucketInfo&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*, boost::optional<obj_version>)+0x77b) [0x565024a7faab]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 21: (RGWRados::get_bucket_info(RGWObjectCtx&, std::string const&, std::string const&, RGWBucketInfo&, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x32) [0x565024a80692]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 22: (RGWBucket::init(RGWRados*, RGWBucketAdminOpState&, std::string*, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >*)+0x303) [0x56502492aa93]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 23: (RGWBucketAdminOp::info(RGWRados*, RGWBucketAdminOpState&, RGWFormatterFlusher&)+0x87) [0x565024932fb7]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 24: (RGWOp_Bucket_Info::execute()+0x281) [0x565024ad9291]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 25: (rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)+0x188) [0x565024a326c8]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 26: (process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, int*)+0xb88) [0x565024a33488]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 27: (()+0x264623) [0x5650248a3623]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 28: (()+0x2652db) [0x5650248a42db]
Jan 13 10:14:56 LNVERDAGW1611 radosgw[160278]: 29: (make_fcontext()+0x2f) [0x565024ca829f]
Jan 13 10:15:01 LNVERDAGW1611 CROND[84079]: (root) CMD (/root/bin/ckdm.sh)
Jan 13 10:15:01 LNVERDAGW1611 CROND[84080]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 13 10:16:01 LNVERDAGW1611 CROND[84190]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 13 10:16:01 LNVERDAGW1611 CROND[84191]: (root) CMD (/root/bin/ckdm.sh)
Jan 13 10:17:01 LNVERDAGW1611 CROND[84293]: (root) CMD (/root/bin/ckdm.sh)
Jan 13 10:17:01 LNVERDAGW1611 CROND[84294]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 13 10:18:01 LNVERDAGW1611 CROND[84397]: (root) CMD (/root/bin/ckdm.sh)
Jan 13 10:18:01 LNVERDAGW1611 CROND[84398]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Jan 13 10:18:03 LNVERDAGW1611 radosgw[160278]: *** Caught signal (Segmentation fault) **
Jan 13 10:18:03 LNVERDAGW1611 radosgw[160278]: in thread 7f256e403700 thread_name:radosgw
Jan 13 10:18:03 LNVERDAGW1611 kernel: traps: radosgw[161906] general protection ip:7f28a9c38634 sp:56503b0709e0 error:0 in libtcmalloc.so.4.4.5[7f28a9c02000+46000]
Updated by Casey Bodley over 4 years ago
Does this crash reproduce every time we get this RGWOp_Bucket_Info request?
Updated by Ilsoo Byun over 4 years ago
tcmalloc errors frequently occur when issuing the 'bucket stats' command.
I think there is a memory corruption somewhere.
The location where the error occurred changes each time.
Updated by Ilsoo Byun over 4 years ago
When getting the stats of a bucket with 3061 shards (refer to the code below), the length of the 'max_marker' string exceeds 80k characters.
In particular, the segmentation fault occurred in environments where the rgw daemon is configured to use thousands of threads. I think a stack overflow has occurred.
I solved this by using a temporary heap allocation instead of the stack buffer.
void JSONFormatter::print_quoted_string(boost::string_view s)
{
  int len = escape_json_attr_len(s.data(), s.size());
  char escaped[len];
  escape_json_attr(s.data(), s.size(), escaped);
  m_ss << '\"' << escaped << '\"';
}
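For reference, a minimal sketch of the heap-based workaround described above, reusing the same escape helpers as the original snippet (this is only a sketch under those assumptions, not necessarily the actual patch):

void JSONFormatter::print_quoted_string(boost::string_view s)
{
  int len = escape_json_attr_len(s.data(), s.size());
  // allocate the escape buffer on the heap: with ~80k-character markers and
  // thousands of request threads, a stack VLA of this size can overflow the
  // per-thread stack (needs #include <memory> for std::unique_ptr)
  std::unique_ptr<char[]> escaped(new char[len]);
  escape_json_attr(s.data(), s.size(), escaped.get());
  m_ss << '\"' << escaped.get() << '\"';
}

The output is unchanged; the only difference is that the buffer whose size scales with the marker length no longer lives on the request thread's stack.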
Updated by Casey Bodley over 4 years ago
- Status changed from New to In Progress
- Assignee set to Mark Kogan
Updated by Mark Kogan over 4 years ago
Does not reproduce on master with beast.
reproducer flow:
create a bucket with 2000000 objects:
~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t 20 -b 1 -n 20000000 -m cxip -bp hsb001 |& tee hsbench.log |
reshard to 11621 shards so that s.length() = 81857 chars:
radosgw-admin bucket reshard --bucket=hsb001000000000000 --num-shards=11621 --yes-i-really-mean-it
create a script to perform admin API requests:
vim ./usageinfo.sh
#!/bin/bash
token=b2345678901234567890 ## USER_TOKEN
secret=b234567890123456789012345678901234567890 ## USER_SECRET
query=$1
bucket=$2
query3="&uid="
query4="&bucket="
#query2=admin/usage
query2=admin/bucket
date=$(date -Ru)
header="GET\n\n\n${date}\n/${query2}"
sig=$(echo -en ${header} | openssl sha1 -hmac ${secret} -binary | base64)
curl -v -H "Date: ${date}" -H "Authorization: AWS ${token}:${sig}" -L -X GET "http://127.0.0.1:8000/${query2}?format=json${query3}${query}${query4}${bucket}&stats=true" -H "Host: 127.0.0.1:8000"
:x
execute hundreds of requests in parallel:
for i in {1..800}; do ./usageinfo.sh cosbench hsb001000000000000 & done
The issue does not reproduce.
Will try civetweb and thousands of requests.
Updated by Mark Kogan over 4 years ago
Not able to reproduce with civetweb and thousands of concurrent ops:
ps -ef | grep -c curl
2178
Some questions regarding reproducing:
1. Please provide: ceph --version
2. Please attach the ceph.conf so we can try to reproduce with the same thread count and memory settings.
3. Please tell us which OS (CentOS/Ubuntu/SUSE?) and which version.
4. And please provide the tcmalloc version (rpm -qa | grep gperftools, or the apt/zypper equivalent)
Thank you
Updated by Ilsoo Byun over 4 years ago
1. ceph version 12.2.12-el7
2. attached below
3. CentOS Linux release 7.6.1810 (Core)
4. gperftools-libs-2.6.1-1.el7.x86_64
I also couldn't reproduce it in the test environment.
I think this situation appears when memory usage is at its maximum.
[client.rgw.***]
host = ***
keyring = /var/lib/ceph/radosgw/ceph-rgw.***/keyring
log file = /var/log/ceph/ceph-rgw-***.log
rgw frontends = beast endpoint=0.0.0.0:7480

[client.rgw.***]
rgw dns name = ***
rgw dns s3website name = ***
rgw dynamic resharding = false
rgw enable apis = s3,s3website,admin
rgw enable static website = true
rgw enable usage log = false
rgw keystone accepted roles = Member, _member_, power_user, guest, user, admin, member, project_admin, dns_admin, net_admin, storage_admin
rgw keystone admin domain = default
rgw keystone admin project = admin
rgw keystone admin token = ***
rgw keystone api version = 3
rgw keystone revocation interval = 0
rgw keystone token cache size = 10000
rgw keystone url = ***
rgw lifecycle work time = 00:00-24:00
rgw resolve cname = true
rgw s3 auth use keystone = true
rgw thread pool size = 4096
objecter inflight op bytes = 2147483648
objecter inflight ops = 24576
rgw override bucket index max shards = 131

# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
cluster network = ***
fsid = ***
mon host = ***
public network = ***
Updated by Mark Kogan over 4 years ago
Thank you very much for the information.
Tested using the provided ceph.conf params.
Does not reproduce on master, but does reproduce on:
radosgw --version
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)

*** Caught signal (Segmentation fault) **
 in thread 7f28e867c700 thread_name:fn-radosclient
 ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
 1: /mnt/osd--host/src-git/ceph--upstream--master/build/bin/radosgw() [0x6d73e7]
 2: (()+0x14b20) [0x7f28f12c6b20]
 3: (Mutex::Lock(bool)+0x4c) [0x7f28f158452c]
 4: (BucketIndexAioManager::do_completion(int)+0x39) [0xab2ae9]
 5: (BucketIndexAioManager::bucket_index_op_completion_cb(void*, void*)+0x15) [0xabc545]
 6: (librados::C_AioComplete::finish(int)+0x3b) [0x7f28f9efbddb]
 7: (()+0x6261e) [0x7f28f9ed761e]
 8: (Finisher::finisher_thread_entry()+0x4e8) [0x7f28f15a7268]
 9: (Thread::entry_wrapper()+0xcc) [0x7f28f182862c]
 10: (()+0x94e2) [0x7f28f12bb4e2]
 11: (clone()+0x43) [0x7f28f08b46d3]
2020-02-06 16:22:45.418673 7f28e867c700 -1 *** Caught signal (Segmentation fault) ** in thread 7f28e867c700 thread_name:fn-radosclient
issue 2700 requests (running 3 times):
for i in {1..900}; do ./usageinfo.sh cosbench b001b000000000000 & done
(takes about an hour for the seg fault to occur)
Next, will re-test master with ~8000 requests and update.
Updated by Mark Kogan over 4 years ago
On master, the issue does not reproduce with an 8000-request load.
Noting that the culprit stack allocation in the JSONFormatter::print_quoted_string() function (as per comment #3) has been replaced with json_stream_escaper() in master:
void JSONFormatter::print_quoted_string(std::string_view s)
{
  m_ss << '\"' << json_stream_escaper(s) << '\"';
}
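For intuition only, one way such an escaper can avoid the large stack buffer is to build the escaped output in a heap-backed std::string (or write it straight to the output stream). The following is an illustrative sketch under that assumption, not the actual Ceph json_stream_escaper:

#include <string>
#include <string_view>

// illustrative sketch: escape a string for JSON into a heap-backed
// std::string, so no buffer proportional to the input length is ever
// placed on the stack
static std::string json_escape_sketch(std::string_view s)
{
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    switch (c) {
    case '"':  out += "\\\""; break;
    case '\\': out += "\\\\"; break;
    case '\n': out += "\\n";  break;
    case '\t': out += "\\t";  break;
    default:   out += c;      break;
    }
  }
  return out;
}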
Updated by Mark Kogan over 4 years ago
- Pull request ID changed from 33257 to 33259
Updated by Casey Bodley about 4 years ago
- Status changed from In Progress to Won't Fix
fixed in all supported ceph releases, closing
Updated by Ilsoo Byun about 4 years ago
I was recently contacted by the security team, and they said that there is a security issue in this post. Could you remove the attached log in the second reply, the one that starts with "tcmalloc errors frequently occur ..."? I do not have permission to edit the post.