Bug #54482

closed

octopus: heap memory leak in radosgw

Added by Tobias Urdin about 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I've been trying to narrow down why radosgw is hogging so much memory, first on 15.2.15. I then found https://github.com/ceph/ceph/pull/43381 and upgraded radosgw to 15.2.16, but the issue is still there.

We are heavy users of the admin REST API to get bucket information, etc.

I can see with valgrind's massif tool that heap usage increases between each snapshot, and the growth accelerates the longer the process runs.

I've attached the massif output as a file; it's the same on 15.2.15 and 15.2.16. I initially thought it was the above bugfix, because this stack sticks out:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 58 478,065,211,953    1,119,989,200    1,058,526,445    61,462,755            0
94.51% (1,058,526,445B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->72.80% (815,392,000B) 0x5629069: rgw::sal::RGWRadosUser::list_buckets(std::string const&, std::string const&, unsigned long, bool, rgw::sal::RGWBucketList&) (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x534B93E: RGWBucketAdminOp::info(rgw::sal::RGWRadosStore*, RGWBucketAdminOpState&, RGWFormatterFlusher&) (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x523561B: RGWOp_Bucket_Info::execute() (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x522DAA1: rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool) (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x52315C7: process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*) (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x519A9F8: RGWCivetWebFrontend::process(mg_connection*) (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x52EBA4D: ??? (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x52ED6EE: ??? (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0x52EDB97: ??? (in /usr/lib64/libradosgw.so.2.0.0)
->72.80% (815,392,000B) 0xFEA5EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
->72.80% (815,392,000B) 0x116409FC: clone (in /usr/lib64/libc-2.17.so)
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

This is what the graph looks like:

    GB
1.051^                                                                       :
     |                                                                    @@:#
     |                                                              @@::::@ :#
     |                                                            :@@ :: :@ :#
     |                                                     @@::::::@@ :: :@ :#
     |                                                    @@ ::: ::@@ :: :@ :#
     |                                                  @@@@ ::: ::@@ :: :@ :#
     |                                                @@@ @@ ::: ::@@ :: :@ :#
     |                                            ::::@ @ @@ ::: ::@@ :: :@ :#
     |                                         @@::: :@ @ @@ ::: ::@@ :: :@ :#
     |                                  @@@::::@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |                               @::@@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |                             @@@: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |                       ::@:::@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |                  :::::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |               ::::: ::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |            :::: ::: ::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |     :::::::: :: ::: ::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |    ::: ::: : :: ::: ::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
     |    ::: ::: : :: ::: ::: @: :@ @: @@ :: :@ ::: :@ @ @@ ::: ::@@ :: :@ :#
   0 +----------------------------------------------------------------------->Gi
     0                                                                   448.2

Number of snapshots: 62
 Detailed snapshots: [18, 21, 22, 24, 25, 29, 34, 35, 36, 37, 43, 44, 48, 57, 58 (peak)]
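As an aside, the snapshot totals from ms_print output like the table above can be extracted programmatically to confirm the growth trend across snapshots. A stdlib-only Python sketch (the regex and the sample row are illustrative, based on the table format shown above):

```python
import re

# Match ms_print snapshot table rows: "  n   time(i)   total(B)   ..."
# Captures the snapshot number, the instruction count, and total(B).
SNAPSHOT_RE = re.compile(r"^\s*(\d+)\s+([\d,]+)\s+([\d,]+)")

def snapshot_totals(ms_print_text):
    """Return [(snapshot_n, total_bytes)] for each snapshot table row."""
    totals = []
    for line in ms_print_text.splitlines():
        m = SNAPSHOT_RE.match(line)
        if m:
            totals.append((int(m.group(1)), int(m.group(3).replace(",", ""))))
    return totals

# Sample row copied from the peak snapshot above.
sample = " 58 478,065,211,953    1,119,989,200    1,058,526,445    61,462,755            0"
print(snapshot_totals(sample))  # → [(58, 1119989200)]
```

Plotting these totals against snapshot numbers gives the same accelerating curve as the ASCII graph above.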

Files

massif-output.txt (428 KB) massif-output.txt Tobias Urdin, 03/07/2022 10:41 AM
Actions #1

Updated by Tobias Urdin about 2 years ago

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 58 478,065,211,953    1,119,989,200    1,058,526,445    61,462,755            0
94.51% (1,058,526,445B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->72.80% (815,392,000B) 0x5629069: rgw::sal::RGWRadosUser::list_buckets(std::string const&, std::string const&, unsigned long, bool, rgw::sal::RGWBucketList&) (in /usr/lib64/libradosgw.so.2.0.0)
| ->72.80% (815,392,000B) 0x534B93E: RGWBucketAdminOp::info(rgw::sal::RGWRadosStore*, RGWBucketAdminOpState&, RGWFormatterFlusher&) (in /usr/lib64/libradosgw.so.2.0.0)
| | ->72.80% (815,392,000B) 0x523561B: RGWOp_Bucket_Info::execute() (in /usr/lib64/libradosgw.so.2.0.0)
| |   ->72.80% (815,392,000B) 0x522DAA1: rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool) (in /usr/lib64/libradosgw.so.2.0.0)
| |     ->72.80% (815,392,000B) 0x52315C7: process_request(rgw::sal::RGWRadosStore*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*) (in /usr/lib64/libradosgw.so.2.0.0)
| |       ->72.80% (815,392,000B) 0x519A9F8: RGWCivetWebFrontend::process(mg_connection*) (in /usr/lib64/libradosgw.so.2.0.0)
| |         ->72.80% (815,392,000B) 0x52EBA4D: ??? (in /usr/lib64/libradosgw.so.2.0.0)
| |           ->72.80% (815,392,000B) 0x52ED6EE: ??? (in /usr/lib64/libradosgw.so.2.0.0)
| |             ->72.80% (815,392,000B) 0x52EDB97: ??? (in /usr/lib64/libradosgw.so.2.0.0)
| |               ->72.80% (815,392,000B) 0xFEA5EA4: start_thread (in /usr/lib64/libpthread-2.17.so)
| |                 ->72.80% (815,392,000B) 0x116409FC: clone (in /usr/lib64/libc-2.17.so)
| |                   
| ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
| 

Actions #2

Updated by Tobias Urdin about 2 years ago

Seems like this is related to the admin API.

Actions #3

Updated by Tobias Urdin about 2 years ago

I should probably mention that we have thousands of buckets that we retrieve statistics for, but we don't expect the memory used for these operations to remain allocated like this.

Actions #4

Updated by Tobias Urdin about 2 years ago

This happens after a while of repeated requests to:
GET /admin/bucket?uid=<uid here>&stats=true
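For reproduction context, the RGW admin ops API requires requests signed with the S3 key of a user that has admin caps. A minimal stdlib-only sketch of AWS signature v2 signing for this endpoint (the credentials are placeholders, and v2 vs. v4 support depends on configuration; this is illustrative, not the exact client we use):

```python
import base64
import hashlib
import hmac
from email.utils import formatdate

# Placeholder credentials -- substitute a radosgw user with admin caps
# (e.g. "buckets=read"). These values are illustrative only.
ACCESS_KEY = "ACCESS"
SECRET_KEY = "SECRET"

def admin_bucket_stats_headers(resource="/admin/bucket"):
    """Build AWS v2 auth headers for GET /admin/bucket?uid=...&stats=true.

    With v2 signing, only the resource path (not the uid/stats query
    parameters) is included in the string-to-sign for this endpoint.
    """
    date = formatdate(usegmt=True)
    string_to_sign = f"GET\n\n\n{date}\n{resource}"
    digest = hmac.new(SECRET_KEY.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return {"Date": date, "Authorization": f"AWS {ACCESS_KEY}:{signature}"}

headers = admin_bucket_stats_headers()
```

The resulting headers accompany a plain HTTP GET to the radosgw endpoint; issuing many of these in a loop against a gateway with thousands of buckets reproduces the growth pattern described above.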

Actions #5

Updated by Tobias Urdin about 2 years ago

This might be solved if we upgrade, I would guess, because of the major refactor in https://github.com/ceph/ceph/commit/99f7c4aa1286edfea6961b92bb44bb8fe22bd599; however, upgrading is not a feasible solution for this issue.

Actions #6

Updated by Casey Bodley about 2 years ago

  • Subject changed from heap memory leak in radosgw to octopus: heap memory leak in radosgw
  • Status changed from New to Fix Under Review
  • Pull request ID set to 45283
Actions #8

Updated by Casey Bodley almost 2 years ago

  • Status changed from Fix Under Review to Resolved