Bug #38479

multisite: memory growth from RGWCoroutinesStacks on lease errors

Added by Casey Bodley 4 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
Normal
Assignee:
Target version:
-
Start date:
02/25/2019
Due date:
% Done:

0%

Source:
Tags:
multisite
Backport:
luminous mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

When multiple gateways are running in the same zone of a multisite configuration, they use leases to coordinate with each other on sync. While one gateway holds a lease, other gateways continue to poll until it becomes available. Each failed poll attempt leaves behind a spawned coroutine stack, causing memory growth that doesn't get cleaned up until shutdown.
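To illustrate the failure mode, here is a toy model in Python, not the actual RGW code. The names `Stack`, `Parent`, and `poll_lease` are hypothetical and exist only for this sketch. It assumes, as described above, that each failed poll leaves one finished child in the parent's spawned list unless something collects it.

```python
class Stack:
    """Hypothetical stand-in for a spawned coroutine stack (not Ceph's class)."""

    def __init__(self, name):
        self.name = name
        self.done = False


class Parent:
    """Hypothetical parent coroutine that tracks the stacks it spawned."""

    def __init__(self):
        self.spawned = []

    def spawn(self, name):
        stack = Stack(name)
        self.spawned.append(stack)
        return stack

    def collect(self):
        # Drop children that have already finished.
        self.spawned = [s for s in self.spawned if not s.done]


def poll_lease(parent, attempts, collect_after_failure):
    """Simulate `attempts` failed lease polls; return how many stacks remain."""
    for i in range(attempts):
        child = parent.spawn(f"lease-{i}")
        # Another gateway holds the lease, so this attempt fails and finishes.
        child.done = True
        if collect_after_failure:
            parent.collect()
    return len(parent.spawned)


leaked = poll_lease(Parent(), attempts=32, collect_after_failure=False)
steady = poll_lease(Parent(), attempts=32, collect_after_failure=True)
print(leaked, steady)
```

In this model, finished children stay referenced by the parent until they are collected, so the list grows by one per failed poll when nothing collects them.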

~/ceph/build $ bin/ceph daemon run/c2/out/radosgw.8002.asok cr dump | jq .coroutine_managers[0].run_contexts[0].entries[1].ops[0]
{
  "type": "25RGWDataSyncShardControlCR",
  "spawned": [
    "0x55ab5bdb7860",
    "0x55ab5c258780",
    "0x55ab5c3565a0",
    "0x55ab5c357680",
    "0x55ab5c357b30",
    "0x55ab5c364e10",
    "0x55ab5c3650e0",
    "0x55ab5c3652c0",
    "0x55ab5c3654a0",
    "0x55ab5c422e10",
    "0x55ab5c4231d0",
    "0x55ab5c4234a0",
    "0x55ab5c423770",
    "0x55ab5c4914a0",
    "0x55ab5c491770",
    "0x55ab5c491a40",
    "0x55ab5c491d10",
    "0x55ab5c5013b0",
    "0x55ab5c501680",
    "0x55ab5c501950",
    "0x55ab5c501c20",
    "0x55ab5c501ef0",
    "0x55ab5c57bd10",
    "0x55ab5b734960",
    "0x55ab5c5ec5a0",
    "0x55ab5c5ec690",
    "0x55ab5c5ec780",
    "0x55ab5c5ec870",
    "0x55ab5c5ec960",
    "0x55ab5c5eca50",
    "0x55ab5c5ecb40",
    "0x55ab5c5ecc30"
  ]
}
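
Checking each op by hand with jq gets tedious, so a small script can walk the whole dump instead. This is a sketch that assumes the JSON nesting implied by the jq path above (coroutine_managers, run_contexts, entries, ops) and the "type" and "spawned" fields shown in the output. The function name `spawned_counts` and the abbreviated `sample` document are made up for illustration.

```python
import json


def spawned_counts(dump):
    """Yield (path, type, count) for every op in a parsed `cr dump` document.

    `path` is the tuple of indices (manager, run_context, entry, op), matching
    the positions used in the jq expression.
    """
    for m, manager in enumerate(dump.get("coroutine_managers", [])):
        for c, ctx in enumerate(manager.get("run_contexts", [])):
            for e, entry in enumerate(ctx.get("entries", [])):
                for o, op in enumerate(entry.get("ops", [])):
                    # Ops without a "spawned" key count as zero spawned stacks.
                    yield (m, c, e, o), op.get("type"), len(op.get("spawned", []))


# Abbreviated, hand-written sample in the assumed layout (not real output).
sample = json.loads("""
{
  "coroutine_managers": [
    {"run_contexts": [
      {"entries": [
        {"ops": [
          {"type": "25RGWDataSyncShardControlCR",
           "spawned": ["0x55ab5bdb7860", "0x55ab5c258780", "0x55ab5c3565a0"]}
        ]}
      ]}
    ]}
  ]
}
""")

for path, op_type, count in spawned_counts(sample):
    print(path, op_type, count)
```

To run it against a live gateway, save the output of the `cr dump` command to a file and load that file with `json.load` in place of `sample`. Counts that keep rising between runs point at the growth described here.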

https://github.com/ceph/ceph/pull/26639


Related issues

Copied to rgw - Backport #38529: luminous: multisite: memory growth from RGWCoroutinesStacks on lease errors Resolved
Copied to rgw - Backport #38530: mimic: multisite: memory growth from RGWCoroutinesStacks on lease errors Resolved

History

#1 Updated by Casey Bodley 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Need Review

#2 Updated by Casey Bodley 4 months ago

With the fix applied, the polling gateway remains in a steady state of 0 spawned stacks:

~/ceph/build $ bin/ceph daemon run/c2/out/radosgw.8002.asok cr dump | jq .coroutine_managers[1].run_contexts[0].entries[1].ops[0]
{
  "type": "25RGWDataSyncShardControlCR"
}

#3 Updated by Casey Bodley 4 months ago

  • Status changed from Need Review to Pending Backport

#4 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #38529: luminous: multisite: memory growth from RGWCoroutinesStacks on lease errors added

#5 Updated by Nathan Cutler 4 months ago

  • Copied to Backport #38530: mimic: multisite: memory growth from RGWCoroutinesStacks on lease errors added

#6 Updated by Nathan Cutler about 1 month ago

  • Status changed from Pending Backport to Resolved
