Bug #1805

OSD: fd leak

Added by Greg Farnum almost 8 years ago. Updated almost 8 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: OSD
Target version:
Start date: 12/08/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

There's an fd leak in the OSD. Based on the strace output for the calls that open fds, it is probably related to doing lots of OSDMap advancements at once:

8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_1/osdmap.9560__0_0A33AA18", O_WRONLY|O_CREAT, 0644) = 324
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_1/inc\\uosdmap.9561__0_A69D3917", O_WRONLY|O_CREAT, 0644) = 325
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_A/osdmap.9561__0_0A33ABA8", O_WRONLY|O_CREAT, 0644) = 326
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_A/inc\\uosdmap.9562__0_A69D3EA7", O_WRONLY|O_CREAT, 0644) = 327
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_7/osdmap.9562__0_0A33AB78", O_WRONLY|O_CREAT, 0644) = 328
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_7/inc\\uosdmap.9563__0_A69D3E77", O_WRONLY|O_CREAT, 0644) = 329
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_0/osdmap.9563__0_0A33A808", O_WRONLY|O_CREAT, 0644) = 330
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_0/inc\\uosdmap.9564__0_A69D3F07", O_WRONLY|O_CREAT, 0644) = 331
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_D/osdmap.9564__0_0A33A9D8", O_WRONLY|O_CREAT, 0644) = 332
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_D/inc\\uosdmap.9565__0_A69D3CD7", O_WRONLY|O_CREAT, 0644) = 333
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_6/osdmap.9565__0_0A33A968", O_WRONLY|O_CREAT, 0644) = 334
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_6/inc\\uosdmap.9566__0_A69D3C67", O_WRONLY|O_CREAT, 0644) = 335
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_3/osdmap.9566__0_0A33AE38", O_WRONLY|O_CREAT, 0644) = 336
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_3/inc\\uosdmap.9567__0_A69D3D37", O_WRONLY|O_CREAT, 0644) = 337
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_C/osdmap.9567__0_0A33AFC8", O_WRONLY|O_CREAT, 0644) = 338
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_C/inc\\uosdmap.9568__0_A69D32C7", O_WRONLY|O_CREAT, 0644) = 339
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_9/osdmap.9568__0_0A33AC98", O_WRONLY|O_CREAT, 0644) = 340
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_9/inc\\uosdmap.9569__0_A69D3397", O_WRONLY|O_CREAT, 0644) = 341
8757  open("/mnt/osd.21/current/meta/DIR_8/DIR_2/osdmap.9569__0_0A33AC28", O_WRONLY|O_CREAT, 0644) = 342
8757  open("/mnt/osd.21/current/meta/DIR_7/DIR_F/inc\\uosdmap.9570__0_A69D30F7", O_WRONLY|O_CREAT, 0644) = 343

But it could be something else.
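One way to test this hypothesis is to pair every open() in the trace with a later close() on the same descriptor. The following is a minimal sketch, not part of the original investigation: the function name unclosed_fds and the regular expressions are illustrative assumptions, and it only understands the line format shown above plus plain close(N) = 0 lines.

```python
import re

# Matches successful open()/openat() lines in the format shown above, e.g.
#   8757  open("/path", O_WRONLY|O_CREAT, 0644) = 324
# Failed calls (= -1 ...) do not match because the result must be digits only.
OPEN_RE = re.compile(
    r'^\s*\d+\s+open(?:at)?\([^"]*?"(?P<path>(?:[^"\\]|\\.)*)".*\)\s*=\s*(?P<fd>\d+)\s*$'
)
# Matches successful close() lines, e.g.  8757  close(324) = 0
CLOSE_RE = re.compile(r'^\s*\d+\s+close\((?P<fd>\d+)\)\s*=\s*0\s*$')


def unclosed_fds(lines):
    """Return {fd: path} for descriptors opened but never closed in the trace."""
    live = {}
    for line in lines:
        m = OPEN_RE.match(line)
        if m:
            # A reused fd number overwrites the earlier (already closed) entry.
            live[int(m.group("fd"))] = m.group("path")
            continue
        m = CLOSE_RE.match(line)
        if m:
            live.pop(int(m.group("fd")), None)
    return live
```

If the osdmap files are later closed, they drop out of the result and the leak has to be elsewhere; descriptors created by calls other than open() (for example sockets) never appear in it at all.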

History

#1 Updated by Greg Farnum almost 8 years ago

  • Status changed from In Progress to Need More Info
  • Assignee deleted (Greg Farnum)

Sigh. It appears that I didn't manage to gather the correlated data that I thought I had. After auditing who uses fds in the code base, checking over the strace logs that I have, and correlating those with the filestore logs, it looks like every one of those users is fine. The large jumps in descriptor numbers aren't being logged anywhere, which points to the messenger. I've turned up a small piece of the messenger debugging so we can at least see socket allocation if this occurs somewhere useful in the future.

And if we're lucky it'll get dealt with by handling #1803.
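Whether the missing descriptors are sockets can also be checked directly on a live process, without extra logging. The sketch below is an assumption-laden illustration rather than anything used in this ticket: it relies on the Linux /proc/<pid>/fd layout, where each entry is a symlink whose target looks like socket:[inode] for sockets, and the names classify_target and fd_breakdown are made up for the example.

```python
import os
from collections import Counter


def classify_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse descriptor type."""
    if target.startswith("socket:["):
        return "socket"
    if target.startswith("pipe:["):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    return "file"


def fd_breakdown(pid="self"):
    """Count the open descriptors of a process by type (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[classify_target(target)] += 1
    return counts
```

Calling fd_breakdown with the OSD's pid (8757 in the trace above) would show at a glance whether sockets or regular files dominate.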

#2 Updated by Greg Farnum almost 8 years ago

  • Status changed from Need More Info to Rejected

I was trying to figure out why the OSD was generating ~600 new sessions in the 4.5 seconds after starting up, when I realized that there had been ~600 radosgw-admin instances running against alexandria attempting to get stats on down PGs. 600 socket descriptors plus the filestore-allowed 512 file descriptors comes to 1112, which is greater than the OS-allowed 1024 descriptors.
So this illustrates an eventual scaling problem, but is not actually a leak.
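The descriptor budget from this comment can be written out as a small check. This is a minimal sketch using only the figures quoted above (about 600 sockets, a 512-descriptor filestore cap, a 1024-descriptor OS limit); the helper names are hypothetical, and current_fd_usage assumes Linux, where /proc/self/fd lists the calling process's descriptors.

```python
import os
import resource


def fd_headroom(socket_fds, file_fd_cap, soft_limit):
    """Descriptors left under soft_limit; negative means the budget is exceeded."""
    return soft_limit - (socket_fds + file_fd_cap)


def current_fd_usage():
    """Return (open descriptor count, soft RLIMIT_NOFILE) for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return len(os.listdir("/proc/self/fd")), soft


# Figures from the comment above: 600 sockets + 512 files against a limit of 1024.
print(fd_headroom(600, 512, 1024))  # -88
```

A negative result means the process runs out of descriptors under load even though nothing is leaked, which matches the conclusion here.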
