Bug #1805
closed
OSD: fd leak
0%
Description
There's an fd leak in the OSD. It looks like it's probably related to doing lots of OSDMap advancements at once, based on the strace output relevant to opening fds:
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_1/osdmap.9560__0_0A33AA18", O_WRONLY|O_CREAT, 0644) = 324
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_1/inc\\uosdmap.9561__0_A69D3917", O_WRONLY|O_CREAT, 0644) = 325
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_A/osdmap.9561__0_0A33ABA8", O_WRONLY|O_CREAT, 0644) = 326
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_A/inc\\uosdmap.9562__0_A69D3EA7", O_WRONLY|O_CREAT, 0644) = 327
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_7/osdmap.9562__0_0A33AB78", O_WRONLY|O_CREAT, 0644) = 328
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_7/inc\\uosdmap.9563__0_A69D3E77", O_WRONLY|O_CREAT, 0644) = 329
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_0/osdmap.9563__0_0A33A808", O_WRONLY|O_CREAT, 0644) = 330
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_0/inc\\uosdmap.9564__0_A69D3F07", O_WRONLY|O_CREAT, 0644) = 331
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_D/osdmap.9564__0_0A33A9D8", O_WRONLY|O_CREAT, 0644) = 332
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_D/inc\\uosdmap.9565__0_A69D3CD7", O_WRONLY|O_CREAT, 0644) = 333
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_6/osdmap.9565__0_0A33A968", O_WRONLY|O_CREAT, 0644) = 334
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_6/inc\\uosdmap.9566__0_A69D3C67", O_WRONLY|O_CREAT, 0644) = 335
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_3/osdmap.9566__0_0A33AE38", O_WRONLY|O_CREAT, 0644) = 336
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_3/inc\\uosdmap.9567__0_A69D3D37", O_WRONLY|O_CREAT, 0644) = 337
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_C/osdmap.9567__0_0A33AFC8", O_WRONLY|O_CREAT, 0644) = 338
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_C/inc\\uosdmap.9568__0_A69D32C7", O_WRONLY|O_CREAT, 0644) = 339
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_9/osdmap.9568__0_0A33AC98", O_WRONLY|O_CREAT, 0644) = 340
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_9/inc\\uosdmap.9569__0_A69D3397", O_WRONLY|O_CREAT, 0644) = 341
8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_2/osdmap.9569__0_0A33AC28", O_WRONLY|O_CREAT, 0644) = 342
8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_F/inc\\uosdmap.9570__0_A69D30F7", O_WRONLY|O_CREAT, 0644) = 343
But it could be something else.
Updated by Greg Farnum over 12 years ago
- Status changed from In Progress to Need More Info
- Assignee deleted (Greg Farnum)
Sigh. It appears that I didn't manage to gather the correlated data that I thought I had. I audited who uses fds in the code base, checked over the strace logs that I have, and correlated those against the filestore logs. Everything that shows up there looks fine, and the large jumps in descriptor numbers aren't being logged anywhere, which points to the messenger. I've turned up a small piece of the messenger debugging so we can at least see socket allocation if this occurs somewhere useful in the future.
And if we're lucky it'll get dealt with by handling #1803.
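For anyone repeating this kind of correlation, the following is a minimal sketch of how jumps in descriptor numbers might be pulled out of an strace log. It is not the tooling actually used here. The function names `parse_open_calls` and `find_fd_jumps` are made up for illustration, the regex assumes lines shaped like the open() entries in the description, and the third sample line (fd 930) is hypothetical. A jump only suggests that descriptors in between were taken by something not shown in the filtered trace, such as sockets.

```python
import re

# Matches strace lines shaped like:
#   8757 open("/some/path", O_WRONLY|O_CREAT, 0644) = 324
OPEN_RE = re.compile(
    r'^\s*(\d+)\s+open\("((?:[^"\\]|\\.)*)",\s*([^)]*)\)\s*=\s*(-?\d+)'
)


def parse_open_calls(lines):
    """Yield (pid, path, fd) for each successful open() line."""
    for line in lines:
        m = OPEN_RE.match(line)
        if m and int(m.group(4)) >= 0:  # skip failed calls (= -1 ...)
            yield int(m.group(1)), m.group(2), int(m.group(4))


def find_fd_jumps(calls, min_gap=2):
    """Return (prev_fd, next_fd) pairs where the fd rises by at least min_gap."""
    jumps = []
    prev = None
    for _pid, _path, fd in calls:
        if prev is not None and fd - prev >= min_gap:
            jumps.append((prev, fd))
        prev = fd
    return jumps


if __name__ == "__main__":
    sample = [
        # First two lines are taken from the trace in the description.
        r'8757 open("/mnt/osd.21/current/meta/DIR_8/DIR_1/osdmap.9560__0_0A33AA18", O_WRONLY|O_CREAT, 0644) = 324',
        r'8757 open("/mnt/osd.21/current/meta/DIR_7/DIR_1/inc\\uosdmap.9561__0_A69D3917", O_WRONLY|O_CREAT, 0644) = 325',
        # Hypothetical line, invented only to demonstrate a jump.
        r'8757 open("/hypothetical/path", O_WRONLY|O_CREAT, 0644) = 930',
    ]
    print(find_fd_jumps(parse_open_calls(sample)))
```

Because the kernel hands out the lowest free descriptor, a gap does not say who holds the missing numbers, only that open() was not the caller that took them.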
Updated by Greg Farnum over 12 years ago
- Status changed from Need More Info to Rejected
I was trying to figure out why the OSD was generating ~600 new sessions in the 4.5 seconds after starting up, when I realized that there had been ~600 radosgw-admin instances running against alexandria attempting to get stats on down PGs. 600 socket descriptors plus the filestore-allowed 512 file descriptors is greater than the OS-allowed 1024 descriptors.
So this illustrates an eventual scaling problem, but is not actually a leak.
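The arithmetic behind that conclusion can be written down as a small check. This is only a sketch: `fd_headroom` is a hypothetical helper, the 600 figure is the approximate session count from the comment above, and `resource.getrlimit` is used purely to show where the per-process limit could be read on a Unix-like system.

```python
import resource


def fd_headroom(sockets, files, limit):
    """Return how many descriptors remain under the limit (negative if over)."""
    return limit - (sockets + files)


if __name__ == "__main__":
    # Numbers from the comment: ~600 sockets, 512 filestore fds, limit 1024.
    print(fd_headroom(600, 512, 1024))

    # Soft limit of the current process, for comparison (Unix only).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(soft)
```

A negative result means the process would run out of descriptors before every socket and file could be opened, which looks like a leak from the outside even though nothing is being leaked.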