Project

General

Profile

Bug #49559

libcephfs: test termination "what(): Too many open files"

Added by Patrick Donnelly 9 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-02-27T03:49:22.989 INFO:tasks.workunit.client.0.smithi098.stdout:[ RUN      ] LibCephFS.ShutdownRace
2021-02-27T03:49:33.959 INFO:tasks.workunit.client.0.smithi098.stderr:terminate called after throwing an instance of 'std::system_error'
2021-02-27T03:49:37.611 INFO:tasks.workunit.client.0.smithi098.stderr:  what():  Too many open files
2021-02-27T03:49:37.612 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:49:37.614 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:09.499 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:09.503 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:40.189 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:40.193 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:52.141 DEBUG:teuthology.orchestra.run:got remote process result: 134
2021-02-27T03:50:52.143 INFO:tasks.workunit.client.0.smithi098.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh: line 3: 42514 Aborted                 (core dumped) ceph_test_libcephfs
2021-02-27T03:50:52.143 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

From: /ceph/teuthology-archive/pdonnell-2021-02-26_23:40:39-fs-wip-pdonnell-testing-20210226.181017-distro-basic-smithi/5917531/teuthology.log


Related issues

Copied to CephFS - Backport #49934: octopus: libcephfs: test termination "what(): Too many open files" Resolved
Copied to CephFS - Backport #49935: pacific: libcephfs: test termination "what(): Too many open files" Resolved

History

#1 Updated by Xiubo Li 9 months ago

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

#2 Updated by Patrick Donnelly 9 months ago

  • Status changed from New to Triaged
  • Assignee set to Jeff Layton

#3 Updated by Jeff Layton 9 months ago

Xiubo Li wrote:

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

On my box, the hard limit for open files is 1048576. It seems sort of crazy that we end up with 8k fds per thread. I have to wonder if there is a fd leak or something.

Does teuthology have a lower rlim_max for this? It might be good in the interim to change this code to dump out what the RLIMIT_NOFILE value is so we can sanity check this.

#4 Updated by Patrick Donnelly 9 months ago

Jeff Layton wrote:

Xiubo Li wrote:

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

On my box, the hard limit for open files is 1048576. It seems sort of crazy that we end up with 8k fds per thread. I have to wonder if there is a fd leak or something.

That's possible.

Does teuthology have a lower rlim_max for this? It might be good in the interim to change this code to dump out what the RLIMIT_NOFILE value is so we can sanity check this.

This and perhaps a dump of `/proc/self/fd` so we can see what fds are in use and how they are growing (Detect a leak).

#5 Updated by Jeff Layton 9 months ago

Logging the RLIMIT_NOFILE we set should be no problem.

It may be tough to get a file descriptor in the same process to open /proc/self/fd when this occurs. This is also occurring via exception, AFAICT, so we don't really know what the testcase was doing when it triggered.

Do we get coredumps when these failures occur? We might be better off opening those up in gdb and seeing what was going on.

#6 Updated by Jeff Layton 9 months ago

We still have the coredump from this test failure, but the x86_64 binaries have been reaped so we can't analyze it. I've proposed a PR to make the test log the RLIMIT_NOFILE before we start the test, but I don't really expect it to be smaller than 1M.

What we'll probably need to do is just wait for this to happen again, and take a look at the open file table in the coredump. That might give us a clearer picture of why this keeps happening.

#7 Updated by Xiubo Li 9 months ago

It seems the tests will fire many event works, which will open many fds, the last issue about this was cause by the event work code.

#8 Updated by Jeff Layton 9 months ago

Xiubo Li wrote:

It seems the tests will fire many event works, which will open many fds, the last issue about this was cause by the event work code.

Maybe that is it. I haven't looked at the event handling code in detail.

I wonder too if it's possible that some fds are being lazily cleaned up after ceph_shutdown() is called, and so we end up with a bunch of unused fds still open after several loop iterations.

Debugging this is tricky since we can't count on being able to open new files from the process once this happens. I looked at the core, but getting proper debuginfo from the testcase is is hard since that's built during the test. We'd need a separate watchdog process to run lsof on the testcase when this occurs.

Reducing the number of threads sounds like a reasonable thing to do for now. I'll also note that I tried to run this test in a podman container and I hit cgroup thread limits. I'll start with 32 threads.

#9 Updated by Patrick Donnelly 8 months ago

  • Status changed from Triaged to Pending Backport

#10 Updated by Backport Bot 8 months ago

  • Copied to Backport #49934: octopus: libcephfs: test termination "what(): Too many open files" added

#11 Updated by Backport Bot 8 months ago

  • Copied to Backport #49935: pacific: libcephfs: test termination "what(): Too many open files" added

#12 Updated by Patrick Donnelly 8 months ago

  • Pull request ID set to 40192

#13 Updated by Loïc Dachary 7 months ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF