Bug #49559: libcephfs: test termination "what(): Too many open files" - CephFS - Ceph

Actions

Copy link

Bug #49559

closed

libcephfs: test termination "what(): Too many open files"

Added by Patrick Donnelly about 3 years ago. Updated almost 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

Jeff Layton

Category:

Target version:

Ceph - v17.0.0

% Done:

Source:

Q/A

Tags:

Backport:

pacific,octopus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

qa-suite

Labels (FS):

qa-failure

Pull request ID:

40192

Crash signature (v1):

Crash signature (v2):

Description

2021-02-27T03:49:22.989 INFO:tasks.workunit.client.0.smithi098.stdout:[ RUN      ] LibCephFS.ShutdownRace
2021-02-27T03:49:33.959 INFO:tasks.workunit.client.0.smithi098.stderr:terminate called after throwing an instance of 'std::system_error'
2021-02-27T03:49:37.611 INFO:tasks.workunit.client.0.smithi098.stderr:  what():  Too many open files
2021-02-27T03:49:37.612 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:49:37.614 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:09.499 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:09.503 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:40.189 DEBUG:teuthology.orchestra.run.smithi098:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:40.193 DEBUG:teuthology.orchestra.run.smithi192:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2021-02-27T03:50:52.141 DEBUG:teuthology.orchestra.run:got remote process result: 134
2021-02-27T03:50:52.143 INFO:tasks.workunit.client.0.smithi098.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh: line 3: 42514 Aborted                 (core dumped) ceph_test_libcephfs
2021-02-27T03:50:52.143 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

From: /ceph/teuthology-archive/pdonnell-2021-02-26_23:40:39-fs-wip-pdonnell-testing-20210226.181017-distro-basic-smithi/5917531/teuthology.log

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Xiubo Li about 3 years ago

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

Actions

Copy link

Updated by Patrick Donnelly about 3 years ago

Status changed from New to Triaged
Assignee set to Jeff Layton

Actions

Copy link

Updated by Jeff Layton about 3 years ago

Xiubo Li wrote:

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

On my box, the hard limit for open files is 1048576. It seems sort of crazy that we end up with 8k fds per thread. I have to wonder if there is a fd leak or something.

Does teuthology have a lower rlim_max for this? It might be good in the interim to change this code to dump out what the RLIMIT_NOFILE value is so we can sanity check this.

Actions

Copy link

Updated by Patrick Donnelly about 3 years ago

Jeff Layton wrote:

Xiubo Li wrote:

IMO, for this we can lower down the concurent threads 128 --> 32 and can try it serveral times. From my test without Jeff's shutdown race fixing patches, it is very easy to reproduce with even 10 concurent threads.

On my box, the hard limit for open files is 1048576. It seems sort of crazy that we end up with 8k fds per thread. I have to wonder if there is a fd leak or something.

That's possible.

Does teuthology have a lower rlim_max for this? It might be good in the interim to change this code to dump out what the RLIMIT_NOFILE value is so we can sanity check this.

This and perhaps a dump of `/proc/self/fd` so we can see what fds are in use and how they are growing (Detect a leak).

Actions

Copy link

Updated by Jeff Layton about 3 years ago

Logging the RLIMIT_NOFILE we set should be no problem.

It may be tough to get a file descriptor in the same process to open /proc/self/fd when this occurs. This is also occurring via exception, AFAICT, so we don't really know what the testcase was doing when it triggered.

Do we get coredumps when these failures occur? We might be better off opening those up in gdb and seeing what was going on.

Actions

Copy link

Updated by Jeff Layton about 3 years ago

We still have the coredump from this test failure, but the x86_64 binaries have been reaped so we can't analyze it. I've proposed a PR to make the test log the RLIMIT_NOFILE before we start the test, but I don't really expect it to be smaller than 1M.

What we'll probably need to do is just wait for this to happen again, and take a look at the open file table in the coredump. That might give us a clearer picture of why this keeps happening.

Actions

Copy link

Updated by Xiubo Li about 3 years ago

It seems the tests will fire many event works, which will open many fds, the last issue about this was cause by the event work code.

Actions

Copy link

Updated by Jeff Layton about 3 years ago

Xiubo Li wrote:

It seems the tests will fire many event works, which will open many fds, the last issue about this was cause by the event work code.

Maybe that is it. I haven't looked at the event handling code in detail.

I wonder too if it's possible that some fds are being lazily cleaned up after ceph_shutdown() is called, and so we end up with a bunch of unused fds still open after several loop iterations.

Debugging this is tricky since we can't count on being able to open new files from the process once this happens. I looked at the core, but getting proper debuginfo from the testcase is is hard since that's built during the test. We'd need a separate watchdog process to run lsof on the testcase when this occurs.

Reducing the number of threads sounds like a reasonable thing to do for now. I'll also note that I tried to run this test in a podman container and I hit cgroup thread limits. I'll start with 32 threads.

Actions

Copy link

Updated by Patrick Donnelly about 3 years ago

Status changed from Triaged to Pending Backport

Actions

Copy link

#10

Updated by Backport Bot about 3 years ago

Copied to Backport #49934: octopus: libcephfs: test termination "what(): Too many open files" added

Actions

Copy link

#11

Updated by Backport Bot about 3 years ago

Copied to Backport #49935: pacific: libcephfs: test termination "what(): Too many open files" added

Actions

Copy link

#12

Updated by Patrick Donnelly about 3 years ago

Pull request ID set to 40192

Actions

Copy link

#13

Updated by Loïc Dachary almost 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #49559

libcephfs: test termination "what(): Too many open files"

Updated by Xiubo Li about 3 years ago

Updated by Patrick Donnelly about 3 years ago

Updated by Jeff Layton about 3 years ago

Updated by Patrick Donnelly about 3 years ago

Updated by Jeff Layton about 3 years ago

Updated by Jeff Layton about 3 years ago

Updated by Xiubo Li about 3 years ago

Updated by Jeff Layton about 3 years ago

Updated by Patrick Donnelly about 3 years ago

Updated by Backport Bot about 3 years ago

Updated by Backport Bot about 3 years ago

Updated by Patrick Donnelly about 3 years ago

Updated by Loïc Dachary almost 3 years ago