Bug #43039: client: shutdown race fails with status 141 - CephFS - Ceph

Actions

Copy link

Bug #43039

closed

client: shutdown race fails with status 141

Added by Patrick Donnelly over 4 years ago. Updated almost 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

Xiubo Li

Category:

Target version:

Ceph - v16.0.0

% Done:

Source:

Q/A

Tags:

Backport:

octopus,nautilus

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

qa-suite

Labels (FS):

Pull request ID:

36515

Crash signature (v1):

Crash signature (v2):

Description

2019-11-26T11:33:09.502 INFO:tasks.workunit.client.0.smithi067.stdout:[ RUN      ] LibCephFS.ShutdownRace
2019-11-26T11:33:11.910 DEBUG:teuthology.orchestra.run:got remote process result: 141
2019-11-26T11:33:11.910 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...
2019-11-26T11:33:11.911 INFO:teuthology.orchestra.run.smithi067:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-11-26T11:33:12.126 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191126.005014/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 87, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 37, in resurrect_traceback
    reraise(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 24, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191126.005014/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 433, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 158, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 180, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test libcephfs/test.sh) on smithi067 with status 141: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bd192f5b5e5ae468f12d6612b5c3fa83e8e7be38 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh'

From: /ceph/teuthology-archive/pdonnell-2019-11-26_04:58:35-fs-wip-pdonnell-testing-20191126.005014-distro-basic-smithi/4543849/teuthology.log

2019-11-26T11:33:11.882+0000 7fb6d9fdb700  2 client.6481 unmounted.
2019-11-26T11:33:11.882+0000 7fb6d9fdb700  1 client.6481 shutdown
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 msgr2=0x7fb7bd6aebc0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 0x7fb7bd6aebc0 secure :-1 s=READY pgs=759 cs=0 l=1 rx=0x7fb75cb9a450 tx=0x7fb75caba6f0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 shutdown_connections
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.22:6834/3196009262,v1:172.21.15.22:6835/3196009262] conn(0x7fb7bcad2ba0 0x7fb7bd75bab0 unknown :-1 s=CLOSED pgs=315 cs=0 l=0 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 0x7fb7bd6aebc0 unknown :-1 s=CLOSED pgs=759 cs=0 l=1 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.22:3301/0,v1:172.21.15.22:6790/0] conn(0x7fb7bcb72f80 0x7fb7bc01efd0 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 shutdown_connections
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 wait complete.
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 >> 172.21.15.67:0/3399245369 conn(0x7fb7bd6bbc00 msgr2=0x7fb7bc01e5e0 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-26T11:33:11.883+0000 7fb6d9fdb700 20 client.6481 trim_cache size 0 max 16384

From: /ceph/teuthology-archive/pdonnell-2019-11-26_04:58:35-fs-wip-pdonnell-testing-20191126.005014-distro-basic-smithi/4543849/remote/smithi067/log/ceph-client.admin.40393.log.gz

Related issues 4 (1 open — 3 closed)

Actions

Copy link

Updated by Jeff Layton over 4 years ago

I think that's probably indicative of a SIGPIPE error, which probably means some task was writing to a pipe that did not have a reader.

Last week, I was getting some failures just after the ShutdownRace test ran. In that test we increase RLIMIT_NOFILE to the max, and then reduce it once the test is done. What I was seeing was that as soon as the old RLIMIT_NOFILE was restored, the next test after ShutdownRace would fail with EMFILE when it attempted to do an open().

I think there may have been some changes to the low-level messaging code that cause it to keep fd's open for longer than it used to. close() calls seem to have been deferred to a different task, and now when we go to reset RLIMIT_NOFILE after the ShutdownRace test it sometimes causes subsequent open calls to fail because the process is still over the limit.

Actions

Copy link

Updated by Jeff Layton over 4 years ago

Assignee changed from Jeff Layton to Patrick Donnelly

(Handing back to Patrick for now)

Is this problem still occurring in teuthology?

Actions

Copy link

Updated by Patrick Donnelly over 4 years ago

Status changed from New to Need More Info
Assignee deleted (~~Patrick Donnelly~~)

Jeff Layton wrote:

(Handing back to Patrick for now)

Is this problem still occurring in teuthology?

Havne't seen it again. Let's see if it shows up again...

Actions

Copy link

Updated by Patrick Donnelly over 4 years ago

Target version deleted (~~v15.0.0~~)

Actions

Copy link

Updated by Patrick Donnelly about 4 years ago

Status changed from Need More Info to New
Assignee set to Jeff Layton
Target version set to v16.0.0
Backport changed from nautilus to octopus,nautilus

/ceph/teuthology-archive/pdonnell-2020-02-15_16:51:06-fs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/4767574/teuthology.log

Actions

Copy link

Hi Jeff, Patrick

From my test locally, this issue is very similar to https://tracker.ceph.com/issues/45829, which dues to the open files reach up to the limit.

Locally I can reproduce the core dump very easily without any change in src/tests/libcephfs/test.c and I have add some log info about the open file limit:

[ RUN      ] LibCephFS.ShutdownRace
rold.rlim_cur:1024 rold.rlim_max:262144

The limitation is 1024 is too small, this is the reason why I can reproduce it so easy.

After setting it to 262144 at the end of `ShutdownRace()` test case, I couldn't reproduce it any more.

In the teuthology case, I think we should increase the limit to a larger number, just like in https://github.com/ceph/teuthology/pull/1508.

This could also explain why the core dump info are different many times.

Actions

Copy link

#12

Updated by Xiubo Li over 3 years ago

The root cause should be the test process reached the open files limit, and the fd returned is -1, or something, when accessing the FileEvent vector by using it, it trigger the following abort trace:

/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = EventCenter::FileEvent; _Alloc = std::allocator<EventCenter::FileEvent>; std::vector<_Tp, _Alloc>::reference = EventCenter::FileEvent&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
Aborted (core dumped)

Since we have enabled `_GLIBCXX_ASSERTIONS`, then we will always see the above Abort logs.

Actions

Copy link

#13

Updated by Xiubo Li over 3 years ago

Pull request ID set to 36515

Actions

Copy link

#14

Updated by Xiubo Li over 3 years ago

Status changed from New to Fix Under Review
Assignee set to Xiubo Li

Actions

Copy link

#15

Updated by Patrick Donnelly over 3 years ago

Status changed from Fix Under Review to Pending Backport

Actions

Copy link

#16

Updated by Nathan Cutler over 3 years ago

Copied to Backport #47020: nautilus: client: shutdown race fails with status 141 added

Actions

Copy link

#17

Updated by Nathan Cutler over 3 years ago

Copied to Backport #47021: octopus: client: shutdown race fails with status 141 added

Actions

Copy link

#18

Updated by Patrick Donnelly almost 3 years ago

Related to Bug #50495: libcephfs: shutdown race fails with status 141 added

Actions

Copy link

#19

Updated by Loïc Dachary almost 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #43039

client: shutdown race fails with status 141

Updated by Jeff Layton over 4 years ago

Updated by Jeff Layton over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Patrick Donnelly over 4 years ago

Updated by Patrick Donnelly about 4 years ago

Updated by Jeff Layton about 4 years ago

Updated by Jeff Layton about 4 years ago

Updated by Patrick Donnelly almost 4 years ago

Updated by Patrick Donnelly almost 4 years ago

Updated by Patrick Donnelly over 3 years ago

Updated by Xiubo Li over 3 years ago

Updated by Xiubo Li over 3 years ago

Updated by Xiubo Li over 3 years ago

Updated by Xiubo Li over 3 years ago

Updated by Patrick Donnelly over 3 years ago

Updated by Nathan Cutler over 3 years ago

Updated by Nathan Cutler over 3 years ago

Updated by Patrick Donnelly almost 3 years ago

Updated by Loïc Dachary almost 3 years ago