Bug #43039

client: shutdown race fails with status 141

Added by Patrick Donnelly 7 months ago. Updated 3 days ago.

Status:
New
Priority:
High
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
Pull request ID:
Crash signature:

Description

2019-11-26T11:33:09.502 INFO:tasks.workunit.client.0.smithi067.stdout:[ RUN      ] LibCephFS.ShutdownRace
2019-11-26T11:33:11.910 DEBUG:teuthology.orchestra.run:got remote process result: 141
2019-11-26T11:33:11.910 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...
2019-11-26T11:33:11.911 INFO:teuthology.orchestra.run.smithi067:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2019-11-26T11:33:12.126 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191126.005014/qa/tasks/workunit.py", line 123, in task
    timeout=timeout,cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 87, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 37, in resurrect_traceback
    reraise(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 24, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20191126.005014/qa/tasks/workunit.py", line 409, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 198, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 433, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 158, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 180, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test libcephfs/test.sh) on smithi067 with status 141: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=bd192f5b5e5ae468f12d6612b5c3fa83e8e7be38 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh'

From: /ceph/teuthology-archive/pdonnell-2019-11-26_04:58:35-fs-wip-pdonnell-testing-20191126.005014-distro-basic-smithi/4543849/teuthology.log

2019-11-26T11:33:11.882+0000 7fb6d9fdb700  2 client.6481 unmounted.
2019-11-26T11:33:11.882+0000 7fb6d9fdb700  1 client.6481 shutdown
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 msgr2=0x7fb7bd6aebc0 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 0x7fb7bd6aebc0 secure :-1 s=READY pgs=759 cs=0 l=1 rx=0x7fb75cb9a450 tx=0x7fb75caba6f0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 shutdown_connections
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.22:6834/3196009262,v1:172.21.15.22:6835/3196009262] conn(0x7fb7bcad2ba0 0x7fb7bd75bab0 unknown :-1 s=CLOSED pgs=315 cs=0 l=0 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.67:3300/0,v1:172.21.15.67:6789/0] conn(0x7fb7bd749d40 0x7fb7bd6aebc0 unknown :-1 s=CLOSED pgs=759 cs=0 l=1 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 --2- 172.21.15.67:0/3399245369 >> [v2:172.21.15.22:3301/0,v1:172.21.15.22:6790/0] conn(0x7fb7bcb72f80 0x7fb7bc01efd0 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 shutdown_connections
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 wait complete.
2019-11-26T11:33:11.883+0000 7fb6d9fdb700  1 -- 172.21.15.67:0/3399245369 >> 172.21.15.67:0/3399245369 conn(0x7fb7bd6bbc00 msgr2=0x7fb7bc01e5e0 unknown :-1 s=STATE_NONE l=0).mark_down
2019-11-26T11:33:11.883+0000 7fb6d9fdb700 20 client.6481 trim_cache size 0 max 16384

From: /ceph/teuthology-archive/pdonnell-2019-11-26_04:58:35-fs-wip-pdonnell-testing-20191126.005014-distro-basic-smithi/4543849/remote/smithi067/log/ceph-client.admin.40393.log.gz

History

#1 Updated by Jeff Layton 7 months ago

I think that's indicative of SIGPIPE (status 141 is 128 + signal 13), which probably means some task was writing to a pipe that no longer had a reader.
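
A minimal C++ sketch of that failure mode, purely illustrative and not the libcephfs test code:

// sigpipe_demo.cc: illustrative only. Writing to a pipe whose read end is
// closed raises SIGPIPE; with the default disposition the process dies,
// and the invoking shell reports exit status 128 + 13 = 141.
#include <unistd.h>
#include <cstdio>

int main() {
    int fds[2];
    if (pipe(fds) != 0) {
        perror("pipe");
        return 1;
    }
    close(fds[0]);                      // drop the only reader
    ssize_t n = write(fds[1], "x", 1);  // delivers SIGPIPE; process is killed here
    // Only reached if SIGPIPE were ignored; write() would then fail with EPIPE.
    printf("write returned %zd\n", n);
    return 0;
}

Running that from a shell and echoing $? should print 141.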

Last week, I was getting some failures just after the ShutdownRace test ran. In that test we increase RLIMIT_NOFILE to the max and then reduce it once the test is done. What I was seeing was that as soon as the old RLIMIT_NOFILE was restored, the next test after ShutdownRace would fail with EMFILE when it attempted an open().

I think there may have been some changes to the low-level messaging code that cause it to keep fds open longer than it used to. close() calls seem to have been deferred to a different task, so when we reset RLIMIT_NOFILE after the ShutdownRace test, subsequent open() calls sometimes fail because the process is still over the limit.
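
That suspected failure mode can be sketched like this (hypothetical code, not the actual ShutdownRace test): if fds opened while the limit was raised are still open when the old, lower limit is restored, the process sits above RLIMIT_NOFILE and the very next open() fails with EMFILE.

// rlimit_demo.cc: hypothetical sketch of the suspected failure mode;
// not the actual ShutdownRace test code. Assumes a modest soft limit.
#include <sys/resource.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <vector>

int main() {
    rlimit old_lim{};
    getrlimit(RLIMIT_NOFILE, &old_lim);

    // Raise the soft limit to the hard limit (the comment above says the
    // test does this before running).
    rlimit high = old_lim;
    high.rlim_cur = high.rlim_max;
    setrlimit(RLIMIT_NOFILE, &high);

    // Stand-in for fds (e.g. messenger sockets) whose close() is deferred:
    // open more fds than the original soft limit allows.
    std::vector<int> fds;
    for (rlim_t i = 0; i <= old_lim.rlim_cur; ++i) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
            break;
        fds.push_back(fd);
    }

    // Restore the original limit while those fds are still open...
    setrlimit(RLIMIT_NOFILE, &old_lim);

    // ...and the next open() fails with EMFILE, because every fd number
    // below the restored limit is already in use.
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0 && errno == EMFILE)
        printf("open() failed with EMFILE, like the test after ShutdownRace\n");

    for (int f : fds)
        close(f);
    return 0;
}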

#2 Updated by Jeff Layton 7 months ago

  • Assignee changed from Jeff Layton to Patrick Donnelly

(Handing back to Patrick for now)

Is this problem still occurring in teuthology?

#3 Updated by Patrick Donnelly 7 months ago

  • Status changed from New to Need More Info
  • Assignee deleted (Patrick Donnelly)

Jeff Layton wrote:

(Handing back to Patrick for now)

Is this problem still occurring in teuthology?

Haven't seen it again. Let's see if it shows up again...

#4 Updated by Patrick Donnelly 6 months ago

  • Target version deleted (v15.0.0)

#5 Updated by Patrick Donnelly 5 months ago

  • Status changed from Need More Info to New
  • Assignee set to Jeff Layton
  • Target version set to v16.0.0
  • Backport changed from nautilus to octopus,nautilus

/ceph/teuthology-archive/pdonnell-2020-02-15_16:51:06-fs-wip-pdonnell-testing-20200215.033325-distro-basic-smithi/4767574/teuthology.log

#6 Updated by Jeff Layton 5 months ago

I took a look at the logs but there is nothing conclusive there. Again, I suspect that this is a problem down in the low-level messenger code -- we're probably hitting RLIMIT_NOFILE somehow.

#7 Updated by Jeff Layton 3 months ago

  • Status changed from New to Can't reproduce

I haven't heard of this cropping up anymore, and it seemed like more of a problem in low-level message handling code rather than anything in cephfs. Closing.

#8 Updated by Patrick Donnelly 3 days ago

  • Status changed from Can't reproduce to New
  • Assignee deleted (Jeff Layton)

/ceph/teuthology-archive/pdonnell-2020-07-01_06:37:23-fs-wip-pdonnell-testing-20200701.033411-distro-basic-smithi/5193080/teuthology.log
