Bug #45829

fs: ceph_test_libcephfs abort in TestUtime

Added by Patrick Donnelly about 1 month ago. Updated 26 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
tools
Labels (FS):
Pull request ID:
Crash signature:

Description

2020-06-02T00:39:35.510 INFO:tasks.workunit.client.0.smithi154.stdout:[ RUN      ] LibCephFS.TestUtime
2020-06-02T00:39:46.460 INFO:teuthology.orchestra.run.smithi068:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-02T00:39:46.483 INFO:teuthology.orchestra.run.smithi154:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-02T00:39:58.368 INFO:tasks.workunit.client.0.smithi154.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh: line 3: 31546 Aborted                 (core dumped) ceph_test_libcephfs
2020-06-02T00:39:58.369 DEBUG:teuthology.orchestra.run:got remote process result: 134

From: /ceph/teuthology-archive/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525/teuthology.log

Core dump is:

pdonnell@teuthology /a/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525$ file ./remote/smithi154/coredump/1591058375.31546.core
./remote/smithi154/coredump/1591058375.31546.core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), too many program headers (4887)
pdonnell@teuthology /a/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525$ realpath !$
realpath ./remote/smithi154/coredump/1591058375.31546.core
/ceph/teuthology-archive/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525/remote/smithi154/coredump/1591058375.31546.core

I haven't been able to load the coredump in gdb. I get this error:

pdonnell@senta03 ~$ sudo ceph/src/script/ceph-debug-docker.sh wip-pdonnell-testing-20200601.182437 centos:8
branch: wip-pdonnell-testing-20200601.182437
sha1: latest
env: centos:8
/tmp/tmp.hj1iEtHLnL /home/pdonnell
docker build --tag pdonnell:ceph-ci-wip-pdonnell-testing-20200601.182437-latest-centos-8 .
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
STEP 1: FROM centos:8
STEP 2: WORKDIR /root
--> Using cache f6af401813e10a57b4b790eab59a5fd5d2164ba54ec6975640cc9f17fc0ce88e
STEP 3: RUN yum update -y &&     yum install -y epel-release wget psmisc ca-certificates gdb
--> Using cache ef44b584002a001343bf93c50f4a830645a4ff14d00699c7901a0eeb3696bd9c
STEP 4: RUN wget -O /etc/yum.repos.d/ceph-dev.repo https://shaman.ceph.com/api/repos/ceph/wip-pdonnell-testing-20200601.182437/latest/centos/8/repo &&     yum clean all &&     yum upgrade -y
--> Using cache 9d6822b69a298960d5b05460549e0ed25e7201905a19cc9db9315d51f00319bc
STEP 5: COMMIT pdonnell:ceph-ci-wip-pdonnell-testing-20200601.182437-latest-centos-8
9d6822b69a298960d5b05460549e0ed25e7201905a19cc9db9315d51f00319bc

real    0m2.825s
user    0m0.236s
sys     0m0.082s
/home/pdonnell
built image pdonnell:ceph-ci-wip-pdonnell-testing-20200601.182437-latest-centos-8
sudo docker run --privileged -ti -v /ceph:/ceph:ro pdonnell:ceph-ci-wip-pdonnell-testing-20200601.182437-latest-centos-8
Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
[root@264a2f98e22b ~]# gdb -q /ceph/teuthology-archive/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525/remote/smithi154/coredump/1591058375.31546.core
"/ceph/teuthology-archive/pdonnell-2020-06-01_22:23:46-fs-wip-pdonnell-testing-20200601.182437-distro-basic-smithi/5110525/remote/smithi154/coredump/1591058375.31546.core": not in executable format: Operation not permitted
(gdb) 

I couldn't find any helpful information on why that would be. It may be because the core has too many ELF program headers.

I don't think it's related to https://github.com/ceph/ceph/pull/26004

History

#1 Updated by Patrick Donnelly about 1 month ago

Another occurrence in a different master test branch:

/ceph/teuthology-archive/pdonnell-2020-06-02_19:51:19-fs-wip-pdonnell-testing-20200602.153347-distro-basic-smithi/5114800/teuthology.log

without #26004. It looks very reproducible: http://pulpito.ceph.com/pdonnell-2020-06-02_19:51:19-fs-wip-pdonnell-testing-20200602.153347-distro-basic-smithi/

#2 Updated by Patrick Donnelly about 1 month ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li
  • Priority changed from Normal to High
  • Component(FS) tools added

#3 Updated by Xiubo Li about 1 month ago

  • Status changed from Triaged to In Progress

I can access the lab now and will continue working on this.

#4 Updated by Xiubo Li 28 days ago

From the teuthology.log of one of my test cases, we can see that it aborts during mount(), in create_file_event() when adding the event:

2020-06-15T05:50:42.948 INFO:tasks.workunit.client.0.smithi200.stdout:[ RUN      ] LibCephFS.TestUtime
2020-06-15T05:50:42.960 INFO:tasks.workunit.client.0.smithi200.stderr:/build/ceph-16.0.0-2443-g3ecd7ea/src/msg/async/Event.cc: In function 'int EventCenter::create_file_event(int, int, EventCallbackRef)' thread 7f8ce57fa700 time 2020-06-15T05:50:42.957161+0000
2020-06-15T05:50:42.961 INFO:tasks.workunit.client.0.smithi200.stderr:/build/ceph-16.0.0-2443-g3ecd7ea/src/msg/async/Event.cc: 239: ceph_abort_msg("BUG!")
2020-06-15T05:50:42.962 INFO:tasks.workunit.client.0.smithi200.stderr: ceph version 16.0.0-2443-g3ecd7ea (3ecd7ea32af62d2ccc433d77c2e4f44ea29c6d99) pacific (dev)
2020-06-15T05:50:42.962 INFO:tasks.workunit.client.0.smithi200.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7f8cf3c85c08]
2020-06-15T05:50:42.962 INFO:tasks.workunit.client.0.smithi200.stderr: 2: (EventCenter::create_file_event(int, int, EventCallback*)+0xbaf) [0x7f8cf3f719cf]
2020-06-15T05:50:42.962 INFO:tasks.workunit.client.0.smithi200.stderr: 3: (EventCenter::set_owner()+0x3c8) [0x7f8cf3f71e08]
2020-06-15T05:50:42.963 INFO:tasks.workunit.client.0.smithi200.stderr: 4: (()+0x564f55) [0x7f8cf3f75f55]
2020-06-15T05:50:42.963 INFO:tasks.workunit.client.0.smithi200.stderr: 5: (()+0xbd6df) [0x7f8cf2fc26df]
2020-06-15T05:50:42.963 INFO:tasks.workunit.client.0.smithi200.stderr: 6: (()+0x76db) [0x7f8cf349d6db]
2020-06-15T05:50:42.963 INFO:tasks.workunit.client.0.smithi200.stderr: 7: (clone()+0x3f) [0x7f8cf2a1d88f]
2020-06-15T05:50:59.241 INFO:teuthology.orchestra.run.smithi092:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-15T05:50:59.244 INFO:teuthology.orchestra.run.smithi200:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-15T05:51:26.489 INFO:tasks.workunit.client.0.smithi200.stderr:Aborted (core dumped)
2020-06-15T05:51:26.491 DEBUG:teuthology.orchestra.run:got remote process result: 134
2020-06-15T05:51:26.492 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

The link: http://pulpito.ceph.com/xiubli-2020-06-15_05:16:00-fs:basic_workload-wip-xiubli-fs-testing-2020-06-12-1613-distro-basic-smithi/5148868

#5 Updated by Xiubo Li 28 days ago

http://pulpito.ceph.com/xiubli-2020-06-15_05:16:00-fs:basic_workload-wip-xiubli-fs-testing-2020-06-12-1613-distro-basic-smithi/5148913

For this test case it crashed in ShutdownRace instead:

2020-06-15T05:55:19.089 INFO:tasks.workunit.client.0.smithi099.stdout:[ RUN      ] LibCephFS.ShutdownRace
2020-06-15T05:55:28.584 INFO:teuthology.orchestra.run.smithi052:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-15T05:55:28.588 INFO:teuthology.orchestra.run.smithi099:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-15T05:55:29.893 DEBUG:teuthology.orchestra.run:got remote process result: 141
2020-06-15T05:55:29.893 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...
2020-06-15T05:55:29.894 INFO:teuthology.orchestra.run.smithi099:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-06-15T05:55:30.072 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-xiubli-fs-testing-2020-06-12-1613/qa/tasks/workunit.py", line 127, in task
    coverage_and_limits=not config.get('no_coverage_and_limits', None))
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 87, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 101, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 37, in resurrect_traceback
    reraise(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 24, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-xiubli-fs-testing-2020-06-12-1613/qa/tasks/workunit.py", line 415, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 206, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 475, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test libcephfs/test.sh) on smithi099 with status 141: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=3ecd7ea32af62d2ccc433d77c2e4f44ea29c6d99 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh'

#6 Updated by Xiubo Li 27 days ago

http://pulpito.ceph.com/xiubli-2020-06-16_14:44:06-fs:basic_workload-wip-fs-ulimit-2020-06-16-1511-distro-basic-smithi/5153653/

The fd is used before it is initialized:


2020-06-16T15:07:35.708 INFO:tasks.workunit.client.0.smithi205.stdout:[ RUN      ] LibCephFS.TestUtime
2020-06-16T15:07:35.712 INFO:tasks.workunit.client.0.smithi205.stderr:lxb ---------- 1
2020-06-16T15:07:35.715 INFO:tasks.workunit.client.0.smithi205.stderr:lxb ---------- 2
2020-06-16T15:07:35.718 INFO:tasks.workunit.client.0.smithi205.stderr:lxb ---------- 3
2020-06-16T15:07:35.718 INFO:tasks.workunit.client.0.smithi205.stderr:lxb ---------- 4
2020-06-16T15:07:35.727 INFO:tasks.workunit.client.0.smithi205.stderr:create_file_eventlxb add event failed, ret=-9 fd=-1 mask=1 original mask is 7536755
2020-06-16T15:07:41.603 INFO:teuthology.orchestra.run.smithi089:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-16T15:07:41.608 INFO:teuthology.orchestra.run.smithi205:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-16T15:08:13.859 INFO:teuthology.orchestra.run.smithi089:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-16T15:08:13.875 INFO:teuthology.orchestra.run.smithi205:> sudo logrotate /etc/logrotate.d/ceph-test.conf
2020-06-16T15:08:22.886 INFO:tasks.workunit.client.0.smithi205.stderr:Segmentation fault (core dumped)
2020-06-16T15:08:22.887 DEBUG:teuthology.orchestra.run:got remote process result: 139
2020-06-16T15:08:22.888 INFO:tasks.workunit:Stopping ['libcephfs/test.sh'] on client.0...

#7 Updated by Xiubo Li 26 days ago

http://pulpito.ceph.com/xiubli-2020-06-17_03:41:52-fs:basic_workload-wip-fs-fderr-2020-06-17-1011-distro-basic-smithi/5156392/

The process has too many open files and has hit the open-file limit:


2020-06-17T04:09:29.434 INFO:tasks.workunit.client.0.smithi038.stdout:[ RUN      ] LibCephFS.TestUtime
2020-06-17T04:09:29.435 INFO:tasks.workunit.client.0.smithi038.stderr:lxb ---------- 1
2020-06-17T04:09:29.435 INFO:tasks.workunit.client.0.smithi038.stderr:lxb ---------- 2
2020-06-17T04:09:29.436 INFO:tasks.workunit.client.0.smithi038.stderr:lxb ---------- 3
2020-06-17T04:09:29.436 INFO:tasks.workunit.client.0.smithi038.stderr:lxb ---------- 4
2020-06-17T04:09:29.657 INFO:tasks.workunit.client.0.smithi038.stderr:init can't create notify pipe: (24) Too many open files
2020-06-17T04:09:29.658 INFO:tasks.workunit.client.0.smithi038.stderr:NetworkStack lxb w->center.init, ret=-24
2020-06-17T04:09:29.658 INFO:tasks.workunit.client.0.smithi038.stderr:NetworkStack lxb w->center.init, ret=-24
2020-06-17T04:09:29.658 INFO:tasks.workunit.client.0.smithi038.stderr:NetworkStack lxb w->center.init, ret=-24
2020-06-17T04:09:29.658 INFO:tasks.workunit.client.0.smithi038.stderr:NetworkStack lxb w->center.init, ret=-24

#8 Updated by Xiubo Li 26 days ago

  • Status changed from In Progress to Fix Under Review

#9 Updated by Kefu Chai 26 days ago

  • Status changed from Fix Under Review to Resolved
