Bug #47842
qa: "fsstress.sh: line 16: 28870 Bus error (core dumped) "$BIN" -d "$T" -l 1 -n 1000 -p 10 -v"
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, qa-suite
Labels (FS):
crash, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
2020-10-09T17:49:16.750 INFO:tasks.workunit.client.0.smithi132.stdout:9/421: symlink d8/l8a 108
2020-10-09T17:49:16.763 INFO:tasks.workunit.client.0.smithi132.stdout:6/950: read - open d3/d18/d5d/dcd/fe5 failed 108
2020-10-09T17:49:16.791 INFO:tasks.workunit.client.0.smithi132.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh: line 16: 28870 Bus error (core dumped) "$BIN" -d "$T" -l 1 -n 1000 -p 10 -v
2020-10-09T17:49:16.792 DEBUG:teuthology.orchestra.run:got remote process result: 135
2020-10-09T17:49:16.811 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
2020-10-09T17:49:16.812 INFO:teuthology.orchestra.run.smithi132:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-10-09T17:49:17.044 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/fsstress.sh) on smithi132 with status 135: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e783bcc4ed94e312ec724570ede3e64902f921b4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
2020-10-09T17:49:17.083 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/sentry/sepia/?query=66dccb8cc8bc418ebf4439b82620b30e
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/fsstress.sh) on smithi132 with status 135: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e783bcc4ed94e312ec724570ede3e64902f921b4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
2020-10-09T17:49:17.084 DEBUG:teuthology.run_tasks:Unwinding manager ceph-fuse
2020-10-09T17:49:17.096 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients...
2020-10-09T17:49:17.096 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi132.front.sepia.ceph.com...
2020-10-09T17:49:17.097 INFO:teuthology.orchestra.run:Running command with timeout 1800
2020-10-09T17:49:17.097 INFO:teuthology.orchestra.run.smithi132:> (cd /home/ubuntu/cephtest && exec sudo fusermount -u /home/ubuntu/cephtest/mnt.0)
2020-10-09T17:49:17.151 INFO:teuthology.orchestra.run:waiting for 900
2020-10-09T17:49:17.152 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:17.210 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:ceph-fuse[2020-10-09T17:49:17.152+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
...
2020-10-09T17:49:30.179 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.322 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:29.692+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.334 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.363 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:30.185+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.547 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.684 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:30.340+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.703 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.755 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6231-ge783bcc4/rpm/el8/BUILD/ceph-16.0.0-6231-ge783bcc4/src/client/Client.cc: In function 'int Client::_do_remount(bool)' thread 1d6d3700 time 2020-10-09T17:49:30.742652+0000
2020-10-09T17:49:30.756 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6231-ge783bcc4/rpm/el8/BUILD/ceph-16.0.0-6231-ge783bcc4/src/client/Client.cc: 4315: ceph_abort_msg("abort() called")
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: ceph version 16.0.0-6231-ge783bcc4 (e783bcc4ed94e312ec724570ede3e64902f921b4) pacific (dev)
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x52f8b18]
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 2: (Client::_do_remount(bool)+0x3ce) [0x1a606e]
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 3: (Context::complete(int)+0xd) [0x21127d]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 4: (Finisher::finisher_thread_entry()+0x28d) [0x539bb1d]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 5: /lib64/libpthread.so.0(+0x82de) [0xef062de]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 6: clone()
2020-10-09T17:49:30.779 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:*** Caught signal (Aborted) **
2020-10-09T17:49:30.779 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: in thread 1d6d3700 thread_name:fn_anonymous
2020-10-09T17:49:30.815 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: ceph version 16.0.0-6231-ge783bcc4 (e783bcc4ed94e312ec724570ede3e64902f921b4) pacific (dev)
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0xef10dc0]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 2: gsignal()
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 3: abort()
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x52f8be9]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 5: (Client::_do_remount(bool)+0x3ce) [0x1a606e]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 6: (Context::complete(int)+0xd) [0x21127d]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 7: (Finisher::finisher_thread_entry()+0x28d) [0x539bb1d]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 8: /lib64/libpthread.so.0(+0x82de) [0xef062de]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 9: clone()
From: /ceph/teuthology-archive/pdonnell-2020-10-09_03:07:06-multimds-wip-pdonnell-testing-20201009.012350-distro-basic-smithi/5507591/teuthology.log
History
#1 Updated by Patrick Donnelly almost 3 years ago
- Labels (FS) qa-failure added
#2 Updated by Patrick Donnelly almost 3 years ago
- Status changed from New to Triaged
- Assignee set to Xiubo Li
#3 Updated by Xiubo Li almost 3 years ago
The client in smithi132 was blocklisted:
2020-10-09T17:48:56.320 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:48:56.199+0000 1a6cd700 -1 client.4454 I was blocklisted at osd epoch 25
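This blocklisting also connects to the abort in the description: by the time ceph-fuse tried to remount the mount point to trim kernel dentries (during teardown, after fusermount -u), the remount kept returning 32, and Client::_do_remount() ended in ceph_abort_msg(). A minimal, illustrative sketch of that failure path follows; the option and helper names are simplified placeholders, not the actual client code:

// Illustrative sketch only (not Ceph source): models the failure path in the
// description's backtrace. The remount issued to trim kernel dentries fails
// because the mount point is already gone ("return code = 32"), and the
// client treats that as fatal.
#include <cstdlib>
#include <iostream>

struct ClientSketch {
  bool abort_on_failed_remount = true;   // placeholder for whatever makes the failure fatal

  // Placeholder for the remount helper; the mount point is no longer mounted,
  // so it fails just like "mount: ... not mounted or bad option" in the log.
  int remount_to_trim_dentries() { return 32; }

  void do_remount() {
    int r = remount_to_trim_dentries();
    if (r != 0) {
      std::cerr << "failed to remount (to trim kernel dentries): return code = "
                << r << std::endl;
      if (abort_on_failed_remount)
        std::abort();                    // corresponds to ceph_abort_msg("abort() called")
    }
  }
};

int main() {
  ClientSketch c;
  c.do_remount();   // terminates with SIGABRT, like the ceph-fuse crash above
  return 0;
}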
#4 Updated by Xiubo Li almost 3 years ago
Xiubo Li wrote:
The client in smithi132 was blocklisted:
[...]
The mds.a evicted the client:
2020-10-09T17:48:55.356+0000 1948c700 10 mds.2.log trim 10 / 128 segments, 6551 / -1 events, 0 (0) expiring, 0 (0) expired
2020-10-09T17:48:55.357+0000 1948c700 10 mds.2.log _trim_expired_segments waiting for 1/4194304 to expire
2020-10-09T17:48:55.357+0000 1948c700 10 mds.2.server find_idle_sessions. last cleared laggy state 8042.06s ago
2020-10-09T17:48:55.357+0000 1948c700 20 mds.2.server evicting session client.4454 172.21.15.132:0/2210819714 since autoclose has arrived
2020-10-09T17:48:55.361+0000 1948c700 0 log_channel(cluster) log [WRN] : evicting unresponsive client (4454), after 300.725 seconds
2020-10-09T17:48:55.361+0000 1948c700 10 mds.2.server autoclosing stale session client.4454 172.21.15.132:0/2210819714 last renewed caps 300.725s ago
2020-10-09T17:48:55.363+0000 1948c700 1 mds.2.10 Evicting (and blocklisting) client session 4454 (172.21.15.132:0/2210819714)
2020-10-09T17:48:55.364+0000 1948c700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4454 (172.21.15.132:0/2210819714)
2020-10-09T17:48:55.364+0000 1948c700 4 mds.2.10 Preparing blocklist command... (wait=0)
2020-10-09T17:48:55.367+0000 1948c700 4 mds.2.10 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.132:0/2210819714"}
2020-10-09T17:48:55.373+0000 1948c700 1 -- [v2:172.21.15.79:6834/4235987290,v1:172.21.15.79:6835/4235987290] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mon_command({"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.132:0/2210819714"} v 0) v1 -- 0x1a6e8360 con 0x11278740
From the log, the client stopped renewing its caps, and after 'session_autoclose = 300' seconds mds.a closed the session.
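As a quick sanity check of that timeline, here is a small, illustrative calculation. It assumes the default session_autoclose of 300 seconds and uses the eviction-scan timestamp from the log above together with the client's last renew_caps timestamp quoted in the next comment; the resulting span is close to the 300.725 s the MDS itself logged:

// Illustrative timing check only, using timestamps quoted in this ticket.
#include <iostream>

int main() {
  const double session_autoclose = 300.0;                    // assumed default, per the comment above
  const double last_renew = 17 * 3600 + 43 * 60 + 54.587;    // client's last renew_caps, 17:43:54.587
  const double scan_time  = 17 * 3600 + 48 * 60 + 55.357;    // mds.2 find_idle_sessions, 17:48:55.357
  const double last_cap_renew_span = scan_time - last_renew;

  std::cout << "last_cap_renew_span = " << last_cap_renew_span << " s\n";
  if (last_cap_renew_span >= session_autoclose)
    std::cout << "evicting session since autoclose has arrived\n";
  return 0;
}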
#5 Updated by Xiubo Li almost 3 years ago
From remote/smithi132/log/ceph-client.0.26463.log.gz, the client didn't send any cap renewal request to the MDSs for more than 300 seconds:
2020-10-09T17:43:54.586+0000 18eca700 2 client.4454 tick
2020-10-09T17:43:54.587+0000 18eca700 10 client.4454 renew_caps()
2020-10-09T17:43:54.587+0000 18eca700 15 client.4454 renew_caps requesting from mds.0
2020-10-09T17:43:54.587+0000 18eca700 10 client.4454 renew_caps mds.0
...
2020-10-09T17:49:05.054+0000 18eca700 10 client.4454 renew_caps()
The mds.a daemon then detected that no renewal request had arrived for more than get_session_autoclose(), which is 300 seconds, and evicted the client:
if (last_cap_renew_span >= mds->mdsmap->get_session_autoclose()) {
  dout(20) << "evicting session " << session->info.inst << " since autoclose "
              "has arrived" << dendl;
  // evict session without marking it stale
  to_evict.push_back(session);
  continue;
}
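The same check in isolation, as a small self-contained sketch (simplified placeholder types, not the actual MDS code); client.4454's span is the value the MDS logged, and the second session is purely illustrative:

// Illustrative sketch only: sessions whose renewal span has crossed
// session_autoclose are queued for eviction without first being marked stale.
#include <iostream>
#include <string>
#include <vector>

struct Session {
  std::string inst;              // e.g. "client.4454 172.21.15.132:0/2210819714"
  double last_cap_renew_span;    // seconds since this client last renewed caps
};

int main() {
  const double session_autoclose = 300.0;
  std::vector<Session> sessions = {
    {"client.4454 172.21.15.132:0/2210819714", 300.725},   // value from the log above
    {"client.9999 (healthy, illustrative)", 12.3},
  };

  std::vector<const Session*> to_evict;
  for (const auto& session : sessions) {
    if (session.last_cap_renew_span >= session_autoclose) {
      std::cout << "evicting session " << session.inst
                << " since autoclose has arrived\n";
      // evict session without marking it stale
      to_evict.push_back(&session);
      continue;
    }
  }
  std::cout << to_evict.size() << " session(s) to evict\n";
  return 0;
}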
#6 Updated by Xiubo Li almost 3 years ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 37746
#7 Updated by Patrick Donnelly almost 3 years ago
- Status changed from Fix Under Review to Resolved