Bug #47842

qa: "fsstress.sh: line 16: 28870 Bus error (core dumped) "$BIN" -d "$T" -l 1 -n 1000 -p 10 -v"

Added by Patrick Donnelly 4 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, qa-suite
Labels (FS):
crash, qa-failure
Pull request ID:
Crash signature:

Description

2020-10-09T17:49:16.750 INFO:tasks.workunit.client.0.smithi132.stdout:9/421: symlink d8/l8a 108
2020-10-09T17:49:16.763 INFO:tasks.workunit.client.0.smithi132.stdout:6/950: read - open d3/d18/d5d/dcd/fe5 failed 108
2020-10-09T17:49:16.791 INFO:tasks.workunit.client.0.smithi132.stderr:/home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh: line 16: 28870 Bus error               (core dumped) "$BIN" -d "$T" -l 1 -n 1000 -p 10 -v
2020-10-09T17:49:16.792 DEBUG:teuthology.orchestra.run:got remote process result: 135
2020-10-09T17:49:16.811 INFO:tasks.workunit:Stopping ['suites/fsstress.sh'] on client.0...
2020-10-09T17:49:16.812 INFO:teuthology.orchestra.run.smithi132:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2020-10-09T17:49:17.044 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 90, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 69, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20201009.012350/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 215, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 446, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 160, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 182, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test suites/fsstress.sh) on smithi132 with status 135: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e783bcc4ed94e312ec724570ede3e64902f921b4 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 6h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
2020-10-09T17:49:17.083 ERROR:teuthology.run_tasks: Sentry event: https://sentry.ceph.com/sentry/sepia/?query=66dccb8cc8bc418ebf4439b82620b30e
2020-10-09T17:49:17.084 DEBUG:teuthology.run_tasks:Unwinding manager ceph-fuse
2020-10-09T17:49:17.096 INFO:tasks.ceph_fuse:Unmounting ceph-fuse clients...
2020-10-09T17:49:17.096 INFO:tasks.cephfs.fuse_mount:Running fusermount -u on ubuntu@smithi132.front.sepia.ceph.com...
2020-10-09T17:49:17.097 INFO:teuthology.orchestra.run:Running command with timeout 1800
2020-10-09T17:49:17.097 INFO:teuthology.orchestra.run.smithi132:> (cd /home/ubuntu/cephtest && exec sudo fusermount -u /home/ubuntu/cephtest/mnt.0)
2020-10-09T17:49:17.151 INFO:teuthology.orchestra.run:waiting for 900
2020-10-09T17:49:17.152 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:17.210 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:ceph-fuse[2020-10-09T17:49:17.152+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
...
2020-10-09T17:49:30.179 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.322 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:29.692+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.334 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.363 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:30.185+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.547 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.684 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:49:30.340+0000 1d6d3700 -1 client.4454 failed to remount (to trim kernel dentries): return code = 32
2020-10-09T17:49:30.703 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:mount: /home/ubuntu/cephtest/mnt.0: mount point not mounted or bad option.
2020-10-09T17:49:30.755 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6231-ge783bcc4/rpm/el8/BUILD/ceph-16.0.0-6231-ge783bcc4/src/client/Client.cc: In function 'int Client::_do_remount(bool)' thread 1d6d3700 time 2020-10-09T17:49:30.742652+0000
2020-10-09T17:49:30.756 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.0.0-6231-ge783bcc4/rpm/el8/BUILD/ceph-16.0.0-6231-ge783bcc4/src/client/Client.cc: 4315: ceph_abort_msg("abort() called")
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: ceph version 16.0.0-6231-ge783bcc4 (e783bcc4ed94e312ec724570ede3e64902f921b4) pacific (dev)
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x52f8b18]
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 2: (Client::_do_remount(bool)+0x3ce) [0x1a606e]
2020-10-09T17:49:30.777 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 3: (Context::complete(int)+0xd) [0x21127d]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 4: (Finisher::finisher_thread_entry()+0x28d) [0x539bb1d]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 5: /lib64/libpthread.so.0(+0x82de) [0xef062de]
2020-10-09T17:49:30.778 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 6: clone()
2020-10-09T17:49:30.779 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:*** Caught signal (Aborted) **
2020-10-09T17:49:30.779 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: in thread 1d6d3700 thread_name:fn_anonymous
2020-10-09T17:49:30.815 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: ceph version 16.0.0-6231-ge783bcc4 (e783bcc4ed94e312ec724570ede3e64902f921b4) pacific (dev)
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0xef10dc0]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 2: gsignal()
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 3: abort()
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x52f8be9]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 5: (Client::_do_remount(bool)+0x3ce) [0x1a606e]
2020-10-09T17:49:30.816 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 6: (Context::complete(int)+0xd) [0x21127d]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 7: (Finisher::finisher_thread_entry()+0x28d) [0x539bb1d]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 8: /lib64/libpthread.so.0(+0x82de) [0xef062de]
2020-10-09T17:49:30.817 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr: 9: clone()

From: /ceph/teuthology-archive/pdonnell-2020-10-09_03:07:06-multimds-wip-pdonnell-testing-20201009.012350-distro-basic-smithi/5507591/teuthology.log

History

#1 Updated by Patrick Donnelly 4 months ago

  • Labels (FS) qa-failure added

#2 Updated by Patrick Donnelly 3 months ago

  • Status changed from New to Triaged
  • Assignee set to Xiubo Li

#3 Updated by Xiubo Li 3 months ago

The client on smithi132 was blocklisted:

2020-10-09T17:48:56.320 INFO:tasks.cephfs.fuse_mount.ceph-fuse.0.smithi132.stderr:2020-10-09T17:48:56.199+0000 1a6cd700 -1 client.4454 I was blocklisted at osd epoch 25

#4 Updated by Xiubo Li 3 months ago

Xiubo Li wrote:

The client on smithi132 was blocklisted:

[...]

The mds.a evicted the client:

2020-10-09T17:48:55.356+0000 1948c700 10 mds.2.log trim 10 / 128 segments, 6551 / -1 events, 0 (0) expiring, 0 (0) expired
2020-10-09T17:48:55.357+0000 1948c700 10 mds.2.log _trim_expired_segments waiting for 1/4194304 to expire
2020-10-09T17:48:55.357+0000 1948c700 10 mds.2.server find_idle_sessions. last cleared laggy state 8042.06s ago
2020-10-09T17:48:55.357+0000 1948c700 20 mds.2.server evicting session client.4454 172.21.15.132:0/2210819714 since autoclose has arrived
2020-10-09T17:48:55.361+0000 1948c700  0 log_channel(cluster) log [WRN] : evicting unresponsive client  (4454), after 300.725 seconds
2020-10-09T17:48:55.361+0000 1948c700 10 mds.2.server autoclosing stale session client.4454 172.21.15.132:0/2210819714 last renewed caps 300.725s ago
2020-10-09T17:48:55.363+0000 1948c700  1 mds.2.10 Evicting (and blocklisting) client session 4454 (172.21.15.132:0/2210819714)
2020-10-09T17:48:55.364+0000 1948c700  0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 4454 (172.21.15.132:0/2210819714)
2020-10-09T17:48:55.364+0000 1948c700  4 mds.2.10 Preparing blocklist command... (wait=0)
2020-10-09T17:48:55.367+0000 1948c700  4 mds.2.10 Sending mon blocklist command: {"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.132:0/2210819714"}
2020-10-09T17:48:55.373+0000 1948c700  1 -- [v2:172.21.15.79:6834/4235987290,v1:172.21.15.79:6835/4235987290] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mon_command({"prefix":"osd blocklist", "blocklistop":"add","addr":"172.21.15.132:0/2210819714"} v 0) v1 -- 0x1a6e8360 con 0x11278740

The log shows the client stopped renewing its caps, and after 'session_autoclose = 300' seconds mds.a closed the session.
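The ordering of events can be sanity-checked by comparing timestamps copied from the log excerpts in this ticket (a minimal sketch; the event names are my own shorthand, not log fields):

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this ticket
events = [
    ("mds evicts and blocklists client.4454", "2020-10-09T17:48:55.363"),
    ("client.4454 notices it was blocklisted", "2020-10-09T17:48:56.199"),
    ("fsstress child dies with a bus error", "2020-10-09T17:49:16.791"),
]

parsed = [(name, datetime.fromisoformat(ts)) for name, ts in events]

# The eviction strictly precedes the client noticing and the test failing
assert all(a[1] < b[1] for a, b in zip(parsed, parsed[1:]))
for name, ts in parsed:
    print(ts.time(), name)
```

So the bus error in fsstress is downstream of the eviction: the blocklisted client can no longer serve the mapped pages, and the workload process dies.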

#5 Updated by Xiubo Li 3 months ago

Per remote/smithi132/log/ceph-client.0.26463.log.gz, the client did not send any cap-renewal requests to the MDSs for more than 300 seconds:

2020-10-09T17:43:54.586+0000 18eca700  2 client.4454 tick
2020-10-09T17:43:54.587+0000 18eca700 10 client.4454 renew_caps()
2020-10-09T17:43:54.587+0000 18eca700 15 client.4454 renew_caps requesting from mds.0
2020-10-09T17:43:54.587+0000 18eca700 10 client.4454 renew_caps mds.0

...

2020-10-09T17:49:05.054+0000 18eca700 10 client.4454 renew_caps()

mds.a then detected that no renewal request had arrived for more than get_session_autoclose() (300s) and evicted the client:

      if (last_cap_renew_span >= mds->mdsmap->get_session_autoclose()) {
        dout(20) << "evicting session " << session->info.inst << " since autoclose "
                    "has arrived" << dendl;
        // evict session without marking it stale
        to_evict.push_back(session);
        continue;
      }

#6 Updated by Xiubo Li 3 months ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 37746

#7 Updated by Patrick Donnelly 2 months ago

  • Status changed from Fix Under Review to Resolved
