Bug #36078: mds: 9 active MDS cluster stuck during fsstress - CephFS - Ceph

Actions

Copy link

Bug #36078

closed

mds: 9 active MDS cluster stuck during fsstress

Added by Patrick Donnelly over 5 years ago. Updated about 4 years ago.

Status:

Can't reproduce

Priority:

Immediate

Assignee:

Zheng Yan

Category:

Target version:

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

multimds

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

2018-09-18T23:17:57.267 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.1...
2018-09-18T23:17:57.267 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2018-09-18T23:17:57.267 INFO:teuthology.orchestra.run.smithi205:Running (workunit test suites/fsstress.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20180918.221315 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh'
2018-09-18T23:17:57.316 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.0...
2018-09-18T23:17:57.316 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2018-09-18T23:17:57.316 INFO:teuthology.orchestra.run.smithi205:Running (workunit test suites/fsstress.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20180918.221315 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
2018-09-18T23:17:57.920 INFO:tasks.workunit.client.0.smithi205.stderr:+ mkdir -p fsstress
2018-09-18T23:17:57.935 INFO:tasks.workunit.client.1.smithi205.stderr:+ mkdir -p fsstress
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stderr:+ pushd fsstress
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stdout:~/cephtest/mnt.0/client.0/tmp/fsstress ~/cephtest/mnt.0/client.0/tmp
2018-09-18T23:17:58.020 INFO:tasks.workunit.client.1.smithi205.stderr:+ pushd fsstress
2018-09-18T23:17:58.021 INFO:tasks.workunit.client.1.smithi205.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2018-09-18T23:17:58.021 INFO:tasks.workunit.client.1.smithi205.stdout:~/cephtest/mnt.1/client.1/tmp/fsstress ~/cephtest/mnt.1/client.1/tmp
2018-09-18T23:17:59.370 INFO:tasks.workunit.client.1.smithi205.stderr:+ tar xzf ltp-full.tgz
2018-09-18T23:18:01.751 INFO:tasks.workunit.client.0.smithi205.stderr:+ tar xzf ltp-full.tgz
2018-09-18T23:18:06.681 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:06.685 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:06.688 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:12.346 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.214 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.368 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.401 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.402 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.258 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.875 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.259 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.875 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.259 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.876 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.260 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.876 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:36.856 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:36.860 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:36.863 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:43.778 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.786 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.791 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.811 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.783 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.812 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.814 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.814 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.815 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.771 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:07.025 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:19:07.030 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:19:07.034 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
...

From: /ceph/teuthology-archive/pdonnell-2018-09-18_22:20:57-multimds:verify-master-distro-basic-smithi/3039349/teuthology.log

Note! I killed fsstress/rm once it was clear they were stuck.

This is with fsstress modified to work properly with parallel instances: https://github.com/ceph/ceph/pull/24157

Actions

Copy link

Updated by Zheng Yan over 5 years ago

Status changed from New to Need More Info

I don't see any stuck client request in any mds. The "balancer runs too long" warning is because the mds ran too slowly (ether machine is too slow or debug level is too high). I added code to limit balancer to run 0.1 second at most. But the log show that mds took 0.01 second to check popularity of a 50 entries directory.

Actions

Copy link

Updated by Patrick Donnelly over 5 years ago

Also in Luminous: /ceph/teuthology-archive/yuriw-2018-10-05_22:19:38-multimds-wip-yuri4-testing-2018-10-05-2015-luminous-testing-basic-smithi/3105585/teuthology.log

Actions

Copy link