Project

General

Profile

Bug #36078

mds: 9 active MDS cluster stuck during fsstress

Added by Patrick Donnelly 3 months ago. Updated 2 months ago.

Status:
Need More Info
Priority:
Immediate
Assignee:
Category:
-
Target version:
Start date:
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
multimds
Pull request ID:

Description

2018-09-18T23:17:57.267 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.1...
2018-09-18T23:17:57.267 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2018-09-18T23:17:57.267 INFO:teuthology.orchestra.run.smithi205:Running (workunit test suites/fsstress.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20180918.221315 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/fsstress.sh'
2018-09-18T23:17:57.316 INFO:tasks.workunit:Running workunits matching suites/fsstress.sh on client.0...
2018-09-18T23:17:57.316 INFO:tasks.workunit:Running workunit suites/fsstress.sh...
2018-09-18T23:17:57.316 INFO:teuthology.orchestra.run.smithi205:Running (workunit test suites/fsstress.sh): 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20180918.221315 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/fsstress.sh'
2018-09-18T23:17:57.920 INFO:tasks.workunit.client.0.smithi205.stderr:+ mkdir -p fsstress
2018-09-18T23:17:57.935 INFO:tasks.workunit.client.1.smithi205.stderr:+ mkdir -p fsstress
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stderr:+ pushd fsstress
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2018-09-18T23:17:57.977 INFO:tasks.workunit.client.0.smithi205.stdout:~/cephtest/mnt.0/client.0/tmp/fsstress ~/cephtest/mnt.0/client.0/tmp
2018-09-18T23:17:58.020 INFO:tasks.workunit.client.1.smithi205.stderr:+ pushd fsstress
2018-09-18T23:17:58.021 INFO:tasks.workunit.client.1.smithi205.stderr:+ wget -q -O ltp-full.tgz http://download.ceph.com/qa/ltp-full-20091231.tgz
2018-09-18T23:17:58.021 INFO:tasks.workunit.client.1.smithi205.stdout:~/cephtest/mnt.1/client.1/tmp/fsstress ~/cephtest/mnt.1/client.1/tmp
2018-09-18T23:17:59.370 INFO:tasks.workunit.client.1.smithi205.stderr:+ tar xzf ltp-full.tgz
2018-09-18T23:18:01.751 INFO:tasks.workunit.client.0.smithi205.stderr:+ tar xzf ltp-full.tgz
2018-09-18T23:18:06.681 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:06.685 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:06.688 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:12.346 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.214 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.368 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.401 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:12.402 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:12.215 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.258 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.875 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.259 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.875 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.259 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.876 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:33.260 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:32.876 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:36.856 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:36.860 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:36.863 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:18:43.778 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.786 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.791 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:43.811 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:43.584 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.783 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.812 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.814 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.814 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:18:54.815 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:18:54.640 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.753 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:04.771 INFO:tasks.ceph.mds.a.smithi173.stderr:2018-09-18 23:19:04.754 18b10700 -1 mds.0.bal  balancer runs too long
2018-09-18T23:19:07.025 INFO:teuthology.orchestra.run.smithi150:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:19:07.030 INFO:teuthology.orchestra.run.smithi173:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
2018-09-18T23:19:07.034 INFO:teuthology.orchestra.run.smithi205:Running: 'sudo logrotate /etc/logrotate.d/ceph-test.conf'
...

From: /ceph/teuthology-archive/pdonnell-2018-09-18_22:20:57-multimds:verify-master-distro-basic-smithi/3039349/teuthology.log

Note! I killed fsstress/rm once it was clear they were stuck.

This is with fsstress modified to work properly with parallel instances: https://github.com/ceph/ceph/pull/24157

History

#1 Updated by Zheng Yan 3 months ago

  • Status changed from New to Need More Info

I don't see any stuck client request in any mds. The "balancer runs too long" warning is because the mds ran too slowly (ether machine is too slow or debug level is too high). I added code to limit balancer to run 0.1 second at most. But the log show that mds took 0.01 second to check popularity of a 50 entries directory.

#2 Updated by Patrick Donnelly 2 months ago

Also in Luminous: /ceph/teuthology-archive/yuriw-2018-10-05_22:19:38-multimds-wip-yuri4-testing-2018-10-05-2015-luminous-testing-basic-smithi/3105585/teuthology.log

Also available in: Atom PDF