Project

General

Profile

Bug #36348

luminous(?): blogbench I/O with two kernel clients; one stalls

Added by Patrick Donnelly 2 months ago. Updated about 2 months ago.

Status:
Need More Info
Priority:
Urgent
Assignee:
Category:
-
Target version:
Start date:
10/08/2018
Due date:
% Done:

0%

Source:
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
Pull request ID:

Description

2018-10-06T01:41:36.157 INFO:tasks.workunit.client.0.smithi186.stdout:        57        45766           253         29521           214         19909           557
2018-10-06T01:41:36.158 INFO:tasks.workunit.client.1.smithi186.stdout:        74       282056           227        177601           192        113251           512
2018-10-06T01:41:46.158 INFO:tasks.workunit.client.0.smithi186.stdout:        61       177403           175        123272            72         70865           284
2018-10-06T01:41:46.158 INFO:tasks.workunit.client.1.smithi186.stdout:        75       134875           150         84465            79         55533           340
2018-10-06T01:41:56.158 INFO:tasks.workunit.client.0.smithi186.stdout:        63       246237           113        168618           263        104781           280
2018-10-06T01:41:56.158 INFO:tasks.workunit.client.1.smithi186.stdout:        77       155939            66        101268            56         67137           242
2018-10-06T01:42:06.158 INFO:tasks.workunit.client.0.smithi186.stdout:        67       192669            70        133553           156         79305           340
2018-10-06T01:42:06.158 INFO:tasks.workunit.client.1.smithi186.stdout:        77       195926           128        125944            70         81732           192
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:Final score for writes:            67
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:Final score for reads :         66275
2018-10-06T01:44:51.724 INFO:tasks.workunit.client.0.smithi186.stdout:
2018-10-06T01:44:51.725 INFO:teuthology.orchestra.run:Running command with timeout 900
2018-10-06T01:44:51.726 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/mnt.0/client.0/tmp'
2018-10-06T01:45:39.523 INFO:tasks.workunit:Stopping ['suites/blogbench.sh'] on client.0...
2018-10-06T01:45:39.524 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0'
2018-10-06T01:45:39.818 DEBUG:teuthology.parallel:result is None
2018-10-06T04:36:42.547 DEBUG:teuthology.orchestra.run:got remote process result: 124
2018-10-06T04:36:42.583 INFO:tasks.workunit:Stopping ['suites/blogbench.sh'] on client.1...
2018-10-06T04:36:42.583 INFO:teuthology.orchestra.run.smithi186:Running: 'sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.1 /home/ubuntu/cephtest/clone.client.1'
2018-10-06T04:36:42.883 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 86, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/run_tasks.py", line 65, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 206, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 356, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 85, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 99, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/parallel.py", line 22, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri4-testing-2018-10-05-2015-luminous/qa/tasks/workunit.py", line 479, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 429, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed (workunit test suites/blogbench.sh) on smithi186 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && cd -- /home/ubuntu/cephtest/mnt.1/client.1/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-yuri4-testing-2018-10-05-2015-luminous TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="1" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.1 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.1 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.1/qa/workunits/suites/blogbench.sh'

From: /ceph/teuthology-archive/yuriw-2018-10-05_22:19:38-multimds-wip-yuri4-testing-2018-10-05-2015-luminous-testing-basic-smithi/3105596/teuthology.log

Last client_request for the stalled client was:

2018-10-06 01:43:21.571868 7f8f4d47f700  1 -- 172.21.15.195:6816/1919957959 <== client.4259 172.21.15.186:0/2801504243 124457 ==== client_request(client.4259:315376 create #0x30000001b59/comment-1.xml.tmp 2018-10-06 01:43:21.569321 caller_uid=1000, caller_gid=1190{}) v2 ==== 139+0+0 (1101040945 0 0) 0x55eef482a3c0 con 0x55eef4804800

Only interesting thing I saw later was that rank 0 imports caps for that client:

2018-10-06 01:43:21.615523 7f8f47473700 10 mds.0.server finish_force_open_sessions: final v 5650
2018-10-06 01:43:21.615525 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x30000000afd [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/picture-5.jpg auth{2=1} v833 dirtyparent s=58231 n(v0 rc2018-10-06 01:39:28.420308 b58231 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5e000]
2018-10-06 01:43:21.615532 7f8f47473700 10 mds.0.cache.ino(0x30000000afd) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615537 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x30000000957 [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-57.xml auth{2=1} v961 dirtyparent s=2759 n(v0 rc2018-10-06 01:39:17.216308 b2759 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5e700]
2018-10-06 01:43:21.615541 7f8f47473700 10 mds.0.cache.ino(0x30000000957) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615543 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x300000013cb [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-58.xml auth{2=1} v963 dirtyparent s=3511 n(v0 rc2018-10-06 01:40:12.436307 b3511 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5ee00]
2018-10-06 01:43:21.615546 7f8f47473700 10 mds.0.cache.ino(0x300000013cb) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)
2018-10-06 01:43:21.615548 7f8f47473700 10 mds.0.migrator finish_import_inode_caps for client.4259 on [inode 0x3000000096e [2,head] /client.1/tmp/blogbench-1.0/src/blogtest_in/blog-34/article-53.xml auth{2=1} v951 dirtyparent s=16116 n(v0 rc2018-10-06 01:39:18.128308 b16116 1=1+0) (iversion lock) | importingcaps=1 dirtyparent=1 replicated=1 dirty=1 0x55eef4c5f500]
2018-10-06 01:43:21.615552 7f8f47473700 10 mds.0.cache.ino(0x3000000096e) add_client_cap first cap, joining realm snaprealm(0x1 seq 1 lc 0 cr 0 cps 1 snaps={} 0x55eef4606a00)

Related issues

Duplicated by fs - Bug #36366: luminous: qa: blogbench hang with two kclients and 3 active mds Duplicate 10/09/2018

History

#1 Updated by Zheng Yan 2 months ago

  • Status changed from New to Duplicate

#2 Updated by Patrick Donnelly 2 months ago

  • Duplicated by Bug #36366: luminous: qa: blogbench hang with two kclients and 3 active mds added

#3 Updated by Patrick Donnelly 2 months ago

  • Status changed from Duplicate to New

Reversing duplicate since this is older. Can't believe I forgot I already opened an issue for this.

#4 Updated by Patrick Donnelly about 2 months ago

  • Status changed from New to Need More Info

Need logs.

Also available in: Atom PDF