Bug #13879: ovh: hadoop wordcount is timing out on OpenStack nodes - CephFS - Ceph

Added by Greg Farnum over 8 years ago. Updated over 7 years ago.

Status: New
Priority: Low
Category: Testing
Source: Q/A
Severity: 3 - minor

Description

We don't really have any logging for this, and I think it runs pretty fast on sepia. We've also got some other issues with MPI not being able to communicate, which makes me wonder if we're trying to use ports that are blocked, although the other job in the suite is passing.
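One way to test the blocked-ports theory would be a plain TCP probe from the client node. The following is only a sketch: the helper names `port_open` and `report` are made up for illustration, and the two endpoints are simply the ResourceManager address and the job tracking URL that appear in the log for this run. It does not cover any other ports the Hadoop daemons may use between nodes.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Endpoints copied from the log of this run (assumption: they are still
# the relevant ones; substitute the addresses of the cluster under test).
ENDPOINTS = [
    ("158.69.71.28", 8050),                     # ResourceManager
    ("target071028.ovh.sepia.ceph.com", 8088),  # job tracking URL
]


def report(endpoints):
    """Map 'host:port' to True (reachable) or False (blocked or closed)."""
    return {f"{host}:{port}": port_open(host, port) for host, port in endpoints}
```

Running `print(report(ENDPOINTS))` on the client node would show which of the two endpoints accept connections. A `False` result cannot distinguish a firewall drop from a service that is not listening, so it narrows the question rather than answering it.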

2015-11-14T18:44:58.402 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:58 INFO client.RMProxy: Connecting to ResourceManager at /158.69.71.28:8050
2015-11-14T18:44:59.092 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:58 INFO ceph.CephFileSystem: selectDataPool path=/tmp/hadoop-yarn/staging/ubuntu/.staging/job_1447526634752_0001/job.jar pool:repl=data:2 wanted=3
2015-11-14T18:44:59.203 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO input.FileInputFormat: Total input paths to process : 3
2015-11-14T18:44:59.225 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO ceph.CephFileSystem: selectDataPool path=/tmp/hadoop-yarn/staging/ubuntu/.staging/job_1447526634752_0001/job.split pool:repl=data:2 wanted=3
2015-11-14T18:44:59.249 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO ceph.CephFileSystem: selectDataPool path=/tmp/hadoop-yarn/staging/ubuntu/.staging/job_1447526634752_0001/job.splitmetainfo pool:repl=data:2 wanted=3
2015-11-14T18:44:59.266 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO mapreduce.JobSubmitter: number of splits:3
2015-11-14T18:44:59.274 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO ceph.CephFileSystem: selectDataPool path=/tmp/hadoop-yarn/staging/ubuntu/.staging/job_1447526634752_0001/job.xml pool:repl=data:2 wanted=3
2015-11-14T18:44:59.525 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:44:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1447526634752_0001
2015-11-14T18:45:00.863 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:45:00 INFO impl.YarnClientImpl: Submitted application application_1447526634752_0001
2015-11-14T18:45:00.907 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:45:00 INFO mapreduce.Job: The url to track the job: http://target071028.ovh.sepia.ceph.com:8088/proxy/application_1447526634752_0001/
2015-11-14T18:45:00.908 INFO:tasks.workunit.client.0.target071029.stderr:15/11/14 18:45:00 INFO mapreduce.Job: Running job: job_1447526634752_0001
2015-11-14T21:44:47.105 INFO:tasks.workunit:Stopping ['hadoop/wordcount.sh'] on client.0...
2015-11-14T21:44:47.106 INFO:teuthology.orchestra.run.target071029:Running: 'rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/workunit.client.0'
2015-11-14T21:44:47.410 ERROR:teuthology.parallel:Exception in parallel execution
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 82, in __exit__
    for result in self:
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 101, in next
    resurrect_traceback(result)
  File "/home/teuthworker/src/teuthology_master/teuthology/parallel.py", line 19, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/ceph-qa-suite_hammer/tasks/workunit.py", line 361, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 156, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 378, in run
    r.wait()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/run.py", line 114, in wait
    label=self.label)
CommandFailedError: Command failed (workunit test hadoop/wordcount.sh) on target071029 with status 124: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=a79acd41187e6b049432bdc314f192e3fbb560a3 TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PATH=$PATH:/usr/sbin adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/workunit.client.0/hadoop/wordcount.sh'
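For reading the failure above: the workunit is wrapped in `timeout 3h`, and status 124 is what GNU coreutils `timeout` returns when the time limit is reached. That matches the roughly three-hour gap between "Running job" at 18:45:00 and the stop at 21:44:47, so the job appears to have hung rather than failed outright. A minimal sketch reproducing that exit status, assuming GNU coreutils `timeout` is on the PATH (`run_with_limit` is a hypothetical helper, not part of teuthology):

```python
import subprocess

# GNU coreutils `timeout` exits with 124 when the command hits the limit.
TIMEOUT_EXIT_STATUS = 124


def run_with_limit(limit, argv):
    """Run argv under `timeout <limit>`.

    Returns (exit_status, timed_out), where timed_out is True only when
    the exit status is the one `timeout` uses for an expired limit.
    """
    proc = subprocess.run(["timeout", limit, *argv])
    return proc.returncode, proc.returncode == TIMEOUT_EXIT_STATUS
```

With this helper, `run_with_limit("1", ["sleep", "5"])` hits the limit, while a command that finishes in time returns its own exit status unchanged.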

Updated by John Spray over 7 years ago

Subject changed from qa: hadoop wordcount is timing out on OpenStack nodes to ovh: hadoop wordcount is timing out on OpenStack nodes
Priority changed from Normal to Low
