Bug #4624


crush_ops failure

Added by Samuel Just about 11 years ago. Updated about 11 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

"name": "b",
"addr": "10.214.132.18:6789\/0"}, { "rank": 1,
"name": "a",
"addr": "10.214.132.21:6789\/0"}, { "rank": 2,
"name": "c",
"addr": "10.214.132.21:6790\/0"}]}}

2013-04-01T13:07:29.538 INFO:teuthology.task.mon_thrash.ceph_manager:quorum is size 2
2013-04-01T13:07:29.539 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise -m 10.214.132.18:6789 mon_status'
2013-04-01T13:07:29.620 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise -m 10.214.132.21:6790 mon_status'
2013-04-01T13:07:29.958 INFO:teuthology.task.mon_thrash.mon_thrasher:waiting for 20.0 secs before reviving monitors
2013-04-01T13:07:49.957 INFO:teuthology.task.mon_thrash.mon_thrasher:reviving mon.a
2013-04-01T13:07:49.958 INFO:teuthology.task.ceph.mon.a:Restarting
2013-04-01T13:07:49.958 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage sudo /home/ubuntu/cephtest/daemon-helper kill ceph-mon -f -i a'
2013-04-01T13:07:49.963 INFO:teuthology.task.ceph.mon.a:Started
2013-04-01T13:07:49.963 INFO:teuthology.task.mon_thrash.ceph_manager:waiting for quorum size 3
2013-04-01T13:07:49.963 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise quorum_status'
2013-04-01T13:07:49.989 INFO:teuthology.orchestra.run.err:2013-04-01 13:07:36.986502 7fdcd1242700 0 -- :/5036 >> 10.214.132.21:6789/0 pipe(0x1bee220 sd=7 :0 s=1 pgs=0 cs=0 l=1).fault
2013-04-01T13:07:50.221 INFO:teuthology.task.ceph.mon.a.out:starting mon.a rank 1 at 10.214.132.21:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid 287bd577-45b0-440f-b6bc-fa90653a20ec
2013-04-01T13:07:52.996 INFO:teuthology.task.mon_thrash.ceph_manager:quorum_status is { "election_epoch": 16,
"quorum": [
0,
1,
2],
"monmap": { "epoch": 1,
"fsid": "287bd577-45b0-440f-b6bc-fa90653a20ec",
"modified": "2013-04-01 13:05:01.310493",
"created": "2013-04-01 13:05:01.310493",
"mons": [ { "rank": 0,
"name": "b",
"addr": "10.214.132.18:6789\/0"}, { "rank": 1,
"name": "a",
"addr": "10.214.132.21:6789\/0"}, { "rank": 2,
"name": "c",
"addr": "10.214.132.21:6790\/0"}]}}

2013-04-01T13:07:52.996 INFO:teuthology.task.mon_thrash.ceph_manager:quorum is size 3
2013-04-01T13:07:52.996 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise -m 10.214.132.18:6789 mon_status'
2013-04-01T13:07:53.029 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise -m 10.214.132.21:6789 mon_status'
2013-04-01T13:07:53.091 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise -m 10.214.132.21:6790 mon_status'
2013-04-01T13:07:53.314 INFO:teuthology.task.mon_thrash.mon_thrasher:waiting for 1.0 secs before continuing thrashing
2013-04-01T13:07:54.314 INFO:teuthology.task.mon_thrash.ceph_manager:waiting for quorum size 3
2013-04-01T13:07:54.314 DEBUG:teuthology.orchestra.run:Running [10.214.132.21]: '/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --concise quorum_status'
2013-04-01T13:07:54.345 INFO:teuthology.task.mon_thrash.ceph_manager:quorum_status is { "election_epoch": 16,
"quorum": [
0,
1,
2],
"monmap": { "epoch": 1,
"fsid": "287bd577-45b0-440f-b6bc-fa90653a20ec",
"modified": "2013-04-01 13:05:01.310493",
"created": "2013-04-01 13:05:01.310493",
"mons": [ { "rank": 0,
"name": "b",
"addr": "10.214.132.18:6789\/0"}, { "rank": 1,
"name": "a",
"addr": "10.214.132.21:6789\/0"}, { "rank": 2,
"name": "c",
"addr": "10.214.132.21:6790\/0"}]}}

2013-04-01T13:07:54.345 INFO:teuthology.task.mon_thrash.ceph_manager:quorum is size 3
2013-04-01T13:07:54.345 DEBUG:teuthology.run_tasks:Unwinding manager <contextlib.GeneratorContextManager object at 0x26ac290>
2013-04-01T13:07:54.345 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/var/lib/teuthworker/teuthology-master/teuthology/contextutil.py", line 27, in nested
yield vars
File "/var/lib/teuthworker/teuthology-master/teuthology/task/ceph.py", line 1112, in task
yield
File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 25, in run_tasks
manager = _run_one_task(taskname, ctx=ctx, config=config)
File "/var/lib/teuthworker/teuthology-master/teuthology/run_tasks.py", line 14, in _run_one_task
return fn(**kwargs)
File "/var/lib/teuthworker/teuthology-master/teuthology/task/workunit.py", line 90, in task
all_spec = True
File "/var/lib/teuthworker/teuthology-master/teuthology/parallel.py", line 83, in __exit__
for result in self:
File "/var/lib/teuthworker/teuthology-master/teuthology/parallel.py", line 100, in next
resurrect_traceback(result)
File "/var/lib/teuthworker/teuthology-master/teuthology/parallel.py", line 19, in capture_traceback
return func(*args, **kwargs)
File "/var/lib/teuthworker/teuthology-master/teuthology/task/workunit.py", line 302, in _run_tests
args=args,
File "/var/lib/teuthworker/teuthology-master/teuthology/orchestra/remote.py", line 42, in run
r = self._runner(client=self.ssh, **kwargs)
File "/var/lib/teuthworker/teuthology-master/teuthology/orchestra/run.py", line 266, in run
r.exitstatus = _check_status(r.exitstatus)
File "/var/lib/teuthworker/teuthology-master/teuthology/orchestra/run.py", line 262, in _check_status
raise CommandFailedError(command=r.command, exitstatus=status, node=host)
CommandFailedError: Command failed on 10.214.132.18 with status 2: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_REF=master TESTDIR="/home/ubuntu/cephtest" CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/home/ubuntu/cephtest/binary/usr/local/lib/python2.7/dist-packages:/home/ubuntu/cephtest/binary/usr/local/lib/python2.6/dist-packages" /home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage /home/ubuntu/cephtest/workunit.client.0/mon/crush_ops.sh'
2013-04-01T13:07:54.346 INFO:teuthology.task.ceph:Shutting down mds daemons...
2013-04-01T13:07:54.347 DEBUG:teuthology.task.ceph.mds.a:waiting for process to exit
2013-04-01T13:07:54.355 INFO:teuthology.task.ceph.mds.a:Stopped
2013-04-01T13:07:54.356 INFO:teuthology.task.ceph:Shutting down osd daemons...
2013-04-01T13:07:54.356 DEBUG:teuthology.task.ceph.osd.1:waiting for process to exit

description: collection:monthrash clusters:fixed-2.yaml fs:xfs.yaml msgr-failures:few.yaml
thrashers:mon-thrasher.yaml workloads:rados_mon_workunits.yaml
duration: 521.2649898529053
failure_reason: 'Command failed on 10.214.132.18 with status 2: ''mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp
&& cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_REF=master TESTDIR="/home/ubuntu/cephtest"
CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/home/ubuntu/cephtest/binary/usr/local/lib/python2.7/dist-packages:/home/ubuntu/cephtest/binary/usr/local/lib/python2.6/dist-packages"
/home/ubuntu/cephtest/enable-coredump ceph-coverage /home/ubuntu/cephtest/archive/coverage
/home/ubuntu/cephtest/workunit.client.0/mon/crush_ops.sh'''
flavor: basic
mon.a-kernel-sha1: ddcb7527662504b8f676df8950527218ce109680
mon.b-kernel-sha1: ddcb7527662504b8f676df8950527218ce109680
owner: scheduled_teuthology@teuthology
success: false

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2013-04-01_12:48:27-rados-master-testing-basic/7631

Actions #1

Updated by Sage Weil about 11 years ago

I think the problem here is that many/most of the crush ops aren't framed to be idempotent: they do things like return EEXIST/ENOENT instead of being a no-op.
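A minimal sketch of the failure mode described above (hypothetical code, not the actual Ceph monitor implementation): if a crush op returns EEXIST when the change has already been applied, a command replayed after a monitor restart (as the mon thrasher provokes here) reports a spurious failure, whereas an idempotent op treats the replay as success.

```python
# Hypothetical illustration of idempotent vs. non-idempotent command handling.
# CrushMap, add_bucket_strict, and add_bucket_idempotent are invented names
# for the sketch; they do not exist in Ceph.
import errno

class CrushMap(object):
    def __init__(self):
        self.buckets = {}

def add_bucket_strict(crush, name):
    """Non-idempotent: a replayed 'add' fails with EEXIST."""
    if name in crush.buckets:
        return -errno.EEXIST
    crush.buckets[name] = {}
    return 0

def add_bucket_idempotent(crush, name):
    """Idempotent: re-adding an existing bucket is a no-op success."""
    if name in crush.buckets:
        return 0
    crush.buckets[name] = {}
    return 0

crush = CrushMap()
assert add_bucket_strict(crush, "rack1") == 0
# A monitor restart mid-commit can cause the client to resend the command:
assert add_bucket_strict(crush, "rack1") == -errno.EEXIST  # spurious failure
assert add_bucket_idempotent(crush, "rack1") == 0          # replay-safe
```

Under mon thrashing, only the replay-safe form lets a test like crush_ops.sh survive command resends.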

Actions #2

Updated by Sage Weil about 11 years ago

  • Assignee changed from Joao Eduardo Luis to Sage Weil
Actions #3

Updated by Sage Weil about 11 years ago

  • Status changed from New to Fix Under Review
Actions #4

Updated by Sage Weil about 11 years ago

  • Status changed from Fix Under Review to Resolved