Bug #18575

Queued job fails to start then kills worker

Added by David Galloway about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I noticed the number of workers drop on the Grafana dashboard, and correlated the time of the drop with the mtime of this worker log.

http://pulpito.ceph.com/jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi/722896/

root@teuthology:/ceph/teuthology-archive/worker_logs# cat worker.smithi.11960

2017-01-17T00:31:44.649 INFO:root:teuthology version: 1.0.0-ed1a1e9
2017-01-17T00:31:44.714 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_git_teuthology_master was just updated; assuming it is current
2017-01-17T00:31:44.714 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_git_teuthology_master to branch master
2017-01-17T00:31:44.746 INFO:teuthology.repo_utils:Skipping bootstrap as it was already done in the last 60s
2017-01-17T00:31:49.541 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_ceph_master was just updated; assuming it is current
2017-01-17T00:31:49.541 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_ceph_master to branch master
2017-01-17T00:31:49.694 INFO:teuthology.worker:Reserved job 722896
2017-01-17T00:31:49.694 INFO:teuthology.worker:Config is: branch: wip-jd-testing
description: rbd/nbd/{base/install.yaml cluster/{fixed-3.yaml openstack.yaml} fs/xfs.yaml
  msgr-failures/few.yaml objectstore/bluestore.yaml thrashers/default.yaml workloads/rbd_nbd.yaml}
email: dillaman@redhat.com
kernel: {kdb: true, sha1: distro}
last_in_suite: false
machine_type: smithi
name: jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi
nuke-on-error: true
openstack:
- machine: {cpus: 1, disk: 40, ram: 8000}
  volumes: {count: 3, size: 30}
os_type: ubuntu
overrides:
  admin_socket: {branch: wip-jd-testing}
  ceph:
    conf:
      global: {ms inject socket failures: 5000}
      mon: {debug mon: 20, debug ms: 1, debug paxos: 20}
      osd: {bluestore block size: 96636764160, debug bdev: 20, debug bluefs: 20, debug bluestore: 30,
        debug filestore: 20, debug journal: 20, debug ms: 1, debug osd: 25, debug rocksdb: 10,
        enable experimental unrecoverable data corrupting features: '*', osd debug randomize hobject sort order: false,
        osd objectstore: bluestore, osd sloppy crc: true}
    fs: xfs
    log-whitelist: [slow request, wrongly marked me down, objects unfound and apparently
        lost]
    sha1: 7c4bd464baf7dbd4f1f1c6bdfee0bae664479727
  ceph-deploy:
    conf:
      client: {log file: /var/log/ceph/ceph-$name.$pid.log}
      mon: {debug mon: 1, debug ms: 20, debug paxos: 20, osd default pool size: 2}
  install:
    ceph:
      extra_packages: [rbd-nbd]
      sha1: 7c4bd464baf7dbd4f1f1c6bdfee0bae664479727
  thrashosds: {bdev_inject_crash: 2, bdev_inject_crash_probability: 0.5}
  workunit: {sha1: 7c4bd464baf7dbd4f1f1c6bdfee0bae664479727}
owner: jdillaman
priority: 100
repo: git://git.ceph.com/ceph-ci.git
roles:
- [mon.a, mon.c, osd.0, osd.1, osd.2]
- [mon.b, osd.3, osd.4, osd.5]
- [client.0]
sha1: 7c4bd464baf7dbd4f1f1c6bdfee0bae664479727
suite: rbd
suite_branch: wip-jd-testing
suite_relpath: qa
suite_repo: git://git.ceph.com/ceph-ci.git
suite_sha1: 7c4bd464baf7dbd4f1f1c6bdfee0bae664479727
tasks:
- {install: null}
- {ceph: null}
- thrashosds: {timeout: 1200}
- workunit:
    clients:
      client.0: [rbd/rbd-nbd.sh]
teuthology_branch: master
tube: smithi
verbose: false

2017-01-17T00:31:49.724 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_git_teuthology_master was just updated; assuming it is current
2017-01-17T00:31:49.724 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_git_teuthology_master to branch master
2017-01-17T00:31:49.747 INFO:teuthology.repo_utils:Skipping bootstrap as it was already done in the last 60s
2017-01-17T00:31:55.440 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_ceph-c_wip-jd-testing was just updated; assuming it is current
2017-01-17T00:31:55.440 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_ceph-c_wip-jd-testing to branch wip-jd-testing
2017-01-17T00:31:55.469 INFO:teuthology.worker:Creating archive dir /home/teuthworker/archive/jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi/722896
2017-01-17T00:31:56.438 INFO:teuthology.worker:Running job 722896
2017-01-17T00:31:56.461 DEBUG:teuthology.worker:Running: /home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/bin/teuthology -v --lock --block --owner jdillaman --archive /home/teuthworker/archive/jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi/722896 --name jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi --description rbd/nbd/{base/install.yaml cluster/{fixed-3.yaml openstack.yaml} fs/xfs.yaml msgr-failures/few.yaml objectstore/bluestore.yaml thrashers/default.yaml workloads/rbd_nbd.yaml} -- /tmp/teuthology-worker.gJQo_L.tmp
2017-01-17T00:31:56.510 INFO:teuthology.worker:Job archive: /home/teuthworker/archive/jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi/722896
2017-01-17T00:31:56.510 INFO:teuthology.worker:Job PID: 14637
2017-01-17T00:31:56.510 INFO:teuthology.worker:Running with watchdog
2017-01-17T00:33:56.519 DEBUG:teuthology.worker:Worker log: /home/teuthworker/archive/worker_logs/worker.smithi.11960
2017-01-17T12:32:18.964 WARNING:teuthology.worker:Job ran longer than 43200s. Killing...
2017-01-17T12:32:19.022 ERROR:teuthology.worker:run_with_watchdog had an unhandled exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 279, in run_job
    run_with_watchdog(p, job_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 319, in run_with_watchdog
    teuth_config.archive_base)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/kill.py", line 71, in kill_job
    "I could not figure out the owner of the requested job. " 
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.
2017-01-17T12:32:19.082 CRITICAL:teuthology.worker:Uncaught exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/virtualenv/bin/teuthology-worker", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-worker')()
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/scripts/worker.py", line 7, in main
    teuthology.worker.main(parse_args())
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 139, in main
    ctx.verbose,
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 279, in run_job
    run_with_watchdog(p, job_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 319, in run_with_watchdog
    teuth_config.archive_base)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/kill.py", line 71, in kill_job
    "I could not figure out the owner of the requested job. " 
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.
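
The tracebacks above show the RuntimeError from kill_job propagating out of run_with_watchdog and run_job up to main, which ends the worker process. As a rough illustration only (not the actual teuthology fix), a guard around the kill call could keep the worker alive. Everything in this sketch is hypothetical: kill_job_safely, the kill_fn parameter, and the argument order are assumptions, since the ticket does not show kill_job's real signature.

```python
import logging

log = logging.getLogger("worker_sketch")


def kill_job_safely(kill_fn, run_name, job_id, archive_base, owner=None):
    """Call kill_fn and report failure instead of raising.

    kill_fn stands in for a function such as kill_job. Its argument
    order here is an assumption made for this sketch.
    """
    try:
        kill_fn(run_name, job_id, archive_base, owner)
    except RuntimeError:
        # Record the traceback, but let the caller's loop continue.
        log.exception("Could not kill job %s of run %s", job_id, run_name)
        return False
    return True
```

The job config in the log does carry "owner: jdillaman", so passing the owner explicitly to the kill call may be another way to avoid the error; whether that matches the real fix is not stated in this ticket.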

History

#1 Updated by David Galloway about 7 years ago

I touched /home/teuthworker/archive/jdillaman-2017-01-16_15:34:13-rbd-wip-jd-testing-distro-basic-smithi/722896/.preserve to keep the dir.

#2 Updated by David Galloway about 7 years ago

May be a residual failure from this problem: http://tracker.ceph.com/issues/18482

Will leave this ticket open for a day or two to see if more workers die.
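
One way to check for more dead workers is to scan the worker log directory for the failure signature seen in the description. This is a minimal sketch, not an existing teuthology tool: find_dead_workers is a hypothetical helper, and the signature string is copied from the CRITICAL line in the log above.

```python
import os

# Copied from the CRITICAL line in the worker log in the description.
SIGNATURE = "CRITICAL:teuthology.worker:Uncaught exception"


def find_dead_workers(log_dir):
    """Return sorted (mtime, path) pairs for logs containing SIGNATURE."""
    hits = []
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        # errors="replace" tolerates stray non-UTF-8 bytes in the logs.
        with open(path, errors="replace") as f:
            if any(SIGNATURE in line for line in f):
                hits.append((os.path.getmtime(path), path))
    return sorted(hits)
```

Calling find_dead_workers("/home/teuthworker/archive/worker_logs") would list matching logs ordered by mtime, which can then be compared against the time the worker count dropped on the dashboard.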

#3 Updated by David Galloway about 7 years ago

Found some more occurrences:

teuthology-2017-01-16_11:00:03-rbd-kraken-distro-basic-smithi/721193

2017-01-18T19:35:33.385 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_git_teuthology_master was just updated; assuming it is current
2017-01-18T19:35:33.386 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_git_teuthology_master to branch master
2017-01-18T19:35:33.426 INFO:teuthology.repo_utils:Skipping bootstrap as it was already done in the last 60s
2017-01-18T19:35:36.786 INFO:teuthology.repo_utils:/home/teuthworker/src/git.ceph.com_ceph_kraken was just updated; assuming it is current
2017-01-18T19:35:36.786 INFO:teuthology.repo_utils:Resetting repo at /home/teuthworker/src/git.ceph.com_ceph_kraken to branch kraken
2017-01-18T19:35:36.900 INFO:teuthology.worker:Creating archive dir /home/teuthworker/archive/teuthology-2017-01-16_11:00:03-rbd-kraken-distro-basic-smithi/721193
2017-01-18T19:35:36.901 INFO:teuthology.worker:Running job 721193
2017-01-18T19:35:36.931 DEBUG:teuthology.worker:Running: /home/teuthworker/src/git.ceph.com_git_teuthology_master/virtualenv/bin/teuthology -v --lock --block --owner scheduled_teuthology@teuthology --archive /home/teuthworker/archive/teuthology-2017-01-16_11:00:03-rbd-kraken-distro-basic-smithi/721193 --name teuthology-2017-01-16_11:00:03-rbd-kraken-distro-basic-smithi --description rbd/librbd/{cache/writeback.yaml clusters/{fixed-3.yaml openstack.yaml} config/none.yaml fs/xfs.yaml msgr-failures/few.yaml objectstore/filestore.yaml pool/replicated-data-pool.yaml workloads/c_api_tests.yaml} -- /tmp/teuthology-worker.v8AWpK.tmp
2017-01-18T19:35:36.966 INFO:teuthology.worker:Job archive: /home/teuthworker/archive/teuthology-2017-01-16_11:00:03-rbd-kraken-distro-basic-smithi/721193
2017-01-18T19:35:36.967 INFO:teuthology.worker:Job PID: 29519
2017-01-18T19:35:36.968 INFO:teuthology.worker:Running with watchdog
2017-01-18T19:37:36.969 DEBUG:teuthology.worker:Worker log: /home/teuthworker/archive/worker_logs/worker.smithi.13184
2017-01-19T07:35:53.441 WARNING:teuthology.worker:Job ran longer than 43200s. Killing...
2017-01-19T07:35:53.635 ERROR:teuthology.worker:run_with_watchdog had an unhandled exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 279, in run_job
    run_with_watchdog(p, job_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 319, in run_with_watchdog
    teuth_config.archive_base)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/kill.py", line 71, in kill_job
    "I could not figure out the owner of the requested job. " 
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.
2017-01-19T07:35:53.903 CRITICAL:teuthology.worker:Uncaught exception
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/virtualenv/bin/teuthology-worker", line 11, in <module>
    load_entry_point('teuthology', 'console_scripts', 'teuthology-worker')()
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/scripts/worker.py", line 7, in main
    teuthology.worker.main(parse_args())
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 139, in main
    ctx.verbose,
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 279, in run_job
    run_with_watchdog(p, job_config)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/worker.py", line 319, in run_with_watchdog
    teuth_config.archive_base)
  File "/home/teuthworker/src/git.ceph.com_teuthology_master/teuthology/kill.py", line 71, in kill_job
    "I could not figure out the owner of the requested job. " 
RuntimeError: I could not figure out the owner of the requested job. Please pass --owner <owner>.

jdillaman-2017-01-18_08:34:20-rbd-wip-jd-testing-distro-basic-smithi/728120

#4 Updated by Zack Cerza about 7 years ago

  • Status changed from New to Resolved
