Bug #16142

Exception during internal.connect fails to unlock machines

Added by John Spray almost 8 years ago. Updated over 5 years ago.

Status: Won't Fix
Priority: Immediate
Assignee:
Category: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Symptom is that jobs which have failed still have locked nodes.

Here's one:
http://qa-proxy.ceph.com/teuthology/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564/worker.log

The machine was still locked:

    {
        "is_vm": false, 
        "locked": true, 
        "locked_since": "2016-06-03 00:22:35.009101", 
        "locked_by": "scheduled_yuriw@teuthology", 
        "up": true, 
        "mac_address": null, 
        "name": "smithi055.front.sepia.ceph.com", 
        "os_version": "14.04", 
        "machine_type": "smithi", 
        "vm_host": null, 
        "os_type": "ubuntu", 
        "arch": "x86_64", 
        "ssh_pub_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCcS0/jTSprtfXdi+1HQmxsNIkMGuOTkCjfl7ETuuuGGXBc4aO4C9p4fibGrsxQdtdZ4rZF6q4yzrZeoBC+54f9QimIy+amq612yXWZNelXMKQBNM3gcnZaw1YPdk2zBq0OP/rv3o+WP2CjNpSD3Izev9DVIavDv1S4s9nOfIpJvGE/n93f9tA+pAOXhd7MiPvPXns+rByX4UZmtvpXIsDMOimGo/b9La7asXvjx4eikFz2oCd+1s07dAmvRm0NyttjkNduDD3ewXVbBf8046P6cZOCPVe4tihHug96MwvmEfXw5pDd6AKBIx78bhrUEej/871ybYHLXpiZB130HOPB", 
        "description": "/var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564" 
    }, 

The worker had gone on to the next job:

2016-06-02T14:29:26.251 INFO:teuthology.worker:Creating archive dir /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564
2016-06-02T14:29:26.252 INFO:teuthology.worker:Running job 230564
2016-06-02T14:29:26.271 DEBUG:teuthology.worker:Running: /var/lib/teuthworker/src/teuthology_master/virtualenv/bin/teuthology -v --lock --block --owner scheduled_yuriw@teuthology --archive /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564 --name yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi --description rados/thrash/{hobj-sort.yaml rados.yaml rocksdb.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml msgr/random.yaml msgr-failures/fastclose.yaml thrashers/pggrow.yaml workloads/rados_api_tests.yaml} -- /tmp/teuthology-worker.AOpNa2.tmp
2016-06-02T14:29:26.275 INFO:teuthology.worker:Job archive: /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564
2016-06-02T14:29:26.276 INFO:teuthology.worker:Job PID: 23645
2016-06-02T14:29:26.276 INFO:teuthology.worker:Running with watchdog
2016-06-02T14:31:26.277 DEBUG:teuthology.worker:Worker log: /var/lib/teuthworker/archive/worker_logs/worker.smithi.29062
2016-06-02T17:23:32.616 ERROR:teuthology.worker:Child exited with code 1
2016-06-02T17:23:32.620 INFO:teuthology.worker:Reserved job 230618
2016-06-02T17:23:32.620 INFO:teuthology.worker:Config is: branch: wip-yuri-testing
description: rados/thrash/{hobj-sort.yaml rados.yaml rocksdb.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml
  1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml
  msgr/random.yaml msgr-failures/fastclose.yaml thrashers/default.yaml workloads/rgw_snaps.yaml}
email: null
kernel: {kdb: true, sha1: distro}
last_in_suite: false
machine_type: smithi
name: yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi
nuke-on-error: true
openstack:
- volumes: {count: 3, size: 30}
overrides:
  admin_socket: {branch: wip-yuri-testing}
  ceph:
    conf:
      client: {debug ms: 1, debug rgw: 20}
      global: {enable experimental unrecoverable data corrupting features: '*', ms inject socket failures: 2500,
        ms tcp read timeout: 5, ms type: random, osd_max_pg_log_entries: 300, osd_min_pg_log_entries: 150,
        osd_pool_default_min_size: 2, osd_pool_default_size: 2}
      mon: {debug mon: 20, debug ms: 1, debug paxos: 20, mon keyvaluedb: rocksdb}
      osd: {debug filestore: 20, debug journal: 20, debug ms: 1, debug osd: 25, osd debug randomize hobject sort order: true,
        osd op queue: debug_random, osd op queue cut off: debug_random, osd sloppy crc: true}
    fs: xfs
    log-whitelist: [slow request]
    sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1
  ceph-deploy:
    branch: {dev-commit: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
    conf:
      client: {log file: /var/log/ceph/ceph-$name.$pid.log}
      mon: {debug mon: 1, debug ms: 20, debug paxos: 20, osd default pool size: 2}
  install:
    ceph: {sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
  workunit: {sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
owner: scheduled_yuriw@teuthology
priority: 100
roles:
- [mon.a, mon.c, osd.0, osd.1, osd.2, client.0]
- [mon.b, osd.3, osd.4, osd.5, client.1]
sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1
suite: rados
suite_branch: master
suite_sha1: bd14a8e13b94b7b50ec060e438d8d7096ae78aeb
tasks:
- {install: null}
- ceph:
    conf:
      osd: {osd debug reject backfill probability: 0.3, osd max backfills: 1, osd scrub max interval: 120,
        osd scrub min interval: 60}
    log-whitelist: [wrongly marked me down, objects unfound and apparently lost]
- thrashosds: {chance_pgnum_grow: 1, chance_pgpnum_fix: 1, timeout: 1200}
- rgw: {client.0: null, default_idle_timeout: 3600}
- thrash_pool_snaps:
    pools: [.rgw.buckets, .rgw.root, .rgw.control, .rgw, .users.uid, .users.email,
      .users]
- s3readwrite:
    client.0:
      readwrite:
        bucket: rwtest
        duration: 300
        files: {num: 10, size: 2000, stddev: 500}
        readers: 10
        writers: 3
      rgw_server: client.0
teuthology_branch: master
tube: smithi
verbose: false

2016-06-02T17:23:32.652 INFO:teuthology.repo_utils:Fetching from upstream into /var/lib/teuthworker/src/teuthology_master
2016-06-02T17:23:32.696 INFO:teuthology.repo_utils:Resetting repo at /var/lib/teuthworker/src/teuthology_master to branch master
2016-06-02T17:23:32.706 INFO:teuthology.repo_utils:Bootstrapping /var/lib/teuthworker/src/teuthology_master
2016-06-02T17:23:43.105 INFO:teuthology.repo_utils:Bootstrap exited with status 0
2016-06-02T17:23:43.115 INFO:teuthology.repo_utils:Fetching from upstream into /var/lib/teuthworker/src/ceph-qa-suite_master
2016-06-02T17:23:43.206 INFO:teuthology.repo_utils:Resetting repo at /var/lib/teuthworker/src/ceph-qa-suite_master to branch master
2016-06-02T17:23:43.226 INFO:teuthology.worker:Creating archive dir /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230618
2016-06-02T17:23:43.226 INFO:teuthology.worker:Running job 230618

History

#1 Updated by Kefu Chai almost 8 years ago

  • Status changed from New to 12

I think it's a regression in paramiko [1]. Probably we should stick with paramiko 1.17.0.

---
[1] https://github.com/paramiko/paramiko/issues/751, https://github.com/paramiko/paramiko/issues/750
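
If pinning is the chosen workaround, a small guard along these lines could catch an accidental upgrade in the worker virtualenv. This is only a sketch, not existing teuthology code; the known-good version comes from the comment above.

    # Hypothetical guard (sketch): refuse to start if the virtualenv has drifted
    # from the known-good paramiko release suggested above.
    import paramiko

    KNOWN_GOOD = "1.17.0"

    def paramiko_is_pinned():
        # paramiko exposes its release string as paramiko.__version__
        return paramiko.__version__ == KNOWN_GOOD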

#2 Updated by Kefu Chai almost 8 years ago

  • Status changed from 12 to Fix Under Review
  • Assignee set to Kefu Chai

#3 Updated by Vasu Kulkarni almost 8 years ago

This is not an issue with paramiko. It's a problem that needs to be solved in nuke, with whatever retries are necessary to reconnect. I think we should add a simple hourly cronjob
to collect stale nodes and free them for now.
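
A minimal sketch of such a collector follows. It assumes a lock server whose /nodes endpoint returns records shaped like the JSON in the description; the URL, endpoint, and staleness threshold are placeholders, not the real Sepia lock server API.

    # Hypothetical hourly cron job body (sketch only, not existing teuthology code).
    import datetime
    import requests

    LOCK_SERVER = "http://lock.example.com"   # placeholder, not the real lock server
    MAX_AGE = datetime.timedelta(hours=12)    # arbitrary staleness threshold

    def stale_nodes():
        now = datetime.datetime.utcnow()
        for node in requests.get(LOCK_SERVER + "/nodes").json():
            if not node["locked"]:
                continue
            locked_since = datetime.datetime.strptime(
                node["locked_since"], "%Y-%m-%d %H:%M:%S.%f")
            if now - locked_since > MAX_AGE:
                # Candidates to free (or at least report) rather than leaving
                # them held by a job that has already failed.
                yield node["name"], node["locked_by"], node["description"]

    if __name__ == "__main__":
        for name, owner, job in stale_nodes():
            print("stale lock: %s held by %s for %s" % (name, owner, job))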

#5 Updated by Kefu Chai almost 8 years ago

  • Status changed from Fix Under Review to 12
  • Assignee deleted (Kefu Chai)

Reassigning from myself. I don't have enough expertise in this area.

#6 Updated by Kefu Chai almost 8 years ago

@Vasu

> This is not an issue with paramiko,

Could you elaborate a little bit on this?

#7 Updated by Kefu Chai almost 8 years ago

  • Priority changed from Urgent to Immediate

#9 Updated by Zack Cerza almost 8 years ago

The actual exception:

2016-06-04T11:20:26.153 INFO:teuthology.run_tasks:Running task internal.connect...
2016-06-04T11:20:26.183 INFO:teuthology.task.internal:Opening connections...
2016-06-04T11:20:26.183 DEBUG:teuthology.task.internal:connecting to ubuntu@mira038.front.sepia.ceph.com
2016-06-04T11:20:26.428 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 66, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 45, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/internal.py", line 343, in connect
    rem.connect()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 63, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/connection.py", line 74, in connect
    key=_create_key(keytype, key)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/connection.py", line 33, in create_key
    return paramiko.rsakey.RSAKey(data=base64.decodestring(key))
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/rsakey.py", line 58, in __init__
    ).public_key(default_backend())
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/cryptography/hazmat/backends/__init__.py", line 35, in default_backend
    _default_backend = MultiBackend(_available_backends())
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/cryptography/hazmat/backends/multibackend.py", line 33, in __init__
    "Multibackend cannot be initialized with no backends. If you " 
ValueError: Multibackend cannot be initialized with no backends. If you are seeing this error when trying to use default_backend() please try uninstalling and reinstalling cryptography.
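
The failing call is cryptography's default_backend(), invoked while paramiko builds an RSAKey from the node's public key; once it raises, internal.connect aborts and the machines locked earlier in the run are never unlocked. A preflight check along these lines (a sketch, not existing teuthology code) would surface a broken cryptography install before any machines are locked:

    # Hypothetical preflight check (sketch): exercise the same default_backend()
    # call that fails in the traceback above, before locking any machines.
    from cryptography.hazmat.backends import default_backend

    def crypto_backend_ok():
        try:
            default_backend()
            return True
        except ValueError:
            # "Multibackend cannot be initialized with no backends ..." -- the
            # error seen in this job.
            return False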

#10 Updated by Zack Cerza almost 8 years ago

  • Status changed from 12 to Closed

#11 Updated by Zack Cerza over 7 years ago

#13 Updated by Dan Mick over 7 years ago

  • Assignee set to Zack Cerza

#14 Updated by Sage Weil over 5 years ago

  • Status changed from New to Won't Fix

I think this is moot with fog?
