Bug #16142
Exception during internal.connect fails to unlock machines
% Done: 0%
Description
The symptom is that jobs which have failed still have their nodes locked.
The machine was still locked:
{ "is_vm": false, "locked": true, "locked_since": "2016-06-03 00:22:35.009101", "locked_by": "scheduled_yuriw@teuthology", "up": true, "mac_address": null, "name": "smithi055.front.sepia.ceph.com", "os_version": "14.04", "machine_type": "smithi", "vm_host": null, "os_type": "ubuntu", "arch": "x86_64", "ssh_pub_key": "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCcS0/jTSprtfXdi+1HQmxsNIkMGuOTkCjfl7ETuuuGGXBc4aO4C9p4fibGrsxQdtdZ4rZF6q4yzrZeoBC+54f9QimIy+amq612yXWZNelXMKQBNM3gcnZaw1YPdk2zBq0OP/rv3o+WP2CjNpSD3Izev9DVIavDv1S4s9nOfIpJvGE/n93f9tA+pAOXhd7MiPvPXns+rByX4UZmtvpXIsDMOimGo/b9La7asXvjx4eikFz2oCd+1s07dAmvRm0NyttjkNduDD3ewXVbBf8046P6cZOCPVe4tihHug96MwvmEfXw5pDd6AKBIx78bhrUEej/871ybYHLXpiZB130HOPB", "description": "/var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564" },
The worker had gone on to the next job:
2016-06-02T14:29:26.251 INFO:teuthology.worker:Creating archive dir /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564
2016-06-02T14:29:26.252 INFO:teuthology.worker:Running job 230564
2016-06-02T14:29:26.271 DEBUG:teuthology.worker:Running: /var/lib/teuthworker/src/teuthology_master/virtualenv/bin/teuthology -v --lock --block --owner scheduled_yuriw@teuthology --archive /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564 --name yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi --description rados/thrash/{hobj-sort.yaml rados.yaml rocksdb.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/normal_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml msgr/random.yaml msgr-failures/fastclose.yaml thrashers/pggrow.yaml workloads/rados_api_tests.yaml} -- /tmp/teuthology-worker.AOpNa2.tmp
2016-06-02T14:29:26.275 INFO:teuthology.worker:Job archive: /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230564
2016-06-02T14:29:26.276 INFO:teuthology.worker:Job PID: 23645
2016-06-02T14:29:26.276 INFO:teuthology.worker:Running with watchdog
2016-06-02T14:31:26.277 DEBUG:teuthology.worker:Worker log: /var/lib/teuthworker/archive/worker_logs/worker.smithi.29062
2016-06-02T17:23:32.616 ERROR:teuthology.worker:Child exited with code 1
2016-06-02T17:23:32.620 INFO:teuthology.worker:Reserved job 230618
2016-06-02T17:23:32.620 INFO:teuthology.worker:Config is: branch: wip-yuri-testing
description: rados/thrash/{hobj-sort.yaml rados.yaml rocksdb.yaml 0-size-min-size-overrides/2-size-2-min-size.yaml 1-pg-log-overrides/short_pg_log.yaml clusters/{fixed-2.yaml openstack.yaml} fs/xfs.yaml msgr/random.yaml msgr-failures/fastclose.yaml thrashers/default.yaml workloads/rgw_snaps.yaml}
email: null
kernel: {kdb: true, sha1: distro}
last_in_suite: false
machine_type: smithi
name: yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi
nuke-on-error: true
openstack:
- volumes: {count: 3, size: 30}
overrides:
  admin_socket: {branch: wip-yuri-testing}
  ceph:
    conf:
      client: {debug ms: 1, debug rgw: 20}
      global: {enable experimental unrecoverable data corrupting features: '*', ms inject socket failures: 2500, ms tcp read timeout: 5, ms type: random, osd_max_pg_log_entries: 300, osd_min_pg_log_entries: 150, osd_pool_default_min_size: 2, osd_pool_default_size: 2}
      mon: {debug mon: 20, debug ms: 1, debug paxos: 20, mon keyvaluedb: rocksdb}
      osd: {debug filestore: 20, debug journal: 20, debug ms: 1, debug osd: 25, osd debug randomize hobject sort order: true, osd op queue: debug_random, osd op queue cut off: debug_random, osd sloppy crc: true}
    fs: xfs
    log-whitelist: [slow request]
    sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1
  ceph-deploy:
    branch: {dev-commit: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
    conf:
      client: {log file: /var/log/ceph/ceph-$name.$pid.log}
      mon: {debug mon: 1, debug ms: 20, debug paxos: 20, osd default pool size: 2}
  install:
    ceph: {sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
  workunit: {sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1}
owner: scheduled_yuriw@teuthology
priority: 100
roles:
- [mon.a, mon.c, osd.0, osd.1, osd.2, client.0]
- [mon.b, osd.3, osd.4, osd.5, client.1]
sha1: 22ad94cd61ee7714da6b3c851967d1b6e44ae6c1
suite: rados
suite_branch: master
suite_sha1: bd14a8e13b94b7b50ec060e438d8d7096ae78aeb
tasks:
- {install: null}
- ceph:
    conf:
      osd: {osd debug reject backfill probability: 0.3, osd max backfills: 1, osd scrub max interval: 120, osd scrub min interval: 60}
    log-whitelist: [wrongly marked me down, objects unfound and apparently lost]
- thrashosds: {chance_pgnum_grow: 1, chance_pgpnum_fix: 1, timeout: 1200}
- rgw: {client.0: null, default_idle_timeout: 3600}
- thrash_pool_snaps:
    pools: [.rgw.buckets, .rgw.root, .rgw.control, .rgw, .users.uid, .users.email, .users]
- s3readwrite:
    client.0:
      readwrite:
        bucket: rwtest
        duration: 300
        files: {num: 10, size: 2000, stddev: 500}
        readers: 10
        writers: 3
      rgw_server: client.0
teuthology_branch: master
tube: smithi
verbose: false
2016-06-02T17:23:32.652 INFO:teuthology.repo_utils:Fetching from upstream into /var/lib/teuthworker/src/teuthology_master
2016-06-02T17:23:32.696 INFO:teuthology.repo_utils:Resetting repo at /var/lib/teuthworker/src/teuthology_master to branch master
2016-06-02T17:23:32.706 INFO:teuthology.repo_utils:Bootstrapping /var/lib/teuthworker/src/teuthology_master
2016-06-02T17:23:43.105 INFO:teuthology.repo_utils:Bootstrap exited with status 0
2016-06-02T17:23:43.115 INFO:teuthology.repo_utils:Fetching from upstream into /var/lib/teuthworker/src/ceph-qa-suite_master
2016-06-02T17:23:43.206 INFO:teuthology.repo_utils:Resetting repo at /var/lib/teuthworker/src/ceph-qa-suite_master to branch master
2016-06-02T17:23:43.226 INFO:teuthology.worker:Creating archive dir /var/lib/teuthworker/archive/yuriw-2016-06-02_11:43:49-rados-wip-yuri-testing-distro-basic-smithi/230618
2016-06-02T17:23:43.226 INFO:teuthology.worker:Running job 230618
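In other words, the nodes are locked before internal.connect runs, and when connect raises, nothing on the error path releases them; the worker simply moves on to the next job while the locks persist. A minimal, self-contained sketch of that flow (the lock/unlock helpers here are illustrative stand-ins for the lock-server calls, not teuthology's actual API):

import logging
import paramiko

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("connect-sketch")

# Stand-in lock table instead of the real lock server (illustrative only).
locked = set()

def lock(name):
    locked.add(name)
    log.info("locked %s", name)

def unlock(name):
    locked.discard(name)
    log.info("unlocked %s", name)

def connect_all(names):
    """Open an SSH connection to every node, as internal.connect does."""
    conns = {}
    for name in names:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(name, username="ubuntu")  # this is where the exception hits
        conns[name] = client
    return conns

def run(names):
    for name in names:
        lock(name)
    try:
        conns = connect_all(names)
        # ... run the actual test tasks over `conns` here ...
    finally:
        # Without an unlock (or nuke-on-error) step on the error path, an
        # exception in connect_all leaves every node locked, which is the
        # symptom described above.
        for name in names:
            unlock(name)

The try/finally only illustrates where an unlock-on-error step has to sit; whether that belongs in the task, the runner, or an external reaper is what the comments below discuss.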
History
#1 Updated by Kefu Chai almost 8 years ago
- Status changed from New to 12
I think it's a regression in paramiko [1]. We should probably pin paramiko at 1.17.0.
---
[1] https://github.com/paramiko/paramiko/issues/751, https://github.com/paramiko/paramiko/issues/750
#2 Updated by Kefu Chai almost 8 years ago
- Status changed from 12 to Fix Under Review
- Assignee set to Kefu Chai
#3 Updated by Vasu Kulkarni almost 8 years ago
This is not an issue with paramiko. It's a problem that needs to be solved in nuke, with whatever retries are necessary to reconnect. For now, I think we should add a simple hourly cron job to collect stale nodes and free them.
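A minimal sketch of such a cron job, assuming a paddles-style lock server at a placeholder URL and a 12-hour staleness threshold (the URL, threshold, and unlock endpoint are all assumptions; a real version would also have to verify that the owning job is no longer running before freeing a node):

import datetime
import requests

# Placeholder values; neither the URL nor the threshold comes from teuthology.
LOCK_SERVER = "http://paddles.example.com"
STALE_AFTER = datetime.timedelta(hours=12)

def find_stale_nodes():
    """Return locked nodes whose lock is older than STALE_AFTER."""
    nodes = requests.get(LOCK_SERVER + "/nodes/", params={"locked": "true"}).json()
    now = datetime.datetime.utcnow()
    stale = []
    for node in nodes:
        locked_since = datetime.datetime.strptime(
            node["locked_since"], "%Y-%m-%d %H:%M:%S.%f")
        if now - locked_since > STALE_AFTER:
            stale.append(node)
    return stale

def free(node):
    """Release one stale lock on behalf of its owner (assumed endpoint/shape)."""
    requests.put(LOCK_SERVER + "/nodes/" + node["name"] + "/lock/",
                 json={"locked": False, "locked_by": node["locked_by"]})

if __name__ == "__main__":
    for node in find_stale_nodes():
        print("unlocking %s (locked by %s since %s)"
              % (node["name"], node["locked_by"], node["locked_since"]))
        free(node)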
#4 Updated by Kefu Chai almost 8 years ago
#5 Updated by Kefu Chai almost 8 years ago
- Status changed from Fix Under Review to 12
- Assignee deleted (Kefu Chai)
Unassigning myself; I don't have enough expertise in this area.
#6 Updated by Kefu Chai almost 8 years ago
@Vasu wrote:
"This is not an issue with paramiko."
Could you elaborate a little on this?
#7 Updated by Kefu Chai almost 8 years ago
- Priority changed from Urgent to Immediate
#8 Updated by Zack Cerza almost 8 years ago
I think this is https://github.com/paramiko/paramiko/issues/104
#9 Updated by Zack Cerza almost 8 years ago
The actual exception:
2016-06-04T11:20:26.153 INFO:teuthology.run_tasks:Running task internal.connect...
2016-06-04T11:20:26.183 INFO:teuthology.task.internal:Opening connections...
2016-06-04T11:20:26.183 DEBUG:teuthology.task.internal:connecting to ubuntu@mira038.front.sepia.ceph.com
2016-06-04T11:20:26.428 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 66, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 45, in run_one_task
    return fn(**kwargs)
  File "/home/teuthworker/src/teuthology_master/teuthology/task/internal.py", line 343, in connect
    rem.connect()
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/remote.py", line 63, in connect
    self.ssh = connection.connect(**args)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/connection.py", line 74, in connect
    key=_create_key(keytype, key)
  File "/home/teuthworker/src/teuthology_master/teuthology/orchestra/connection.py", line 33, in create_key
    return paramiko.rsakey.RSAKey(data=base64.decodestring(key))
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/paramiko/rsakey.py", line 58, in __init__
    ).public_key(default_backend())
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/cryptography/hazmat/backends/__init__.py", line 35, in default_backend
    _default_backend = MultiBackend(_available_backends())
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/cryptography/hazmat/backends/multibackend.py", line 33, in __init__
    "Multibackend cannot be initialized with no backends. If you "
ValueError: Multibackend cannot be initialized with no backends. If you are seeing this error when trying to use default_backend() please try uninstalling and reinstalling cryptography.
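Note that the ValueError is raised on the teuthworker host by cryptography's default_backend(), before any SSH traffic happens. One possible mitigation (my suggestion only, not something teuthology currently does) is a self-check that exercises the same RSAKey-parsing path at worker startup, before any machines are locked, so a broken cryptography install fails fast instead of stranding locked nodes:

import base64

import paramiko
from cryptography.hazmat.backends import default_backend

def crypto_selfcheck():
    """Fail fast if the cryptography backend is unusable.

    Exercises the same code path internal.connect hits when it rebuilds an
    RSAKey from the public key stored in the lock database, but with a
    throwaway locally generated key, so no nodes need to be locked first.
    """
    default_backend()                     # raises the MultiBackend ValueError
    key = paramiko.RSAKey.generate(2048)  # if cryptography is broken
    blob = key.get_base64()
    paramiko.RSAKey(data=base64.b64decode(blob))

if __name__ == "__main__":
    crypto_selfcheck()
    print("cryptography/paramiko look usable")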
#10 Updated by Zack Cerza almost 8 years ago
- Status changed from 12 to Closed
This hasn't happened in a month: http://sentry.ceph.com/sepia/teuthology/issues/913/
#11 Updated by Zack Cerza over 7 years ago
For reference, the Sentry event: http://sentry.ceph.com/sepia/teuthology/issues/1341/
#12 Updated by John Spray over 7 years ago
- Status changed from Closed to New
The "Multibackend cannot be initialized with no backends." exception seems to have never really gone away.
http://pulpito.ceph.com/jspray-2016-11-22_02:58:30-fs-wip-jcsp-testing-20161121-distro-basic-smithi
http://pulpito.ceph.com/teuthology-2016-11-21_17:20:02-krbd-master-testing-basic-mira/
http://pulpito.ceph.com/teuthology-2016-11-04_13:00:02-rados-hammer-distro-basic-vps/
http://pulpito.ceph.com/teuthology-2016-11-06_02:01:19-rbd-master-distro-basic-smithi/
...and many more (found by searching my folder of ceph-qa mails).
#13 Updated by Dan Mick over 7 years ago
- Assignee set to Zack Cerza
#14 Updated by Sage Weil over 5 years ago
- Status changed from New to Won't Fix
I think this is moot with FOG?