Bug #14518
closed
Bad interaction between deadlocking teuthology processes
% Done: 0%
Source: Q/A
Regression: No
Severity: 3 - minor
Description
Sometimes (maybe after disk failures?) pulpito still shows runs that have been killed.
As of today, see "greg-fs-speculative-119" at http://pulpito.ceph.com/?status=running
This run is actually dead.
Not sure what a possible solution would be here. Maybe a way in the pulpito UI to clean those up, Zack?
Updated by Zack Cerza over 8 years ago
- Project changed from pulpito to teuthology
- Subject changed from pulpito shows killed runs to Bad interaction between deadlocking teuthology processes
- Priority changed from Normal to High
Not pulpito. The worker broke:
2016-01-20T16:56:40.196 INFO:teuthology.worker:Running job 34218
2016-01-20T16:56:40.220 INFO:teuthology.worker:Job archive: /var/lib/teuthworker/archive/gregf-2016-01-19_23:14:25-fs-greg-fs-speculative-119---basic-mira/34218
2016-01-20T16:56:40.221 INFO:teuthology.worker:Job PID: 18877
2016-01-20T16:56:40.221 INFO:teuthology.worker:Running with watchdog
2016-01-20T16:58:40.222 DEBUG:teuthology.worker:Worker log: /var/lib/teuthworker/archive/worker_logs/worker.mira.23940
2016-01-22T04:57:56.450 WARNING:teuthology.worker:Job ran longer than 129600s. Killing...
2016-01-22T04:57:56.822 INFO:teuthology.kill:Killing Pids: set([18877])
2016-01-22T04:57:56.832 INFO:teuthology.kill:Nuking machines: ['mira078', 'mira115']
2016-01-22T04:57:57.238 INFO:teuthology.kill:2016-01-22 04:57:57,237.237 INFO:teuthology.nuke:targets:
2016-01-22T04:57:57.239 INFO:teuthology.kill: ubuntu@mira078.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbfD/mcGMuU/3lVzxg2tF82DDXuCUemKjVt0TBXMaRA+qofkwh/lJRx/9RDlwMf/ZueSFWDdVa4xLGhXqNjweJo1N1nuvlqdAua4HReEVnkQSZI2Ox6uZIyIo23ZVrjysItgTkZT05I61lIp3nXai7xJid9bB4rba3ru+Js/0xyZYk3vCrQfIM85HVFRXn1eOODKFKTQYZvgyr5i9ar+nTAL3Wy4hFpBdSJg5oZrmXmqYjbE2huFQW1RYKBB1CEuQE2fL9F6YwMnTX2XcRzaaDYkW/TLcabcqeOGmPAyOkXWp1g59l5iCs4dtyiwjuYXOBSOiiosQWAvGRb/tJXWkZ
2016-01-22T04:57:57.239 INFO:teuthology.kill: ubuntu@mira115.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOEBlMt8gBzs9rDmsi9oBMTrAouaeAFqQjHmlOaIPqcKTN5pJvtR70pAyoimyNiMuBOPRVqycJIL+jXdhcZlNz+uB96X6+MFhsjV5WzC+UhWEbUra95m12hm/LBp+QsudtkykngynMuspMfM7Ullod5RUMHvTdsNQVQxUIsdwJmsPBpBLKHN0//T9Cokmv6yvT+qQk5mRkBCg3MocS8CPEfXSygdj1uKX4V8XwZJ+P627vjNJP5KtJi0f6nU1fmlEvHP6OSz/wN0Ks27JYo8YOaQaF/EQBMYQWUF/MzaM3HqCxXnpqpbAD/60bUBfayTbJe8w7YFuORCgOVoPMtNt1
2016-01-22T04:57:57.284 INFO:teuthology.kill:2016-01-22 04:57:57,283.283 INFO:teuthology.nuke:checking console status of mira115.ipmi.sepia.ceph.com
2016-01-22T04:57:57.292 INFO:teuthology.kill:2016-01-22 04:57:57,291.291 INFO:teuthology.nuke:checking console status of mira078.ipmi.sepia.ceph.com
2016-01-22T04:57:57.591 INFO:teuthology.kill:2016-01-22 04:57:57,590.590 INFO:teuthology.nuke:console ready on mira115.ipmi.sepia.ceph.com
2016-01-22T04:57:57.591 INFO:teuthology.kill:2016-01-22 04:57:57,591.591 INFO:teuthology.task.internal:Checking locks...
2016-01-22T04:57:57.603 INFO:teuthology.kill:2016-01-22 04:57:57,603.603 INFO:teuthology.nuke:console ready on mira078.ipmi.sepia.ceph.com
2016-01-22T04:57:57.604 INFO:teuthology.kill:2016-01-22 04:57:57,603.603 INFO:teuthology.task.internal:Checking locks...
2016-01-22T04:57:57.625 INFO:teuthology.kill:2016-01-22 04:57:57,625.625 INFO:teuthology.task.internal:Opening connections...
2016-01-22T04:57:57.639 INFO:teuthology.kill:2016-01-22 04:57:57,638.638 ERROR:teuthology.nuke:Could not nuke {'ubuntu@mira078.front.sepia.ceph.com': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbfD/mcGMuU/3lVzxg2tF82DDXuCUemKjVt0TBXMaRA+qofkwh/lJRx/9RDlwMf/ZueSFWDdVa4xLGhXqNjweJo1N1nuvlqdAua4HReEVnkQSZI2Ox6uZIyIo23ZVrjysItgTkZT05I61lIp3nXai7xJid9bB4rba3ru+Js/0xyZYk3vCrQfIM85HVFRXn1eOODKFKTQYZvgyr5i9ar+nTAL3Wy4hFpBdSJg5oZrmXmqYjbE2huFQW1RYKBB1CEuQE2fL9F6YwMnTX2XcRzaaDYkW/TLcabcqeOGmPAyOkXWp1g59l5iCs4dtyiwjuYXOBSOiiosQWAvGRb/tJXWkZ'}
2016-01-22T04:57:57.640 INFO:teuthology.kill:Traceback (most recent call last):
2016-01-22T04:57:57.640 INFO:teuthology.kill: File "/home/teuthworker/src/teuthology_master/teuthology/nuke.py", line 613, in nuke_one
2016-01-22T04:57:57.640 INFO:teuthology.kill: nuke_helper(ctx, should_unlock)
2016-01-22T04:57:57.641 INFO:teuthology.kill: File "/home/teuthworker/src/teuthology_master/teuthology/nuke.py", line 663, in nuke_helper
2016-01-22T04:57:57.641 INFO:teuthology.kill: check_lock(ctx, None, check_up=False)
2016-01-22T04:57:57.641 INFO:teuthology.kill: File "/home/teuthworker/src/teuthology_master/teuthology/task/internal.py", line 227, in check_lock
2016-01-22T04:57:57.641 INFO:teuthology.kill: owner=ctx.owner,
2016-01-22T04:57:57.642 INFO:teuthology.kill:AssertionError: machine ubuntu@mira078.front.sepia.ceph.com is locked by scheduled_sage@teuthology, not scheduled_gregf@teuthology
But the job had unlocked the node before deadlocking (somehow):
2016-01-20T17:33:02.802 INFO:teuthology.task.install:Purging /var/lib/ceph on ubuntu@mira078.front.sepia.ceph.com
2016-01-20T17:33:02.803 INFO:teuthology.orchestra.run.mira078:Running: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph"
2016-01-20T17:33:02.911 DEBUG:teuthology.parallel:result is None
2016-01-20T17:33:02.911 INFO:teuthology.nuke:Installed packages removed.
2016-01-20T17:33:02.979 INFO:teuthology.lock:unlocked mira078.front.sepia.ceph.com
2016-01-20T17:34:01.115 DEBUG:teuthology.orchestra.remote:timed out
2016-01-20T17:34:01.115 DEBUG:teuthology.misc:waited 182.029175997
2016-01-20T17:34:02.116 INFO:teuthology.misc:trying to connect to ubuntu@mira115.front.sepia.ceph.com
2016-01-20T17:34:02.117 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'mira115.front.sepia.ceph.com', 'timeout': 60}
2016-01-20T17:34:16.075 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
2016-01-20T17:35:02.119 DEBUG:teuthology.orchestra.remote:timed out
2016-01-20T17:35:02.120 DEBUG:teuthology.misc:waited 243.034048796
2016-01-20T17:35:03.121 INFO:teuthology.misc:trying to connect to ubuntu@mira115.front.sepia.ceph.com
2016-01-20T17:35:03.122 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'mira115.front.sepia.ceph.com', 'timeout': 60}
2016-01-20T17:44:27.403 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
There are three issues here:
1. Why did the job deadlock there?
2. teuthology-kill/-nuke threw an error instead of gracefully skipping that node
3. teuthology-worker also deadlocked!
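For issue 2, one way to picture "gracefully skipping" is to filter the target list by lock owner before nuking, rather than asserting on ownership partway through. The following is only a minimal sketch under stated assumptions: `partition_targets` and the `lock_owners` mapping are hypothetical names for illustration, not teuthology functions, and the owner shown for mira115 is assumed (the log above only reports the owner of mira078).

```python
import logging

log = logging.getLogger(__name__)


def partition_targets(targets, lock_owners, expected_owner):
    """Split targets into machines still locked by expected_owner and the rest.

    Hypothetical helper, not part of teuthology.
    targets: iterable of machine names
    lock_owners: dict of machine name -> current lock owner (None if unlocked)
    """
    ours, skipped = [], []
    for name in targets:
        owner = lock_owners.get(name)
        if owner == expected_owner:
            ours.append(name)
        else:
            # Log and move on instead of raising, so one reassigned
            # machine does not abort cleanup of the others.
            log.warning("Skipping %s: locked by %s, not %s",
                        name, owner, expected_owner)
            skipped.append(name)
    return ours, skipped


lock_owners = {
    # Owner taken from the AssertionError in the log above.
    "mira078": "scheduled_sage@teuthology",
    # Assumed for illustration; the log does not state mira115's owner.
    "mira115": "scheduled_gregf@teuthology",
}
to_nuke, skipped = partition_targets(
    ["mira078", "mira115"], lock_owners, "scheduled_gregf@teuthology"
)
```

With this shape, a machine that was unlocked and picked up by another run would be reported and left alone, while the remaining targets would still be cleaned up.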
Updated by Yuri Weinstein over 8 years ago
I can speculate on #1 - because mira115 had bad drives
Updated by Zack Cerza almost 8 years ago
- Status changed from New to Resolved
This should actually be resolved