Bug #14518

closed

Bad interaction between deadlocking teuthology processes

Added by Yuri Weinstein over 8 years ago. Updated almost 8 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Sometimes (maybe after disk failures?) pulpito still shows runs that have already been killed.

As of today, see "greg-fs-speculative-119" at http://pulpito.ceph.com/?status=running
This run is actually dead.

I'm not sure what a possible solution would be here. Maybe a way in the pulpito UI to clean those up, Zack?

Actions #1

Updated by Zack Cerza over 8 years ago

  • Project changed from pulpito to teuthology
  • Subject changed from pulpito shows killed runs to Bad interaction between deadlocking teuthology processes
  • Priority changed from Normal to High

Not pulpito. The worker broke:

2016-01-20T16:56:40.196 INFO:teuthology.worker:Running job 34218
2016-01-20T16:56:40.220 INFO:teuthology.worker:Job archive: /var/lib/teuthworker/archive/gregf-2016-01-19_23:14:25-fs-greg-fs-speculative-119---basic-mira/34218
2016-01-20T16:56:40.221 INFO:teuthology.worker:Job PID: 18877
2016-01-20T16:56:40.221 INFO:teuthology.worker:Running with watchdog
2016-01-20T16:58:40.222 DEBUG:teuthology.worker:Worker log: /var/lib/teuthworker/archive/worker_logs/worker.mira.23940
2016-01-22T04:57:56.450 WARNING:teuthology.worker:Job ran longer than 129600s. Killing...
2016-01-22T04:57:56.822 INFO:teuthology.kill:Killing Pids: set([18877])
2016-01-22T04:57:56.832 INFO:teuthology.kill:Nuking machines: ['mira078', 'mira115']
2016-01-22T04:57:57.238 INFO:teuthology.kill:2016-01-22 04:57:57,237.237 INFO:teuthology.nuke:targets:
2016-01-22T04:57:57.239 INFO:teuthology.kill:  ubuntu@mira078.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbfD/mcGMuU/3lVzxg2tF82DDXuCUemKjVt0TBXMaRA+qofkwh/lJRx/9RDlwMf/ZueSFWDdVa4xLGhXqNjweJo1N1nuvlqdAua4HReEVnkQSZI2Ox6uZIyIo23ZVrjysItgTkZT05I61lIp3nXai7xJid9bB4rba3ru+Js/0xyZYk3vCrQfIM85HVFRXn1eOODKFKTQYZvgyr5i9ar+nTAL3Wy4hFpBdSJg5oZrmXmqYjbE2huFQW1RYKBB1CEuQE2fL9F6YwMnTX2XcRzaaDYkW/TLcabcqeOGmPAyOkXWp1g59l5iCs4dtyiwjuYXOBSOiiosQWAvGRb/tJXWkZ
2016-01-22T04:57:57.239 INFO:teuthology.kill:  ubuntu@mira115.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOEBlMt8gBzs9rDmsi9oBMTrAouaeAFqQjHmlOaIPqcKTN5pJvtR70pAyoimyNiMuBOPRVqycJIL+jXdhcZlNz+uB96X6+MFhsjV5WzC+UhWEbUra95m12hm/LBp+QsudtkykngynMuspMfM7Ullod5RUMHvTdsNQVQxUIsdwJmsPBpBLKHN0//T9Cokmv6yvT+qQk5mRkBCg3MocS8CPEfXSygdj1uKX4V8XwZJ+P627vjNJP5KtJi0f6nU1fmlEvHP6OSz/wN0Ks27JYo8YOaQaF/EQBMYQWUF/MzaM3HqCxXnpqpbAD/60bUBfayTbJe8w7YFuORCgOVoPMtNt1
2016-01-22T04:57:57.284 INFO:teuthology.kill:2016-01-22 04:57:57,283.283 INFO:teuthology.nuke:checking console status of mira115.ipmi.sepia.ceph.com
2016-01-22T04:57:57.292 INFO:teuthology.kill:2016-01-22 04:57:57,291.291 INFO:teuthology.nuke:checking console status of mira078.ipmi.sepia.ceph.com
2016-01-22T04:57:57.591 INFO:teuthology.kill:2016-01-22 04:57:57,590.590 INFO:teuthology.nuke:console ready on mira115.ipmi.sepia.ceph.com
2016-01-22T04:57:57.591 INFO:teuthology.kill:2016-01-22 04:57:57,591.591 INFO:teuthology.task.internal:Checking locks...
2016-01-22T04:57:57.603 INFO:teuthology.kill:2016-01-22 04:57:57,603.603 INFO:teuthology.nuke:console ready on mira078.ipmi.sepia.ceph.com
2016-01-22T04:57:57.604 INFO:teuthology.kill:2016-01-22 04:57:57,603.603 INFO:teuthology.task.internal:Checking locks...
2016-01-22T04:57:57.625 INFO:teuthology.kill:2016-01-22 04:57:57,625.625 INFO:teuthology.task.internal:Opening connections...
2016-01-22T04:57:57.639 INFO:teuthology.kill:2016-01-22 04:57:57,638.638 ERROR:teuthology.nuke:Could not nuke {'ubuntu@mira078.front.sepia.ceph.com': 'ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDbfD/mcGMuU/3lVzxg2tF82DDXuCUemKjVt0TBXMaRA+qofkwh/lJRx/9RDlwMf/ZueSFWDdVa4xLGhXqNjweJo1N1nuvlqdAua4HReEVnkQSZI2Ox6uZIyIo23ZVrjysItgTkZT05I61lIp3nXai7xJid9bB4rba3ru+Js/0xyZYk3vCrQfIM85HVFRXn1eOODKFKTQYZvgyr5i9ar+nTAL3Wy4hFpBdSJg5oZrmXmqYjbE2huFQW1RYKBB1CEuQE2fL9F6YwMnTX2XcRzaaDYkW/TLcabcqeOGmPAyOkXWp1g59l5iCs4dtyiwjuYXOBSOiiosQWAvGRb/tJXWkZ'}
2016-01-22T04:57:57.640 INFO:teuthology.kill:Traceback (most recent call last):
2016-01-22T04:57:57.640 INFO:teuthology.kill:  File "/home/teuthworker/src/teuthology_master/teuthology/nuke.py", line 613, in nuke_one
2016-01-22T04:57:57.640 INFO:teuthology.kill:    nuke_helper(ctx, should_unlock)
2016-01-22T04:57:57.641 INFO:teuthology.kill:  File "/home/teuthworker/src/teuthology_master/teuthology/nuke.py", line 663, in nuke_helper
2016-01-22T04:57:57.641 INFO:teuthology.kill:    check_lock(ctx, None, check_up=False)
2016-01-22T04:57:57.641 INFO:teuthology.kill:  File "/home/teuthworker/src/teuthology_master/teuthology/task/internal.py", line 227, in check_lock
2016-01-22T04:57:57.641 INFO:teuthology.kill:    owner=ctx.owner,
2016-01-22T04:57:57.642 INFO:teuthology.kill:AssertionError: machine ubuntu@mira078.front.sepia.ceph.com is locked by scheduled_sage@teuthology, not scheduled_gregf@teuthology

But the job had unlocked the node before deadlocking (somehow):

2016-01-20T17:33:02.802 INFO:teuthology.task.install:Purging /var/lib/ceph on ubuntu@mira078.front.sepia.ceph.com
2016-01-20T17:33:02.803 INFO:teuthology.orchestra.run.mira078:Running: "sudo rm -rf --one-file-system -- /var/lib/ceph || true ; test -d /var/lib/ceph && sudo find /var/lib/ceph -mindepth 1 -maxdepth 2 -type d -exec umount '{}' ';' ; sudo rm -rf --one-file-system -- /var/lib/ceph" 
2016-01-20T17:33:02.911 DEBUG:teuthology.parallel:result is None
2016-01-20T17:33:02.911 INFO:teuthology.nuke:Installed packages removed.
2016-01-20T17:33:02.979 INFO:teuthology.lock:unlocked mira078.front.sepia.ceph.com
2016-01-20T17:34:01.115 DEBUG:teuthology.orchestra.remote:timed out
2016-01-20T17:34:01.115 DEBUG:teuthology.misc:waited 182.029175997
2016-01-20T17:34:02.116 INFO:teuthology.misc:trying to connect to ubuntu@mira115.front.sepia.ceph.com
2016-01-20T17:34:02.117 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'mira115.front.sepia.ceph.com', 'timeout': 60}
2016-01-20T17:34:16.075 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)
2016-01-20T17:35:02.119 DEBUG:teuthology.orchestra.remote:timed out
2016-01-20T17:35:02.120 DEBUG:teuthology.misc:waited 243.034048796
2016-01-20T17:35:03.121 INFO:teuthology.misc:trying to connect to ubuntu@mira115.front.sepia.ceph.com
2016-01-20T17:35:03.122 DEBUG:teuthology.orchestra.connection:{'username': 'ubuntu', 'hostname': 'mira115.front.sepia.ceph.com', 'timeout': 60}
2016-01-20T17:44:27.403 ERROR:paramiko.transport:Socket exception: Connection reset by peer (104)

There are three issues here:
1. Why did the job deadlock there?
2. teuthology-kill/teuthology-nuke threw an error instead of gracefully skipping the node that was no longer locked by this job's owner.
3. teuthology-worker also deadlocked!
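
For issue 2, a rough sketch of what "gracefully skipping" could look like: check lock ownership up front and only nuke the machines that still belong to the job's owner, logging the rest instead of asserting. This is not the actual teuthology code. `partition_targets_by_owner` and the `lock_owners` mapping are hypothetical names, and the owner shown for mira115 is assumed for illustration (the logs above only show who holds mira078).

```python
import logging

log = logging.getLogger(__name__)


def partition_targets_by_owner(targets, lock_owners, expected_owner):
    """Split targets into (ours, skipped) based on who holds the lock.

    Hypothetical helper: lock_owners stands in for whatever the lock
    server reports (machine name -> owner, or missing if unlocked).
    """
    ours, skipped = [], []
    for name in targets:
        owner = lock_owners.get(name)
        if owner == expected_owner:
            ours.append(name)
        else:
            # Warn and move on instead of raising, so one re-locked
            # node does not abort cleanup of the remaining nodes.
            log.warning(
                "Skipping %s: locked by %s, not %s",
                name, owner, expected_owner,
            )
            skipped.append(name)
    return ours, skipped


# mira078's owner is taken from the AssertionError above;
# mira115's owner is an assumption for the sake of the example.
ours, skipped = partition_targets_by_owner(
    ["mira078", "mira115"],
    {
        "mira078": "scheduled_sage@teuthology",
        "mira115": "scheduled_gregf@teuthology",
    },
    "scheduled_gregf@teuthology",
)
print("nuke:", ours, "skip:", skipped)
```

With something like this, the kill path would only hand `ours` to the nuke step, and the job could still be marked dead even when some of its former nodes now belong to another run.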

Actions #2

Updated by Yuri Weinstein over 8 years ago

I can speculate on #1: mira115 had bad drives.

Actions #3

Updated by Zack Cerza about 8 years ago

  • Assignee set to Zack Cerza

Actions #4

Updated by Zack Cerza almost 8 years ago

  • Status changed from New to Resolved

This should actually be resolved
