Feature #8356
open
tests are being marked as hung despite never actually getting machines locked
Added by Greg Farnum almost 10 years ago.
Updated almost 10 years ago.
Description
teuthology-2014-05-12_23:02:18-knfs-master-testing-basic-plana/251991, for instance. I've seen this a couple other times in emails this week and will update the ticket with any more that I see.
I'm not sure if the root problem here is that the test is never getting its machines (do we just have too many runs racing against each other for unlocked machines?), or that it's getting declared as hung too early, or what.
- Description updated
Hmm, it may have only been marked as hung after somebody killed the run yesterday. But it wasn't killed until 6 hours after its peers finished, so there's definitely some kind of issue here.
So the way the emails have always worked is pretty wonky.
When a test run containing X number of jobs is scheduled, X+1 are actually created in beanstalkd. The last one doesn't contain any tests and has the flag last_in_suite = True. When the worker picks up that job, it kicks off a teuthology-results process which:
- Looks for subdirs of the archive dir to see which jobs exist
- For each subdir, looks for a summary.yaml inside it.
- If that is not present, it assumes the job is running
- If there are running jobs, waits a certain amount of time (default 6h)
- Decides that any running job is hung
- Sends the email
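For illustration, the steps above can be sketched roughly as follows. This is not the actual teuthology-results code; the function names, the polling interval, and the injectable sleep/clock parameters are all hypothetical. Only the summary.yaml check and the 6h default come from the description above. It does show why a job that never got its machines ends up reported as hung: it never writes a summary.yaml, so it looks identical to a job that is still running.

```python
import time
from pathlib import Path


def find_unfinished_jobs(archive_dir):
    """Return the names of job subdirs that have no summary.yaml yet."""
    return sorted(
        p.name
        for p in Path(archive_dir).iterdir()
        if p.is_dir() and not (p / "summary.yaml").exists()
    )


def wait_for_jobs(archive_dir, timeout=6 * 60 * 60, poll_interval=60,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll until every job has a summary.yaml or the timeout expires.

    Returns the jobs that are still unfinished. The caller would then
    declare those "hung" and send the email, whether or not they ever
    actually started (hypothetical helper, not the real implementation).
    """
    deadline = clock() + timeout
    unfinished = find_unfinished_jobs(archive_dir)
    while unfinished and clock() < deadline:
        sleep(poll_interval)
        unfinished = find_unfinished_jobs(archive_dir)
    return unfinished
```

Under these assumptions, a job still waiting in the queue for machines and a job that is genuinely stuck are indistinguishable to the results process, since the only signal it uses is the absence of summary.yaml.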
I'd be happy to put some time into making that less, uh, crazy.
- Tracker changed from Bug to Feature
- Story points set to 4.0
Marking this as a feature so I can budget time to work on it. I have not decided how exactly I'll do it, though.