tests are being marked as hung despite never actually getting machines locked
teuthology-2014-05-12_23:02:18-knfs-master-testing-basic-plana/251991, for instance. I've seen this a couple other times in emails this week and will update the ticket with any more that I see.
I'm not sure whether the root problem here is that the test never gets its machines (do we just have too many runs racing against each other for unlocked machines?), that it's being declared hung too early, or something else.
#3 Updated by Zack Cerza over 5 years ago
So the way the emails have always worked is pretty wonky.
When a test run containing X number of jobs is scheduled, X+1 are actually created in beanstalkd. The last one doesn't contain any tests and has the flag last_in_suite = True. When the worker picks up that job, it kicks off a teuthology-results process which:
- Looks for subdirs of the archive dir to see which jobs exist
- For each subdir, looks for a
- If that is not present, it assumes the job is running
- If there are running jobs, waits a certain amount of time (default 6h)
- Decides that any job still running after that wait is hung
- Sends the email
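
For illustration, the steps above could be sketched roughly like this. This is not the actual teuthology-results code: the function names, the poll interval, and the marker_name parameter are all made up for the example (the name of the file it looks for isn't given above, so it's left as an argument), and the email step is omitted.

```python
import os
import time


def find_running_jobs(archive_dir, marker_name):
    """Return job subdirs of archive_dir that lack the completion marker file.

    marker_name is a hypothetical parameter; the real file name is not
    stated in this ticket.
    """
    running = []
    for entry in sorted(os.listdir(archive_dir)):
        job_dir = os.path.join(archive_dir, entry)
        if not os.path.isdir(job_dir):
            continue
        # No marker file: the job is assumed to still be running.
        if not os.path.exists(os.path.join(job_dir, marker_name)):
            running.append(entry)
    return running


def collect_results(archive_dir, marker_name, timeout=6 * 60 * 60, poll=10,
                    sleep=time.sleep, clock=time.monotonic):
    """Wait up to `timeout` seconds (default 6h), then return presumed-hung jobs."""
    deadline = clock() + timeout
    running = find_running_jobs(archive_dir, marker_name)
    while running and clock() < deadline:
        sleep(poll)
        running = find_running_jobs(archive_dir, marker_name)
    # Anything still without a marker at this point is reported as hung.
    return running
```

In a sketch like this, the only signal is "subdir exists but the marker file doesn't", so a job that has a subdir but is still waiting for machines looks exactly like one that is running, and will be reported as hung once the timeout expires.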
I'd be happy to put some time into making that less, uh, crazy.