Feature #7008
pulpito: show pass/fail stats by machine
Status: Closed
Done: 100%
Updated by Sage Weil over 10 years ago
Not infrequently we have a problem on a test machine that makes every job that touches the machine fail. It is tedious to identify those cases currently, but would be trivial to spot if we could easily see the pass/fail status of the last 24 hours (or whatever) by machine.
Updated by Zack Cerza over 10 years ago
Just getting around to thinking hard about this, since I opted to implement filtering by machine type first. Currently we store the targets dict for each job but we can't really query that. To properly track job results by individual machines, a schema change is in order.
Updated by Zack Cerza about 10 years ago
So, paddles queries currently all return lists of runs; from there it is possible to list jobs within runs. This feature will require the ability to query for jobs irrespective of their runs.
Back to the "how we store targets" challenge, the possibilities that I see are:
- First, we create a new column in the jobs table: target_nodes or something similar - a simple space-delimited list of hostnames. Queries would be easy using the LIKE operator.
- Second, we create a new column in the jobs table using the PostgreSQL-specific array feature - again, a list of hostnames. I don't actually know how easy this is to query. I'm not sure it offers anything more compelling than the first option, and it would likely reduce our database compatibility.
- Third, we create a whole separate targets table, where each row represents a single target set used by exactly one job. There we could easily represent jobs running on mixed combinations of machine types, OSes, etc. Clearly this approach would be far more complex, however.
I'm currently leaning towards the first option; the third could of course be bolted on later if we end up needing those features.
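The first option could be sketched roughly like this (model and column names here are illustrative, not paddles' actual schema):

```python
# Sketch of option one: a space-delimited target_nodes column on a
# simplified jobs table, queried with LIKE. Names are hypothetical.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = 'jobs'
    id = Column(Integer, primary_key=True)
    status = Column(String)
    # Space-delimited list of hostnames, e.g. "plana32 plana45"
    target_nodes = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Job(status='pass', target_nodes='plana32 plana45'),
        Job(status='fail', target_nodes='plana32'),
        Job(status='pass', target_nodes='plana17'),
    ])
    session.commit()
    # LIKE-based lookup of every job that touched plana32
    count = session.query(Job).filter(
        Job.target_nodes.like('%plana32%')).count()
    print(count)
```

The obvious caveat with LIKE is that it can't use a plain index for a leading-wildcard pattern, so it scans the whole table.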
Updated by Zack Cerza about 10 years ago
After some searching I found an operator that can actually query the content of the custom type we're using to store JSON objects:
>>> Job.query.filter(Job.targets.contains('ubuntu@plana32.front.sepia.ceph.com')).count()
2336L
>>> Job.query.filter(Job.targets.contains('plana32.front.sepia.ceph.com')).count()
0L
>>> Job.query.filter(Job.targets.contains('%@plana32.front.sepia.ceph.com')).count()
2336L
Unfortunately it's not fast and it won't let us run queries like "show me the machine with the most failures" with any sort of acceptable speed.
I may have to create a separate targets table - a larger undertaking, but one that is probably required for the queue functionality anyway.
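A separate targets table would make aggregate questions like "which machine has the most failures" an ordinary indexed join plus GROUP BY. A minimal sketch, with all model and column names assumed rather than taken from paddles:

```python
# Sketch of a separate targets table joined to jobs. All names are
# illustrative; paddles' real schema may differ.
from sqlalchemy import (Column, ForeignKey, Integer, String,
                        create_engine, func)
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Job(Base):
    __tablename__ = 'jobs'
    id = Column(Integer, primary_key=True)
    status = Column(String)
    targets = relationship('Target', backref='job')

class Target(Base):
    __tablename__ = 'targets'
    id = Column(Integer, primary_key=True)
    hostname = Column(String, index=True)
    job_id = Column(Integer, ForeignKey('jobs.id'))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    for status, hosts in [('fail', ['plana32']),
                          ('fail', ['plana32', 'plana45']),
                          ('pass', ['plana45'])]:
        session.add(Job(status=status,
                        targets=[Target(hostname=h) for h in hosts]))
    session.commit()
    # "Machine with the most failures": an indexed join + GROUP BY,
    # which the JSON-blob contains() approach cannot do quickly.
    worst = (session.query(Target.hostname, func.count(Job.id))
             .join(Job)
             .filter(Job.status == 'fail')
             .group_by(Target.hostname)
             .order_by(func.count(Job.id).desc())
             .first())
    print(worst)
```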
Updated by Zack Cerza about 10 years ago
Getting really close:
? curl "http://localhost:8080/nodes/plana72.front.sepia.ceph.com/job_stats"
{"fail": 1, "unknown": 1, "running": 0, "dead": 0, "pass": 12}
(local development db; data not complete)
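One plausible way to assemble a payload shaped like the one above (a hypothetical sketch, not paddles' actual implementation): count jobs per status and fill in zeros for statuses with no jobs.

```python
# Hypothetical sketch of building a job_stats payload for a node:
# count each job status, emitting 0 for statuses that never occurred.
from collections import Counter

KNOWN_STATUSES = ('pass', 'fail', 'dead', 'running', 'unknown')

def job_stats(statuses):
    """Given the status of each job that ran on a node, return a dict
    like {"fail": 1, "unknown": 1, "running": 0, "dead": 0, "pass": 12}."""
    counts = Counter(statuses)
    return {s: counts[s] for s in KNOWN_STATUSES}

stats = job_stats(['pass'] * 12 + ['fail', 'unknown'])
print(stats)
```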
Updated by Zack Cerza about 10 years ago
Finished #7789 - will work on the Pulpito side at the next chance I get.
Updated by Ian Colle about 10 years ago
- Target version changed from sprint3 to sprint4
Updated by Zack Cerza about 10 years ago
- Status changed from New to In Progress
Updated by Ian Colle about 10 years ago
- Story points set to 1.0
Updated by Ian Colle about 10 years ago
- Story points changed from 1.0 to 8.0
Updated by Zack Cerza about 10 years ago
Almost done:
http://pulpito.front.sepia.ceph.com/stats/nodes?machine_type=vps
Things that still need work:
- Link to the view in pulpito's header (may need to reorganize the header for this)
- UI for selecting machine type to filter by
Updated by Zack Cerza about 10 years ago
Right now I'm trying to figure out why I'm getting 502 Bad Gateway errors when accessing /stats/nodes/ without specifying a machine type. The query is slow; I know that. But it shouldn't be failing.
I've tried tweaking various timeouts in the Apache configuration, but when I finally checked whether paddles itself was timing out, it was. I'm not sure why that would be happening, and I haven't yet found a reference to timeouts in its documentation.
I'd really like to get this fixed. Or bite the bullet and implement paging (which needs to happen anyway).
Updated by Zack Cerza about 10 years ago
Looks like it is probably gunicorn:
http://docs.gunicorn.org/en/latest/settings.html?highlight=timeout
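For reference, gunicorn's worker timeout defaults to 30 seconds; a worker that exceeds it gets killed mid-request, which the proxy in front of it reports as a 502. A config fragment along these lines (values illustrative, not the actual paddles deployment config) would raise it:

```python
# gunicorn_config.py -- illustrative only; the real deployment config
# may differ. 'timeout' is gunicorn's worker timeout in seconds
# (default 30); workers exceeding it are killed and restarted, which
# surfaces upstream as 502 Bad Gateway.
timeout = 120
```

Raising the timeout only papers over the slow query, of course, which is why paging is the better long-term fix.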
Updated by Zack Cerza about 10 years ago
Rewrote the query to use an entirely different approach, and also added a new argument controlling how far back in history to look. The default is 14 days. It's much faster now. I just need to figure out how to expose the new feature in the UI.
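The lookback argument amounts to excluding jobs older than a cutoff before aggregating. A minimal sketch, with the field name and function assumed for illustration:

```python
# Hypothetical sketch of the lookback window: drop jobs whose
# 'posted' timestamp is older than since_days before aggregating.
# Field/function names are illustrative, not paddles' actual API.
from datetime import datetime, timedelta

def filter_recent(jobs, since_days=14, now=None):
    """Keep only jobs posted within the last since_days days."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=since_days)
    return [j for j in jobs if j['posted'] >= cutoff]

now = datetime(2014, 4, 1)
jobs = [
    {'posted': now - timedelta(days=2), 'status': 'pass'},
    {'posted': now - timedelta(days=30), 'status': 'fail'},
]
recent = filter_recent(jobs, since_days=14, now=now)
print(len(recent))
```

Doing the same filtering in SQL (a WHERE clause on an indexed timestamp column) is what makes the stats query cheap regardless of total table size.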
Updated by Zack Cerza about 10 years ago
Almost done - going to write a test or two before submitting a PR.
Updated by Zack Cerza about 10 years ago
- Status changed from In Progress to Fix Under Review
- Assignee changed from Zack Cerza to Alfredo Deza
Updated by Zack Cerza about 10 years ago
- Status changed from Fix Under Review to Resolved
- Assignee set to Zack Cerza