Feature #7008
pulpito: show pass/fail stats by machine
Status: Closed
Done: 100%
Updated by Sage Weil over 10 years ago
Not infrequently we have a problem on a test machine that makes every job that touches the machine fail. It is tedious to identify those cases currently, but would be trivial to spot if we could easily see the pass/fail status of the last 24 hours (or whatever) by machine.
Updated by Zack Cerza over 10 years ago
Just getting around to thinking hard about this, since I opted to implement filtering by machine type first. Currently we store the targets dict for each job but we can't really query that. To properly track job results by individual machines, a schema change is in order.
Updated by Zack Cerza about 10 years ago
So, paddles queries currently all return lists of runs; from there it is possible to list jobs within runs. This feature will require the ability to query for jobs irrespective of their runs.
Back to the "how we store targets" challenge, the possibilities that I see are:
- First, we create a new column in the jobs table: target_nodes or something similar - a simple space-delimited list of hostnames. Queries would be easy using the LIKE operator.
- Second, we create a new column in the jobs table using the PostgreSQL-specific array feature - again, a list of hostnames. I don't actually know how easy this is to query. I'm not sure it offers anything more compelling than the first option, and it would likely reduce our database compatibility.
- Third, we create a whole separate targets table, where each row represents a single target set used by exactly one job. There we could easily represent jobs running on mixed combinations of machine types, OSes, etc. Clearly this approach would be far more complex, however.
I'm currently leaning towards the first option; the third could of course be bolted on later if we end up needing those features.
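The first option could be sketched roughly like this (model and column names here are illustrative, not paddles' actual schema):

```python
# Sketch of option one: a space-delimited target_nodes column on a
# simplified jobs table, queried with LIKE. Names are hypothetical.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Job(Base):
    __tablename__ = 'jobs'
    id = Column(Integer, primary_key=True)
    status = Column(String)
    # Space-delimited list of hostnames, e.g. "plana32 plana45"
    target_nodes = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Job(status='pass', target_nodes='plana32 plana45'),
        Job(status='fail', target_nodes='plana32'),
        Job(status='pass', target_nodes='plana17'),
    ])
    session.commit()
    # LIKE-based lookup of every job that touched plana32
    count = session.query(Job).filter(
        Job.target_nodes.like('%plana32%')).count()
    print(count)
```

The obvious caveat with LIKE is that it can't use a plain index for a leading-wildcard pattern, so it scans the whole table.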
Updated by Zack Cerza about 10 years ago
After some searching I found an operator that can actually query the content of the custom type we're using to store JSON objects:
>>> Job.query.filter(Job.targets.contains('ubuntu@plana32.front.sepia.ceph.com')).count()
2336L
>>> Job.query.filter(Job.targets.contains('plana32.front.sepia.ceph.com')).count()
0L
>>> Job.query.filter(Job.targets.contains('%@plana32.front.sepia.ceph.com')).count()
2336L
Unfortunately it's not fast and it won't let us run queries like "show me the machine with the most failures" with any sort of acceptable speed.
I may have to create a separate targets table - a larger undertaking, but one that is probably required for the queue functionality anyway.
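A separate targets table would make aggregate questions like "which machine has the most failures" an ordinary indexed join plus GROUP BY. A minimal sketch, with all model and column names assumed rather than taken from paddles:

```python
# Sketch of a separate targets table joined to jobs. All names are
# illustrative; paddles' real schema may differ.
from sqlalchemy import (Column, ForeignKey, Integer, String,
                        create_engine, func)
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Job(Base):
    __tablename__ = 'jobs'
    id = Column(Integer, primary_key=True)
    status = Column(String)
    targets = relationship('Target', backref='job')

class Target(Base):
    __tablename__ = 'targets'
    id = Column(Integer, primary_key=True)
    hostname = Column(String, index=True)
    job_id = Column(Integer, ForeignKey('jobs.id'))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    for status, hosts in [('fail', ['plana32']),
                          ('fail', ['plana32', 'plana45']),
                          ('pass', ['plana45'])]:
        session.add(Job(status=status,
                        targets=[Target(hostname=h) for h in hosts]))
    session.commit()
    # "Machine with the most failures": an indexed join + GROUP BY,
    # which the JSON-blob contains() approach cannot do quickly.
    worst = (session.query(Target.hostname, func.count(Job.id))
             .join(Job)
             .filter(Job.status == 'fail')
             .group_by(Target.hostname)
             .order_by(func.count(Job.id).desc())
             .first())
    print(worst)
```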
Updated by Zack Cerza about 10 years ago
Getting really close:
? curl "http://localhost:8080/nodes/plana72.front.sepia.ceph.com/job_stats"
{"fail": 1, "unknown": 1, "running": 0, "dead": 0, "pass": 12}
(local development db; data not complete)
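One plausible way to assemble a payload shaped like the one above (a hypothetical sketch, not paddles' actual implementation): count jobs per status and fill in zeros for statuses with no jobs.

```python
# Hypothetical sketch of building a job_stats payload for a node:
# count each job status, emitting 0 for statuses that never occurred.
from collections import Counter

KNOWN_STATUSES = ('pass', 'fail', 'dead', 'running', 'unknown')

def job_stats(statuses):
    """Given the status of each job that ran on a node, return a dict
    like {"fail": 1, "unknown": 1, "running": 0, "dead": 0, "pass": 12}."""
    counts = Counter(statuses)
    return {s: counts[s] for s in KNOWN_STATUSES}

stats = job_stats(['pass'] * 12 + ['fail', 'unknown'])
print(stats)
```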
Updated by Zack Cerza about 10 years ago
Finished #7789 - will work on the Pulpito side at the next chance I get.
Updated by Ian Colle about 10 years ago
- Target version changed from sprint3 to sprint4
Updated by Zack Cerza about 10 years ago
- Status changed from New to In Progress
Updated by Ian Colle about 10 years ago
- Story points set to 1.0
Updated by Ian Colle about 10 years ago
- Story points changed from 1.0 to 8.0
Updated by Zack Cerza about 10 years ago
Almost done:
http://pulpito.front.sepia.ceph.com/stats/nodes?machine_type=vps
Things that still need work:
- Link to the view in pulpito's header (may need to reorganize the header for this)
- UI for selecting machine type to filter by
Updated by Zack Cerza about 10 years ago
Right now I'm trying to figure out why I'm getting 502 Bad Gateway errors when accessing /stats/nodes/ without specifying a machine type. The query is slow; I know that. But it shouldn't be failing.
I've tried tweaking various timeouts in the Apache configuration, but when I finally checked whether paddles itself was timing out, it was. I'm not sure why that would be happening, and I haven't yet found a reference to timeouts in its documentation.
I'd really like to get this fixed. Or bite the bullet and implement paging (which needs to happen anyway).
Updated by Zack Cerza about 10 years ago
Looks like it is probably gunicorn:
http://docs.gunicorn.org/en/latest/settings.html?highlight=timeout
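For reference, gunicorn's worker timeout defaults to 30 seconds; a worker that exceeds it gets killed mid-request, which the proxy in front of it reports as a 502. A config fragment along these lines (values illustrative, not the actual paddles deployment config) would raise it:

```python
# gunicorn_config.py -- illustrative only; the real deployment config
# may differ. 'timeout' is gunicorn's worker timeout in seconds
# (default 30); workers exceeding it are killed and restarted, which
# surfaces upstream as 502 Bad Gateway.
timeout = 120
```

Raising the timeout only papers over the slow query, of course, which is why paging is the better long-term fix.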
Updated by Zack Cerza about 10 years ago
Rewrote the query to use an entirely different approach, and also added a new argument controlling how far back in history to look. The default is 14 days. It's much faster now. I just need to figure out how to expose the new feature in the UI.
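The lookback argument amounts to excluding jobs older than a cutoff before aggregating. A minimal sketch, with the field name and function assumed for illustration:

```python
# Hypothetical sketch of the lookback window: drop jobs whose
# 'posted' timestamp is older than since_days before aggregating.
# Field/function names are illustrative, not paddles' actual API.
from datetime import datetime, timedelta

def filter_recent(jobs, since_days=14, now=None):
    """Keep only jobs posted within the last since_days days."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=since_days)
    return [j for j in jobs if j['posted'] >= cutoff]

now = datetime(2014, 4, 1)
jobs = [
    {'posted': now - timedelta(days=2), 'status': 'pass'},
    {'posted': now - timedelta(days=30), 'status': 'fail'},
]
recent = filter_recent(jobs, since_days=14, now=now)
print(len(recent))
```

Doing the same filtering in SQL (a WHERE clause on an indexed timestamp column) is what makes the stats query cheap regardless of total table size.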
Updated by Zack Cerza about 10 years ago
Almost done - going to write a test or two before submitting a PR.
Updated by Zack Cerza about 10 years ago
- Status changed from In Progress to Fix Under Review
- Assignee changed from Zack Cerza to Alfredo Deza
Updated by Zack Cerza about 10 years ago
- Status changed from Fix Under Review to Resolved
- Assignee set to Zack Cerza