Feature #7008

closed

pulpito: show pass/fail stats by machine

Added by Sage Weil over 10 years ago. Updated about 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
% Done: 100%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:

Subtasks: 1 (0 open, 1 closed)

Subtask #7789: paddles: add table/class for nodes | Resolved | Zack Cerza | 03/19/2014

Actions #1

Updated by Sage Weil over 10 years ago

Not infrequently we have a problem on a test machine that makes every job that touches the machine fail. It is tedious to identify those cases currently, but would be trivial to spot if we could easily see the pass/fail status of the last 24 hours (or whatever) by machine.

Actions #2

Updated by Ian Colle over 10 years ago

  • Assignee set to Zack Cerza
Actions #3

Updated by Zack Cerza over 10 years ago

Just getting around to thinking hard about this, since I opted to implement filtering by machine type first. Currently we store the targets dict for each job but we can't really query that. To properly track job results by individual machines, a schema change is in order.

Actions #4

Updated by Ian Colle over 10 years ago

  • Target version deleted (v0.76)
Actions #5

Updated by Zack Cerza about 10 years ago

So, paddles queries currently all return lists of runs; from there it is possible to list jobs within runs. This feature will require the ability to query for jobs irrespective of their runs.

Back to the "how we store targets" challenge, the possibilities that I see are:
  • First, we create a new column in the jobs table: target_nodes or something similar - a simple space-delimited list of hostnames. Queries would be easy using the LIKE operator.
  • Second, we create a new column in the jobs table using the PostgreSQL-specific array feature - again, a list of hostnames. I don't actually know how easy this is to query. I'm not sure if it offers anything more compelling than the first option, and would likely reduce our database compatibility.
  • Third, we create a whole separate targets table, where each row represents a single target set used by exactly one job. There we could easily represent jobs running on mixed combinations of machine types, OSes, etc. Clearly this approach would be far more complex, however.

I'm currently leaning towards the first option; the third could of course be bolted on later if we end up needing those features.
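
For illustration, the first option could look roughly like the sketch below. It uses an in-memory SQLite database rather than paddles' real schema; the `jobs` layout, the short hostnames and the `jobs_on_node` helper are all assumptions made up for the example. One thing to watch with LIKE on a delimited list: a bare substring match would let `plana3` match `plana32`, so the sketch pads both the column and the pattern with spaces.

```python
import sqlite3


def make_db():
    """Build a throwaway database with a hypothetical jobs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, status TEXT, target_nodes TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs (status, target_nodes) VALUES (?, ?)",
        [
            ("pass", "plana32 plana41"),  # job 1
            ("fail", "plana3 plana32"),   # job 2
            ("fail", "plana41"),          # job 3
        ],
    )
    return conn


def jobs_on_node(conn, hostname):
    """Return ids of jobs whose space-delimited target list names hostname."""
    # Pad the column and the pattern with spaces so that only whole
    # hostnames match ('plana3' must not match 'plana32').
    rows = conn.execute(
        "SELECT id FROM jobs "
        "WHERE ' ' || target_nodes || ' ' LIKE ? ORDER BY id",
        (f"% {hostname} %",),
    ).fetchall()
    return [row[0] for row in rows]
```

A pattern with a leading `%` generally cannot use an ordinary index, so a query like this scans every row in `jobs`.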

Actions #6

Updated by Ian Colle about 10 years ago

  • Target version set to sprint3
Actions #7

Updated by Zack Cerza about 10 years ago

After some searching I found an operator that can actually query the content of the custom type we're using to store JSON objects:

>>> Job.query.filter(Job.targets.contains('ubuntu@plana32.front.sepia.ceph.com')).count()
2336L
>>> Job.query.filter(Job.targets.contains('plana32.front.sepia.ceph.com')).count()
0L
>>> Job.query.filter(Job.targets.contains('%@plana32.front.sepia.ceph.com')).count()
2336L

Unfortunately it's not fast and it won't let us run queries like "show me the machine with the most failures" with any sort of acceptable speed.

I may have to create a separate targets table - a larger undertaking, but one that is probably required for the queue functionality anyway.
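
As a rough sketch of why a separate table helps here: once each job-to-machine pairing is its own row, "show me the machine with the most failures" becomes a plain GROUP BY. The schema below (`jobs`, `nodes`, `job_nodes`) and the sample data are hypothetical and use SQLite for convenience; this is not the schema paddles ended up with.

```python
import sqlite3


def make_db():
    """Build a throwaway database with a hypothetical node association table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE job_nodes (
            job_id INTEGER REFERENCES jobs(id),
            node_id INTEGER REFERENCES nodes(id)
        );
        CREATE INDEX job_nodes_node_id ON job_nodes (node_id);
        """
    )
    conn.executemany(
        "INSERT INTO jobs (id, status) VALUES (?, ?)",
        [(1, "pass"), (2, "fail"), (3, "fail"), (4, "fail")],
    )
    conn.executemany(
        "INSERT INTO nodes (id, name) VALUES (?, ?)",
        [(1, "plana32"), (2, "plana41")],
    )
    # Job 2 ran on both machines; jobs 3 and 4 only on plana41.
    conn.executemany(
        "INSERT INTO job_nodes (job_id, node_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (2, 2), (3, 2), (4, 2)],
    )
    return conn


def failures_by_node(conn):
    """Return (node name, failure count) pairs, worst machine first."""
    return conn.execute(
        """
        SELECT n.name, COUNT(*) AS failures
        FROM job_nodes AS jn
        JOIN jobs AS j ON j.id = jn.job_id
        JOIN nodes AS n ON n.id = jn.node_id
        WHERE j.status = 'fail'
        GROUP BY n.name
        ORDER BY failures DESC, n.name
        """
    ).fetchall()
```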

Actions #8

Updated by Zack Cerza about 10 years ago

Getting really close:

$ curl "http://localhost:8080/nodes/plana72.front.sepia.ceph.com/job_stats"
{"fail": 1, "unknown": 1, "running": 0, "dead": 0, "pass": 12}

(local development db; data not complete)
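
A minimal sketch of how a response of that shape could be assembled from a list of job statuses follows. The `job_stats` helper and the choice to count missing or unrecognized statuses as "unknown" are assumptions for the example, not a description of what paddles does internally.

```python
from collections import Counter

# The five keys seen in the job_stats response above.
STATUSES = ("fail", "unknown", "running", "dead", "pass")


def job_stats(statuses):
    """Count job statuses, always returning every key in STATUSES."""
    # Assumption: anything that is not a known status (including None)
    # is counted as "unknown".
    counts = Counter(s if s in STATUSES else "unknown" for s in statuses)
    return {status: counts[status] for status in STATUSES}
```

Filling in zeros for statuses that did not occur keeps the response shape stable, so a client can read `stats["dead"]` without checking whether the key exists.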

Actions #9

Updated by Zack Cerza about 10 years ago

Finished #7789 - will work on the Pulpito side at the next chance I get.

Actions #10

Updated by Ian Colle about 10 years ago

  • Target version changed from sprint3 to sprint4
Actions #11

Updated by Zack Cerza about 10 years ago

  • Status changed from New to In Progress
Actions #12

Updated by Ian Colle about 10 years ago

  • Story points set to 1.0
Actions #13

Updated by Ian Colle about 10 years ago

  • Story points changed from 1.0 to 8.0
Actions #14

Updated by Zack Cerza about 10 years ago

Almost done:
http://pulpito.front.sepia.ceph.com/stats/nodes?machine_type=vps

Things that still need work:
- Link to the view in pulpito's header (may need to reorganize the header for this)
- UI for selecting machine type to filter by

Actions #15

Updated by Zack Cerza about 10 years ago

Right now I'm trying to figure out why I'm getting 502 Bad Gateway errors when accessing /stats/nodes/ without specifying a machine type. The query is slow; I know that. But it shouldn't be failing.

I've tried tweaking various timeouts in the apache configuration, but when I checked whether paddles itself was timing out, it was. I'm not sure why that would be happening and I haven't yet found a reference to timeouts in its documentation.

I'd really like to get this fixed. Or bite the bullet and implement paging (which needs to happen anyway).

Actions #17

Updated by Zack Cerza about 10 years ago

Rewrote the query to use an entirely different approach, and also added a new argument controlling how far to look back in history. Default is 14 days. Much faster now. Just need to figure out how to expose the new feature in the UI.
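
The look-back idea can be sketched as below: restricting the aggregate to a recent window keeps the number of rows it has to touch bounded. This is only an illustration, not the rewritten paddles query. The flat `job_results` table, the `since_days` parameter name and the explicit `now` argument (passed in so the example is deterministic) are all assumptions.

```python
import sqlite3
from datetime import datetime, timedelta


def make_db():
    """Build a throwaway database with one row per (job, node) result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_results (node TEXT, status TEXT, updated TEXT)"
    )
    conn.executemany(
        "INSERT INTO job_results (node, status, updated) VALUES (?, ?, ?)",
        [
            ("plana32", "pass", "2014-03-30 08:00:00"),
            ("plana32", "fail", "2014-03-31 09:00:00"),
            ("plana32", "fail", "2014-03-01 08:00:00"),
            ("vpm001", "pass", "2014-02-20 08:00:00"),
        ],
    )
    return conn


def stats_by_node(conn, now, since_days=14):
    """Return {node: {status: count}} for results newer than the cutoff."""
    # Timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text, so plain string
    # comparison orders them chronologically.
    cutoff = (now - timedelta(days=since_days)).strftime("%Y-%m-%d %H:%M:%S")
    rows = conn.execute(
        "SELECT node, status, COUNT(*) FROM job_results "
        "WHERE updated >= ? GROUP BY node, status ORDER BY node, status",
        (cutoff,),
    )
    stats = {}
    for node, status, count in rows:
        stats.setdefault(node, {})[status] = count
    return stats
```

Called with `now=datetime(2014, 4, 1, 12, 0, 0)` and the default window, only the two late-March results for `plana32` are counted; widening `since_days` pulls the older rows back in.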

Actions #18

Updated by Zack Cerza about 10 years ago

Almost done - going to write a test or two before submitting a PR.

Actions #19

Updated by Zack Cerza about 10 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee changed from Zack Cerza to Alfredo Deza
Actions #20

Updated by Alfredo Deza about 10 years ago

  • Assignee deleted (Alfredo Deza)
Actions #21

Updated by Zack Cerza about 10 years ago

  • Status changed from Fix Under Review to Resolved
  • Assignee set to Zack Cerza