Tasks #15509: create a read-only access to paddles postgresql database - sepia - Ceph

Actions

Copy link

Tasks #15509

open

create a read-only access to paddles postgresql database

Added by Loïc Dachary about 8 years ago. Updated over 7 years ago.

Status:

New

Priority:

Normal

Assignee:

David Galloway

Category:

Target version:

% Done:

Tags:

Reviewed:

Affected Versions:

Description

For mirroring purposes.

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

<loicd> dmick: zackc I'd like to mirror the paddles database and make it accessible read-only. Would it be possible to get access to the paddles machine @ ovh & sepia for that purpose ? 
<loicd> dmick: zackc although I could mirror almost everything from the archive yaml files, the os_type/os_version are missing
<dmick> why do you want a mirror?  geo access speed?
<loicd> for everyone to access and query, people doing backport primarily
<loicd> dmick: the paddles server are not publicly accessible
-*- dmick has to remember teh architecture again
<loicd> dmick: jobs are stored in paddles by teuthology, pulpito queries paddles to display the web front end, paddles has no authentication mechanism whatsoever, no way to segregate read-only/read-write
<loicd> that's the short version of it ;-)
<dmick> so paddles's rest interface is accessible to sepia.  it's not 'public', but it's as public as sepia, right?
<loicd> yes, you can access it if you have access to sepia
<dmick> but you want people not on the vpn to have access to the data in it as well
<loicd> yes, read-only
<loicd> that's the most frequent blocker
<dmick> i'm not saying this is a bad idea, just being thick about why again
<loicd> well, if you're into clicking on a web page pulpito is great
<dmick> you want programmatic access to run information to write other tools?
<dmick> I mean that makes total sense
<dmick> just drawing a line from a to z
<loicd> dmick: http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_describe_a_test_result
<loicd> dmick: this is what we've been doing and keep doing
<loicd> semi-manual
<loicd> and only works if you have access to paddles
<loicd> dmick: we copy/paste the output into http://tracker.ceph.com/issues/14692#note-5
<loicd> and that gives us a convenient list of failures to comment on and a reminder of what's left to analyze
<dmick> so the answer is "yes, and the other tools are the things that maintain these tracker issues" 
<dmick> that's fine
<loicd> yes
<dmick> and for some reason there are people that want to run the tools that create these tracker issues but do not want/can't be on sepia
<dmick> surely this is doable
<loicd> thanks :-)
<dmick> but it's work, and it's something else to be maintained, so I'm making certain there's no reasonable alternative
<dmick> obviously you've thought about this a lot more than me
<loicd> oh, I'm going to maintain that
<dmick> so it's postgres.  do you know how one would do a snapshot so it's consistent?
<dmick> and is the right thing for it to be a "pull from mirror" or "push to mirror", I wonder?
<dmick> moreover, one wonders if there's already a 'backup' going on that could be leveraged, or if not, maybe there should be
<loicd> there is no need for that, select job where updated >= lasttime will be good enough. there only is a need to mirror finished jobs which makes things simpler.
<dmick> so....are you talking about a mirror, or are you talking about a periodic query?
<loicd> it's a period query which job is to mirror all finished jobs elsewhere 
<loicd> for forensic analysis
<dmick> ok.  but you're not planning on running a postgres or paddles instance against it, then?
<dmick> or are you?
<loicd> the idea is to mirror into a postgres instance and give people read-only access to that postgres instance
<dmick> so this may be my inexperience with DBMSes in general and postgres specifically
<dmick> but to me, "mirror" means "consistent dump of the entire db", and that's pretty different from "a periodic extract".  But, what I think of as a "periodic extract" isn't something you could set up a db server against either
<dmick> so let's say you got access to submit a SELECT against the running paddles postgres.  what exactly would you do with that output?
<loicd> my bad for suggesting a mirror which means an exact copy
<dmick> (and I suck at filling gaps in things I'm not familiar with)
<loicd> I meant extracting records and copying them over to another postgresql database
<dmick> how exactly would you do this extraction?  is the output of a SELECT sufficient to populate a working set of tables?  i would think not (cross-linked tables and indices, need for schema)
<loicd> select from the jobs table for the most part + select from the runs table for the branch name because it's not in the jobs table + extract the ceph / ceph-qa-suite from the config.yaml found in the archive_path for the job because it's not in the base (yet, pr on the way)
<loicd> s/from the config/sha1 from the config/
<joshd> it sounds like loicd only wants a known subset of the data from a single query, so schema, indices, etc. don't need copying
<dmick> but if it's going into a secondary readonly postgres
<dmick> doesn't it at least need schema?
<joshd> it would need to be created once, but it needn't be an exact clone of the db is what i'm getting at
<dmick> and it seems like it would be *way* easier to dump a backup and import it
<loicd> dmick: yes, I'll make a schema
<loicd> dmick: the dump is going to be in the vicinity of ~5GB, which is a bit much
<loicd> and it won't complement the data with what's currently missing (i.e. os_type/os_version)
-*- dmick tries and fails to look at the db
<loicd> the read-only shema will also distinguish from which origin the records come from so there is only place to query results from ovh, sepia and all short lived clusters (that's the added bonus, eventually)
<dmick> ok.  so you're really asking about having direct query access to both paddles' dbs and nothing else.  I wish, as usual, that we could make the central service work for all the needs we want it for, but it sounds like you're already deep into the plan
<dmick> so my current state is "I don't know enough about postgres access control and administration to even log in with psql for myself" 
<loicd> we've circled that problem for a very long time indeed ;-)
<dmick> but I'm pretty sure it has a fair amount of access control per-user
<dmick> I'm just not at all an expert on how to set it up
<dmick> (roles, in pg-ese, I guess)
<loicd> dmick: I suspect knowing how to do it won't be enough, you'd have to know how to do that the ansible way
<dmick> possibly
<dmick> if so it's in ceph-cm-ansible
<dmick> setup_db.yml, indeed

Actions

Copy link

Updated by Dan Mick about 8 years ago

After much research, I think the easy way to do this would be to create a user like paddles (which looks to have pretty much readonly access) for psql select use

Actions

Copy link

Updated by Loïc Dachary about 8 years ago

sudo -u postgres psql -c "CREATE USER rouser with PASSWORD 'rouser';" 
sudo -u postgres psql -c "GRANT SELECT ON ALL TABLES IN SCHEMA public TO rouser;"

alternatively a user can be granted readonly access with no password via pg_hba.conf (trust)

Actions

Copy link

Updated by Zack Cerza about 8 years ago

I really wanted to find a way to avoid having to duplicate the DB, but after talking with Loic I'm convinced it's the right thing to do - at least for now. It'll be pretty easy to grant SELECT access to the db. What we need to do is figure out where the mirroring stuff will run from, and what host it will mirror to. I suppose those two could be the same host, but I'll want to get David's opinion on that.

Actions

Copy link

Updated by David Galloway about 8 years ago

Zack Cerza wrote:

I really wanted to find a way to avoid having to duplicate the DB, but after talking with Loic I'm convinced it's the right thing to do - at least for now. It'll be pretty easy to grant SELECT access to the db. What we need to do is figure out where the mirroring stuff will run from, and what host it will mirror to. I suppose those two could be the same host, but I'll want to get David's opinion on that.

Is the tl;dr version of this just "We want a separate host with a mirror of paddles' db that we have read-only access to" ?

Actions

Copy link

Updated by David Galloway almost 8 years ago

Assignee set to David Galloway

Actions

Copy link

Updated by David Galloway over 7 years ago

Chatted with Loic more about this. Initially, my plan was to just create a public reverse proxy that would only allow GET http requests but Loic's desired use case is to bypass the API and query the database directly.

Actions

Copy link

Updated by David Galloway over 7 years ago

I'm aiming to get this done when the Sepia lab has to move. The tentative schedule for this is the second week of February.

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Infrastructure » sepia

Custom queries

Tasks #15509

create a read-only access to paddles postgresql database

Updated by Loïc Dachary about 8 years ago

Updated by Dan Mick about 8 years ago

Updated by Loïc Dachary about 8 years ago

Updated by Zack Cerza about 8 years ago

Updated by David Galloway about 8 years ago

Updated by David Galloway almost 8 years ago

Updated by David Galloway over 7 years ago

Updated by David Galloway over 7 years ago