Bug #19402 (closed): git.ceph.com instability

Added by David Galloway about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Category: Infrastructure Service
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

git.ceph.com has been failing to respond over HTTP a few times during early morning US hours over the past few weeks. This may not matter all that much since it's just the web UI, but we're unsure whether it's also affecting jobs.

I've gathered some stats on what's happening on the machine during these times. Here's a 10 minute snapshot during an HTTP outage: https://paste.fedoraproject.org/paste/pCTM-~6yo4LP5yg3MgjLdl5M1UNdIGYhyRLivL9gydE=

Compare to quiet/stable time: https://paste.fedoraproject.org/paste/H6ptu0dxeEXjXpmppzPuE15M1UNdIGYhyRLivL9gydE=
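
For reference, something along these lines (not the exact commands I used, so treat it only as a sketch) will capture a comparable 10 minute snapshot:

    # Sketch only: record ~10 minutes of CPU, memory, and disk activity.
    for i in $(seq 1 10); do
        date
        top -b -n 1 | head -20   # load average and busiest processes
        iostat -x 1 1            # per-device I/O utilization
        sleep 60
    done > snapshot.txt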

My current hypothesis is that the addition of jobs in OVH is too much for the host to handle on its existing storage. (http://tracker.ceph.com/issues/17415)

See http://tracker.ceph.com/projects/ceph-releases/wiki/Sepia/diff?utf8=%E2%9C%93&version=131&version_from=129&commit=View+differences for when OVH jobs were added.

Dan's going to do some testing to see if he can manually reproduce the system load and HTTP failures.

Actions #1

Updated by David Galloway about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Dan Mick about 7 years ago

  • Description updated (diff)

Started a loop doing 10 simultaneous "git clone git://git.ceph.com/ceph.git" runs, after having git gc'ed the repo, and eventually that made HTTP responses take 30s... so clearly we can choke the machine with fairly little load. I'm not certain which part is the worst, but the longest step seems to be the 'git pack-objects' for the clone (unsurprisingly).
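
Roughly this kind of loop, for the record (the /tmp paths are just illustrative, not the exact script):

    #!/bin/bash
    # Repeatedly spawn 10 parallel clones of ceph.git against the git daemon.
    while true; do
        for i in $(seq 1 10); do
            git clone git://git.ceph.com/ceph.git "/tmp/ceph-clone-$i" >/dev/null 2>&1 &
        done
        wait                      # let all 10 clones finish
        rm -rf /tmp/ceph-clone-*  # clean up before the next round
    done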

Actions #3

Updated by Dan Mick about 7 years ago

  • Description updated (diff)
Actions #4

Updated by Dan Mick about 7 years ago

It turns out that each git fetch looks like it buries a CPU in user mode (snarf a bunch, alloc/alloc/alloc, write a bunch; probably consistency checking and compression), and we've recently doubled the number of kraken jobs by duplicating them onto OVH at the exact same time of day. And it was a 6-CPU host, often 100% busy. So we've doubled the number of CPUs (virtual, of course) and we'll see if that helps out.
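
A quick way to watch this in real time (a sketch, not necessarily what we ran on the host) is to follow git processes with pidstat from the sysstat package:

    # Report CPU usage once a second for any process whose name contains "git".
    pidstat -u -C git 1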

We could also think about getting smarter about the 'git clone' that all the workunit tests do, since they're all 1) now cloning the entire Ceph repo, and 2) doing it to multiple client hosts. I'm not certain of the easiest way there, but that's a lot of duplicate work that loads the server.

Greg Farnum suggested a shallow clone might help, and indeed a --depth 1 clone goes MUCH faster than a full clone (7.6s vs 3m41s, ~5,000 objects vs ~500,000), so that's a good thing to pursue, I think.
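
The comparison boils down to something like this (directory names are illustrative; timings will vary):

    # Full clone: transfers the entire history (~500,000 objects).
    time git clone git://git.ceph.com/ceph.git ceph-full

    # Shallow clone: only the tip commit (~5,000 objects), which means far
    # less pack-objects work on the server side.
    time git clone --depth 1 git://git.ceph.com/ceph.git ceph-shallow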

Actions #6

Updated by Dan Mick about 7 years ago

https://github.com/ceph/ceph/pull/14214 is merged. It probably ought to be backported. Leaving this bug open to see whether it makes a difference.

Actions #7

Updated by David Galloway about 7 years ago

  • Status changed from New to Resolved

The machine's been stable since we added more vCPUs. I'm going to leave them in even with the --depth 1 fix for now. If we're hurting for vCPUs in the future, we'll consider taking some away then.
