Bug #19402 (closed): git.ceph.com instability

Added by David Galloway about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Normal
Category: Infrastructure Service
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

git.ceph.com has been failing to respond over HTTP a few times during early morning US hours over the past few weeks. This may not matter all that much since it's just the web UI, but we're unsure whether it's also affecting jobs.

I've gathered some stats on what's happening on the machine during these times. Here's a 10 minute snapshot during an HTTP outage: https://paste.fedoraproject.org/paste/pCTM-~6yo4LP5yg3MgjLdl5M1UNdIGYhyRLivL9gydE=

Compare to quiet/stable time: https://paste.fedoraproject.org/paste/H6ptu0dxeEXjXpmppzPuE15M1UNdIGYhyRLivL9gydE=
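
For reference, something along these lines (not the exact commands I used, so treat it only as a sketch) will capture a comparable 10 minute snapshot:

    # Sketch only: record ~10 minutes of CPU, memory, and disk activity.
    for i in $(seq 1 10); do
        date
        top -b -n 1 | head -20   # load average and busiest processes
        iostat -x 1 1            # per-device I/O utilization
        sleep 60
    done > snapshot.txt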

My current hypothesis is that the addition of jobs in OVH is too much for the host to handle on its existing storage. (http://tracker.ceph.com/issues/17415)

See http://tracker.ceph.com/projects/ceph-releases/wiki/Sepia/diff?utf8=%E2%9C%93&version=131&version_from=129&commit=View+differences for when OVH jobs were added.

Dan's going to do some testing to see if he can manually reproduce the system load and HTTP failures.

Actions #1

Updated by David Galloway about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Dan Mick about 7 years ago

  • Description updated (diff)

Started a loop doing 10 simultaneous "git clone git://git.ceph.com/ceph.git" runs, after having git gc'ed the repo, and eventually that made HTTP responses take 30s... so clearly we can choke the machine with fairly little load. I'm not certain which part is the worst, but the longest step seems to be the 'git pack-objects' for the clone (unsurprisingly).
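
Roughly this kind of loop, for the record (the /tmp paths are just illustrative, not the exact script):

    #!/bin/bash
    # Repeatedly spawn 10 parallel clones of ceph.git against the git daemon.
    while true; do
        for i in $(seq 1 10); do
            git clone git://git.ceph.com/ceph.git "/tmp/ceph-clone-$i" >/dev/null 2>&1 &
        done
        wait                      # let all 10 clones finish
        rm -rf /tmp/ceph-clone-*  # clean up before the next round
    done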

Actions #3

Updated by Dan Mick about 7 years ago

  • Description updated (diff)
Actions #4

Updated by Dan Mick about 7 years ago

It turns out that each git fetch looks like it buries a CPU in user mode (snarf a bunch, alloc/alloc/alloc, write a bunch; probably consistency checking and compression), and we've recently doubled the number of kraken jobs by duplicating them onto OVH at the exact same time of day. And it was a 6-CPU host, often 100% busy. So we've doubled the number of CPUs (virtual, of course) and we'll see if that helps out.
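
A quick way to watch this in real time (a sketch, not necessarily what we ran on the host) is to follow git processes with pidstat from the sysstat package:

    # Report CPU usage once a second for any process whose name contains "git".
    pidstat -u -C git 1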

We could also think about getting smarter about the 'git clone' that all the workunit tests do, since they're all 1) now cloning the entire Ceph repo, and 2) doing it to multiple client hosts. I'm not certain of the easiest way there, but that's a lot of duplicate work that loads the server.

Greg Farnum suggested a shallow clone might help, and indeed a --depth 1 clone goes MUCH faster than a full clone (7.6s vs 3m41s, ~5,000 objects vs ~500,000), so that's a good thing to pursue, I think.
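
The comparison boils down to something like this (directory names are illustrative; timings will vary):

    # Full clone: transfers the entire history (~500,000 objects).
    time git clone git://git.ceph.com/ceph.git ceph-full

    # Shallow clone: only the tip commit (~5,000 objects), which means far
    # less pack-objects work on the server side.
    time git clone --depth 1 git://git.ceph.com/ceph.git ceph-shallow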

Actions #6

Updated by Dan Mick about 7 years ago

https://github.com/ceph/ceph/pull/14214 is merged. It probably ought to be backported. Leaving this bug open to see whether it makes a difference.

Actions #7

Updated by David Galloway about 7 years ago

  • Status changed from New to Resolved

The machine's been stable since we added more vCPUs. I'm going to leave them in even with the --depth 1 fix for now. If we're hurting for vCPUs in the future, we'll consider taking some away then.
