Bug #14172

closed

https://jenkins.ceph.com/job/ceph-pull-requests/ aka make check fixes

Added by Loïc Dachary over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Workaround

Run run-make-check.sh ; sudo reboot to clean up lingering processes. This should allow the bot to run for weeks before it requires a re-image for some reason.
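
The workaround can be sketched in Python as follows. This is only an illustration of the "run the check, then reboot regardless of the result" logic implied by the `;` separator. The `./run-make-check.sh` path and the names `shell_runner` and `check_then_reboot` are assumptions made for this sketch, not part of the Jenkins job.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def shell_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status without raising on failure."""
    return subprocess.run(list(cmd)).returncode


def check_then_reboot(run: Runner = shell_runner) -> int:
    """Mimic `run-make-check.sh ; sudo reboot`.

    The reboot is issued whatever the check returned, which is what
    the `;` separator does in a shell.
    """
    # Hypothetical location of the script; adjust to the real checkout.
    status = run(["./run-make-check.sh"])
    # Unconditional: clears any process that survived the test run.
    run(["sudo", "reboot"])
    return status
```

Passing a fake `run` callable makes it possible to exercise the ordering without rebooting anything.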

Issues

  • running on unsupported operating systems (CentOS 6, precise and maybe others)
  • leftovers from a previous test (which should be removed when a new slave is provisioned for each test)
  • keep the last 300 jobs for forensic analysis (about one week worth)
  • re-enable the jenkins job
  • disable reporting to GitHub pull requests so that the stability of the run can be verified without sending numerous false negatives in the meantime
Actions #1

Updated by Loïc Dachary over 8 years ago

  • Status changed from New to 12
  • Priority changed from Normal to Urgent

Setting to urgent because the absence of an automated make check has a noticeable daily impact on the work of Ceph developers.

Actions #2

Updated by Loïc Dachary over 8 years ago

  • Subject changed from https://jenkins.ceph.com/job/ceph-pull-requests/ fixes to https://jenkins.ceph.com/job/ceph-pull-requests/ aka make check fixes
<alfredodeza> loicd: CI is not meant to be openstack only, and it does involve a bit more work other than to just call an external API to terminate the instance
<alfredodeza> the current state of CI is that there is no way to be able to spin up/down nodes
<loicd> alfredodeza: it's using OpenStack but is not able to spin up/down nodes ? How did that happen ? 
<loicd> isn't it what https://github.com/alfredodeza/mita/ is about ?
<alfredodeza> loicd: that is just one component of ci
<alfredodeza> which is meant to get nodes up and down
<alfredodeza> *but we are not there yet*
<alfredodeza> it needs to be worked on
<loicd> alfredodeza: in the meantime I believe it would be enough to just reboot after the test. There is a *very* small chance that the test will taint the file system (i.e. fill / or something hard to recover from). No more chance than any Python test really. There is however a *high* chance that a process survives the test because of a bug, and the reboot can take care of that. The better solution would be to re-image every time, but rebooting will allow the bot to resume service efficiently.
<loicd> alfredodeza: will jenkins automatically reconnect to the slave after a reboot ? or does it need some extra infrastructure work ?
<alfredodeza> automatic
<loicd> cool then :-)
Actions #3

Updated by Loïc Dachary over 8 years ago

  • Description updated (diff)
Actions #6

Updated by Andrew Schoen over 8 years ago

  • Status changed from Resolved to In Progress

The approach we took with two shell builders won't work: if the first builder (the make check) fails, the second (the reboot) won't run.

Actions #7

Updated by Andrew Schoen over 8 years ago

If we move the reboot to a postbuildscript we can always report the correct status to GitHub and reboot the node only on failures.
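
A minimal Python model of that behaviour, assuming build steps run in order and stop at the first failure, while a post-build step runs afterwards whatever the outcome. It is not the Jenkins API and not the content of the pull request below. The names `run_builders` and `run_job` are hypothetical.

```python
from typing import Callable, List, Tuple

# A build step is a name plus a callable returning True on success.
Step = Tuple[str, Callable[[], bool]]


def run_builders(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run build steps in order and stop at the first one that fails."""
    ran: List[str] = []
    for name, step in steps:
        ran.append(name)
        if not step():
            return False, ran  # later steps never run
    return True, ran


def run_job(builders: List[Step], reboot: Callable[[], None]) -> Tuple[str, List[str]]:
    """Run the builders, then a post-build step that reboots only on failure."""
    ok, ran = run_builders(builders)
    # The reported status comes from the builders alone.
    status = "success" if ok else "failure"
    # Post-build step: reached whether or not the builders succeeded.
    if not ok:
        reboot()
        ran.append("reboot")
    return status, ran
```

With the reboot as a second builder, a failing make check means `run_builders` never reaches it. With the reboot in the post-build step, `run_job` returns `"failure"` and still reboots, and a passing run returns `"success"` without a reboot.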

https://github.com/ceph/ceph-build/pull/287

Actions #8

Updated by Andrew Schoen over 8 years ago

  • Status changed from In Progress to Fix Under Review
Actions #9

Updated by Andrew Schoen over 8 years ago

  • Status changed from Fix Under Review to Resolved

https://github.com/ceph/ceph-build/commit/dcb9adf981ae8f69c63b933a6aaaffadf1d557ab

We also had to change the number of executors on the node that will run this job to 1 so that the node won't reboot while another job is running on the same node. That was done here: https://jenkins.ceph.com/computer/centos7+158.69.77.220/configure
