Bug #1989: teuthology: error in ceph.log didn't make teutholgy return error code - Ceph - Ceph

Actions

Copy link

Bug #1989

closed

teuthology: error in ceph.log didn't make teutholgy return error code

Added by Sage Weil about 12 years ago. Updated about 12 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Josh Durgin

Category:

teuthology

Target version:

% Done:

Source:

Development

Tags:

Backport:

Regression:

Severity:

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

in a bash while loop, I saw

INFO:teuthology.task.ceph:Checking cluster ceph.log for badness...
WARNING:teuthology.task.ceph:Found errors (ERR|WRN|SEC) in cluster log
INFO:teuthology.task.ceph:Cleaning ceph cluster...
INFO:teuthology.task.ceph:Removing ceph binaries...
INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump...
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.task.internal:Removing archive directory...
INFO:teuthology.task.internal:Tidying up after the test...
INFO:teuthology.run:Duration was 1216.594967 seconds
+ date

and the calling script blithely continued. that was

while bin/teuthology $job $2 $3 $4
do
    date
    N=$(($N+1))
    echo "$job: $N passes" 
    title
done
echo "$job: $N passes, then failure."

Actions

Copy link

Updated by Greg Farnum about 12 years ago

I thought we turned this off on purpose because thrashing always triggered it. Am I remembering incorrectly?

Actions

Copy link

Updated by Sage Weil about 12 years ago

we whitelist log entries. it only prints that (and sets success=False) if it sees something unexpected

        log.info('Checking cluster ceph.log for badness...')
        def first_in_ceph_log(pattern, excludes):
            args = [
                'egrep', pattern,
                '/tmp/cephtest/data/%s/log' % firstmon,
                ]
            for exclude in excludes:
                args.extend([run.Raw('|'), 'egrep', '-v', exclude])
            args.extend([
                    run.Raw('|'), 'head', '-n', '1',
                    ])
            r = mon0_remote.run(
                stdout=StringIO(),
                args=args,
                )
            stdout = r.stdout.getvalue()
            if stdout != '':
                return stdout
            return None

        if first_in_ceph_log('\[ERR\]|\[WRN\]|\[SEC\]',
                             config['log_whitelist']) is not None:
            log.warning('Found errors (ERR|WRN|SEC) in cluster log')
            ctx.summary['success'] = False
            # use the most severe problem as the failure reason
            if 'failure_reason' not in ctx.summary:
                for pattern in ['\[SEC\]', '\[ERR\]', '\[WRN\]']:
                    match = first_in_ceph_log(pattern, config['log_whitelist'])
                    if match is not None:
                        ctx.summary['failure_reason'] = \
                            '"{match}" in cluster log'.format(
                            match=match.rstrip('\n'),
                            )
                        break

the problem i see tho is that it prints errors found but doesn't return an error code

Actions

Copy link